This dataset contains two tables: creative_stats and removed_creative_stats.

The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first-shown and last-shown dates, which criteria were used in audience selection, the format of the ad, the ad topic, and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided.

The removed_creative_stats table contains information about ads that served in the European Economic Area and that Google removed: where and why they were removed, and per-region information on when they served. It also contains a link to the Google Ads Transparency Center entry for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

About BigQuery: This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing: each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery?

Download Dataset: This public dataset is also hosted in Google Cloud Storage here and available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset.
A README file describing the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results of a custom query; see here for options and instructions. Signed-out users can download the full dataset by using the gcloud CLI. Follow the instructions here to download and install the gcloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True". To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R".
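Once downloaded, the sharded JSON can be processed with standard tooling. A minimal Python sketch for reading newline-delimited records follows; the field names in the sample record are illustrative assumptions, not the dataset's actual schema — the bundled README documents the real structure.

```python
import json

def load_records(lines):
    """Parse newline-delimited JSON records from a downloaded shard."""
    return [json.loads(line) for line in lines if line.strip()]

# Illustrative record; consult the dataset README for the actual field names.
sample_shard = ['{"advertiser_name": "Example Corp", "region": "EEA", "impressions_low": 1000}']
records = load_records(sample_shard)
print(records[0]["advertiser_name"])  # -> Example Corp
```

In practice you would iterate over each shard file with `open(path)` and feed its lines to the same helper.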
https://cdla.io/sharing-1-0/
This dataset highlights the importance of effectively using Google Cloud Platform (GCP) billing data to gain actionable insights into cloud spending. It emphasizes strategic cost management, offering guidance on analyzing billing data, optimizing resource usage, and implementing best practices to minimize costs while maximizing the value derived from cloud services. It is geared towards businesses and technical teams looking to maintain financial control and improve their cloud operations.
This dataset contains GCP cloud billing cost data. For an updated version, leave a comment or get in touch.
MultiversX is a highly scalable, secure and decentralized blockchain network created to enable radically new applications for users, businesses, society, and the new metaverse frontier. This dataset is one of many crypto datasets that are available within Google Cloud Public Datasets. As with other Google Cloud public datasets, you can query this dataset for free, up to 1TB/month of free processing, every month. Watch this short video to learn how to get started with the public datasets.
In partnership with the Harvard Global Health Institute, Google Cloud is releasing the COVID-19 Public Forecasts to serve as an additional resource for first responders in healthcare, the public sector, and other impacted organizations preparing for what lies ahead. These forecasts are available for free and provide a projection of COVID-19 cases, deaths, and other metrics over the next 14 days for US counties and states. For more info, see https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-is-releasing-the-covid-19-public-forecasts and https://storage.googleapis.com/covid-external/COVID-19ForecastWhitePaper.pdf
A projection of COVID-19 cases, deaths, and other metrics over the next 14 days for US counties and states
Released on BigQuery by Google Cloud:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Google Patents Public Data, provided by IFI CLAIMS Patent Services, is a worldwide bibliographic and US full-text dataset of patent publications. Patent information accessibility is critical for examining new patents, informing public policy decisions, managing corporate investment in intellectual property, and promoting future scientific innovation. The growing number of available patent data sources means researchers often spend more time downloading, parsing, loading, syncing and managing local databases than conducting analysis. With these new datasets, researchers and companies can access the data they need from multiple sources in one place, thus spending more time on analysis than data preparation.
The Google Patents Public Data dataset contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:patents
“Google Patents Public Data” by IFI CLAIMS Patent Services and Google is licensed under a Creative Commons Attribution 4.0 International License.
Banner photo by Helloquence on Unsplash
This table contains release notes for the majority of generally available Google Cloud products found on cloud.google.com. You can use this BigQuery public dataset to consume release notes programmatically across all products. HTML versions of release notes are available within each product's documentation and also in a filterable format at https://console.cloud.google.com/release-notes . This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing.
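A sketch of what programmatic consumption could look like: a helper that builds a query string for the release-notes table, with the google-cloud-bigquery client call left commented out. The table path and column names here are assumptions — verify them against the dataset's schema in the BigQuery console before running.

```python
def release_notes_query(product: str, limit: int = 10) -> str:
    """Build a SQL string against the public release-notes table.
    Table path and column names are assumptions; check the schema first."""
    return (
        "SELECT product_name, description, published_at "
        "FROM `bigquery-public-data.google_cloud_release_notes.release_notes` "
        f"WHERE product_name = '{product}' "
        f"ORDER BY published_at DESC LIMIT {limit}"
    )

# Running the query needs a GCP project (it counts against the 1TB/mo free tier):
# from google.cloud import bigquery
# client = bigquery.Client()
# for row in client.query(release_notes_query("BigQuery")):
#     print(row.published_at, row.description)
print(release_notes_query("BigQuery"))
```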
CSV version of the Looker Ecommerce Dataset.
Overview: Dataset in BigQuery. TheLook is a fictitious eCommerce clothing site developed by the Looker team. The dataset contains information about customers, products, orders, logistics, web events and digital marketing campaigns. The contents of this dataset are synthetic, and are provided to industry practitioners for the purpose of product discovery, testing, and evaluation. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing.
distribution_centers.csv
- id: Unique identifier for each distribution center.
- name: Name of the distribution center.
- latitude: Latitude coordinate of the distribution center.
- longitude: Longitude coordinate of the distribution center.

events.csv
- id: Unique identifier for each event.
- user_id: Identifier for the user associated with the event.
- sequence_number: Sequence number of the event.
- session_id: Identifier for the session during which the event occurred.
- created_at: Timestamp indicating when the event took place.
- ip_address: IP address from which the event originated.
- city: City where the event occurred.
- state: State where the event occurred.
- postal_code: Postal code of the event location.
- browser: Web browser used during the event.
- traffic_source: Source of the traffic leading to the event.
- uri: Uniform Resource Identifier associated with the event.
- event_type: Type of event recorded.

inventory_items.csv
- id: Unique identifier for each inventory item.
- product_id: Identifier for the associated product.
- created_at: Timestamp indicating when the inventory item was created.
- sold_at: Timestamp indicating when the item was sold.
- cost: Cost of the inventory item.
- product_category: Category of the associated product.
- product_name: Name of the associated product.
- product_brand: Brand of the associated product.
- product_retail_price: Retail price of the associated product.
- product_department: Department to which the product belongs.
- product_sku: Stock Keeping Unit (SKU) of the product.
- product_distribution_center_id: Identifier for the distribution center associated with the product.

order_items.csv
- id: Unique identifier for each order item.
- order_id: Identifier for the associated order.
- user_id: Identifier for the user who placed the order.
- product_id: Identifier for the associated product.
- inventory_item_id: Identifier for the associated inventory item.
- status: Status of the order item.
- created_at: Timestamp indicating when the order item was created.
- shipped_at: Timestamp indicating when the order item was shipped.
- delivered_at: Timestamp indicating when the order item was delivered.
- returned_at: Timestamp indicating when the order item was returned.

orders.csv
- order_id: Unique identifier for each order.
- user_id: Identifier for the user who placed the order.
- status: Status of the order.
- gender: Gender information of the user.
- created_at: Timestamp indicating when the order was created.
- returned_at: Timestamp indicating when the order was returned.
- shipped_at: Timestamp indicating when the order was shipped.
- delivered_at: Timestamp indicating when the order was delivered.
- num_of_item: Number of items in the order.

products.csv
- id: Unique identifier for each product.
- cost: Cost of the product.
- category: Category to which the product belongs.
- name: Name of the product.
- brand: Brand of the product.
- retail_price: Retail price of the product.
- department: Department to which the product belongs.
- sku: Stock Keeping Unit (SKU) of the product.
- distribution_center_id: Identifier for the distribution center associated with the product.

users.csv
- id: Unique identifier for each user.
- first_name: First name of the user.
- last_name: Last name of the user.
- email: Email address of the user.
- age: Age of the user.
- gender: Gender of the user.
- state: State where t...
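Because the files share id-based keys, the CSVs can be joined locally with the standard library. A minimal Python sketch joining order_items.csv rows to products.csv on product_id; the inline sample data is illustrative and uses only a subset of the columns listed above.

```python
import csv
import io

def join_order_products(order_items_csv: str, products_csv: str):
    """Join order_items rows to product rows on product_id (schema above)."""
    products = {row["id"]: row for row in csv.DictReader(io.StringIO(products_csv))}
    joined = []
    for item in csv.DictReader(io.StringIO(order_items_csv)):
        product = products.get(item["product_id"], {})
        joined.append({**item, "product_name": product.get("name")})
    return joined

# Tiny illustrative inputs covering a subset of the columns:
orders = "id,order_id,product_id\n1,100,p1\n"
prods = "id,name\np1,T-Shirt\n"
print(join_order_products(orders, prods)[0]["product_name"])  # -> T-Shirt
```

For the real files, replace the inline strings with `open("order_items.csv").read()` and so on, or stream with `csv.DictReader(open(path))` to avoid loading everything into memory.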
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.
Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.
Fork this kernel to get started with this dataset.
Dataset Source: https://archive.org/download/stackexchange
https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow
https://cloud.google.com/bigquery/public-data/stackoverflow
Banner Photo by Caspar Rubin from Unsplash.
What is the percentage of questions that have been answered over the years?
What is the reputation and badge count of users across different tenures on Stack Overflow?
What are 10 of the “easier” gold badges to earn?
Which day of the week has most questions answered within an hour?
The Metropolitan Museum of Art, better known as the Met, provides a public domain dataset with over 200,000 objects, including metadata and images. In early 2017, the Met debuted their Open Access policy to make part of their collection freely available for unrestricted use under the Creative Commons Zero designation and their own terms and conditions. This dataset provides a new view into one of the world’s premier collections of fine art. The data includes both images in Google Cloud Storage and associated structured data in two BigQuery tables, objects and images (1:N). Locations of images, both on the Met’s website and in Google Cloud Storage, are available in the BigQuery tables. The metadata for this public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. The image data for this public dataset is hosted in Google Cloud Storage and available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage.
Company Datasets for valuable business insights!
Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.
These datasets are sourced from top industry providers, ensuring you have access to high-quality information:
We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:
You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.
Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.
With Oxylabs Datasets, you can count on:
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and now has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

We've made available a number of tables (explained in detail below):

- history_* tables: full history of OSM objects.
- planet_* tables: snapshot of current OSM objects as of Nov 2019.

The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types, plus an additional changeset table corresponding to OSM edits, for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated on with the built-in geography functions to perform geometry and feature selection and additional processing. Example analyses are given below.

This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program. OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing.
https://creativecommons.org/publicdomain/zero/1.0/
NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/
Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:
Over 8 million 311 service requests from 2012-2016
More than 1 million motor vehicle collisions 2012-present
Citi Bike stations and 30 million Citi Bike trips 2013-present
Over 1 billion Yellow and Green Taxi rides from 2009-present
Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015
This dataset is deprecated and not being updated.
Fork this kernel to get started with this dataset.
https://opendata.cityofnewyork.us/
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.
The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.
Banner Photo by @bicadmedia from Unsplash.
On which New York City streets are you most likely to find a loud party?
Can you find the Virginia Pines in New York City?
Where was the only collision caused by an animal that injured a cyclist?
What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?
In an effort to help combat COVID-19, we created a COVID-19 Public Datasets program to make data more accessible to researchers, data scientists and analysts. The program hosts a repository of public datasets that relate to the COVID-19 crisis and makes them free to access and analyze. These include datasets from the New York Times, the European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, and OpenStreetMap.

Free hosting and queries of COVID datasets: As with all data in the Google Cloud Public Datasets Program, Google pays for storage of datasets in the program. BigQuery also provides free queries over certain COVID-related datasets to support the response to COVID-19. Queries on COVID datasets will not count against the BigQuery sandbox free tier, where you can query up to 1TB free each month.

Limitations and duration: Queries of COVID data are free. If, during your analysis, you join COVID datasets with non-COVID datasets, the bytes processed in the non-COVID datasets will be counted against the free tier, then charged accordingly, to prevent abuse. Queries of COVID datasets will remain free until Sept 15, 2021.

The contents of these datasets are provided to the public strictly for educational and research purposes only. We are not onboarding or managing PHI or PII data as part of the COVID-19 Public Dataset Program. Google has practices and policies in place to ensure that data is handled in accordance with widely recognized patient privacy and data security policies. See the list of all datasets included in the program.
https://creativecommons.org/publicdomain/zero/1.0/
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was featured further in a Kaggle blog post here.
See here for more information.
This dataset is a mirror of the original arXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in JSON format. This file contains an entry for each paper, containing:
- id: arXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: Digital Object Identifier (https://www.doi.org)
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}: Direct link to download the PDF
The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).
You can use, for example, gsutil to download the data to your local machine:

```
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
```
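The metadata file is newline-delimited JSON, so each line is an independent record. A minimal Python sketch that parses one entry and builds the access URLs described above from its id field; the sample record is illustrative and uses only a subset of the listed fields.

```python
import json

def paper_links(metadata_line: str) -> dict:
    """Parse one metadata record and build its arXiv URLs from the id field."""
    entry = json.loads(metadata_line)
    return {
        "title": entry["title"],
        "abs": f"https://arxiv.org/abs/{entry['id']}",
        "pdf": f"https://arxiv.org/pdf/{entry['id']}",
    }

# Illustrative record with a subset of the fields described above:
line = '{"id": "2003.00001", "title": "An Example Paper", "categories": "cs.LG"}'
print(paper_links(line)["abs"])  # -> https://arxiv.org/abs/2003.00001
```

Applied to the real metadata file, you would iterate over its lines with `for line in open(path)` and call the same helper on each.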
We're automatically updating the metadata as well as the GCS bucket on a weekly basis.
Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.
The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.
We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
http://opendatacommons.org/licenses/dbcl/1.0/
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and now has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.
To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To give the Kaggle community access to the BigQuery dataset, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.
This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.
Tables:
- history_* tables: full history of OSM objects.
- planet_* tables: snapshot of current OSM objects as of Nov 2019.
The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types, plus an additional changeset table corresponding to OSM edits, for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated on with the built-in geography functions to perform geometry and feature selection and additional processing.
You can read more about OSM elements on the OSM Wiki. This dataset uses BigQuery GEOGRAPHY datatype which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.
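For instance, BigQuery's ST_DWITHIN geography function tests whether two features lie within a given distance of each other. As a local, dependency-free illustration of the same spatial predicate, here is a sketch using the haversine great-circle formula (an approximation on a spherical Earth; BigQuery uses its own spheroid model, so results can differ slightly near the threshold):

```python
import math

def st_dwithin(lon1, lat1, lon2, lat2, meters):
    """Local stand-in for BigQuery's ST_DWITHIN: True if two lon/lat points
    lie within `meters` of each other, by the haversine formula."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a)) <= meters

# Brandenburg Gate vs. Berlin TV Tower: roughly 2 km apart.
print(st_dwithin(13.3777, 52.5163, 13.4094, 52.5208, 3000))  # -> True
```

In BigQuery itself the equivalent filter would be written directly in SQL, e.g. `WHERE ST_DWITHIN(geometry, ST_GEOGPOINT(lon, lat), 3000)`.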
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset contains augmented software architecture diagrams specifically focused on cloud service components from major cloud providers (AWS, Azure, GCP). The dataset was developed as part of a FIAP POS Tech Hackathon project for training computer vision models to detect and classify cloud architecture components in technical diagrams.
The dataset was created using a sophisticated augmentation pipeline that applies multiple transformations while preserving bounding box annotations:
This dataset is ideal for:
- Object detection in software architecture diagrams
- Cloud service recognition in technical documentation
- Automated diagram analysis tools
- Computer vision research in technical domains
- Training custom models for architecture diagram parsing
dataset_augmented/
├── image_xpto.png # Augmented PNG images
├── image_xpto.xml # Pascal VOC XML files
Perfect for training:
- YOLO object detection models
- Faster R-CNN for precise component detection
- Custom CNN architectures for diagram analysis
- Multi-class classification models
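Since the annotations are Pascal VOC XML, bounding boxes can be extracted with Python's standard library before feeding a training pipeline. A minimal sketch; the class name aws_s3 in the sample annotation is illustrative, not taken from the dataset's actual label set.

```python
import xml.etree.ElementTree as ET

def parse_voc(xml_text: str):
    """Extract (label, xmin, ymin, xmax, ymax) tuples from a Pascal VOC file."""
    boxes = []
    for obj in ET.fromstring(xml_text).iter("object"):
        b = obj.find("bndbox")
        boxes.append((
            obj.findtext("name"),
            int(b.findtext("xmin")), int(b.findtext("ymin")),
            int(b.findtext("xmax")), int(b.findtext("ymax")),
        ))
    return boxes

# Illustrative annotation for one cloud-service icon:
xml = """<annotation><object><name>aws_s3</name>
<bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>120</ymax></bndbox>
</object></annotation>"""
print(parse_voc(xml))  # -> [('aws_s3', 10, 20, 110, 120)]
```

The same helper can convert the whole dataset to YOLO-style label files by normalizing the box coordinates by image width and height.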
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.
This Kaggle Dataset along with the KaggleX Learning Path Index GitHub Repo were created by the mentors and mentees of Cohort 3 KaggleX BIPOC Mentorship Program (between August 2023 and November 2023, also see this). See Credits section at the bottom of the long description.
This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program brainstorming and feedback session. It was one of the ideas to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.
This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.
The mentors and mentees communicated via Discord, Trello, Google Hangouts, and other channels to put together these artifacts, and made them public for everyone to use and contribute back to.
The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.
The dataset encompasses the following attributes:
The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.
Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains Ethereum transaction data from September 1, 2022 to September 11, 2022.
This data is downloaded directly from Google BigQuery - Ethereum, a public dataset that provides comprehensive blockchain data.
Free and Open Source: This dataset is released under the CC0-1.0 (Creative Commons Zero) license, meaning it is in the public domain and free to use for any purpose without restrictions.
The dataset includes detailed transaction information from the Ethereum blockchain during the specified time period, including transaction hashes, addresses, values, gas fees, and timestamps.
ethtx_220901_220911_*: Ethereum transaction data files split into 41 shards (~9.6GB total)
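As a small worked example of the gas-fee fields: the fee paid by a transaction is the gas actually consumed multiplied by the gas price, denominated in wei (1 ETH = 10^18 wei). A Python sketch; the field names mirror the public BigQuery Ethereum transactions schema but should be verified against it.

```python
WEI_PER_ETH = 10**18

def tx_fee_eth(receipt_gas_used: int, gas_price_wei: int) -> float:
    """Transaction fee in ETH: gas actually used times the gas price.
    Field names are assumptions modeled on the BigQuery Ethereum schema."""
    return receipt_gas_used * gas_price_wei / WEI_PER_ETH

# A plain ETH transfer uses 21000 gas; at 20 gwei that costs 0.00042 ETH.
print(tx_fee_eth(21000, 20 * 10**9))  # -> 0.00042
```

The same arithmetic applies per record when aggregating fees across the 41 shards.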
https://spdx.org/licenses/CC0-1.0.html
Landsat and Sentinel-2 acquisitions are among the most frequently used medium-resolution (i.e., 10-30 m) optical data. The data are extensively used in terrestrial vegetation applications, including, but not limited to, land cover and land use mapping, vegetation condition and phenology monitoring, and disturbance and change mapping. While the Landsat archives alone provide over 40 years, and counting, of continuous and consistent observations, since mid-2015 Sentinel-2 has enabled a revisit frequency of up to 2 days. Although the spatio-temporal availability of both data archives is well-known at the scene level, information on the actual availability of usable (i.e., cloud-, snow-, and shade-free) observations at the pixel level needs to be explored for each study to ensure correct parametrization of used algorithms, and thus the robustness of subsequent analyses. However, a priori data exploration is time- and resource-consuming, and thus is rarely performed. As a result, the spatio-temporal heterogeneity of usable data is often inadequately accounted for in the analysis design, risking ill-advised selection of algorithms and hypotheses, and thus inferior quality of final results.

Here we present a global dataset comprising precomputed daily availability of usable Landsat and Sentinel-2 data sampled at a pixel level in a regular 0.18°-point grid. We based the dataset on the complete 1982-2024 Landsat surface reflectance data (Collection 2) and 2015-2024 Sentinel-2 top-of-the-atmosphere reflectance scenes (pre-Collection-1 and Collection-1). Derivation of cloud-, snow-, and shade-free observations followed the methodology developed in our recent study on data availability over Europe (Lewińska et al., 2023; https://doi.org/10.20944/preprints202308.2174.v2). Furthermore, we expanded the dataset with growing season information derived from the 2001-2019 time series of the yearly 500 m MODIS land cover dynamics product (MCD12Q2; Collection 6).
As such, our dataset presents a unique overview of the spatio-temporal availability of usable daily Landsat and Sentinel-2 data at the global scale, offering much-needed a priori information that aids the identification of appropriate methods and challenges for terrestrial vegetation analyses at local to global scales. The dataset can be viewed using the dedicated GEE App (link in Related Works). As of February 2025, the dataset has been extended with the 2024 data.
Methods
We based our analyses on the freely and openly accessible Landsat and Sentinel-2 data archives available in Google Earth Engine (Gorelick et al., 2017). We used all Landsat surface reflectance Level 2, Tier 1, Collection 2 scenes acquired with the Thematic Mapper (TM) (Earth Resources Observation And Science (EROS) Center, 1982), Enhanced Thematic Mapper Plus (ETM+) (Earth Resources Observation And Science (EROS) Center, 1999), and Operational Land Imager (OLI) (Earth Resources Observation And Science (EROS) Center, 2013) sensors between 22nd August 1982 and 31st December 2024, and Sentinel-2 TOA reflectance Level-1C scenes (pre‑Collection-1 (European Space Agency, 2015, 2021) and Collection-1 (European Space Agency, 2022)) acquired with the MultiSpectral Instrument (MSI) between 23rd June 2015 and 31st December 2024. We implemented conservative pixel-quality screening to identify cloud-, snow-, and shade-free land pixels. For the Landsat time series, we relied on the inherent pixel quality bands (Foga et al., 2017; Zhu & Woodcock, 2012), excluding all pixels flagged as cloud, snow, or shadow, as well as pixels with the fill-in value of 20,000 (scale factor 0.0001; Zhang et al., 2022). Furthermore, due to the Landsat 7 orbit drift (Qiu et al., 2021), we excluded all ETM+ scenes acquired after 31st December 2020.
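The Landsat screening described above reduces to a per-pixel predicate over the QA_PIXEL quality band and the surface reflectance value. The sketch below is illustrative, not the authors' code: the bit positions follow the USGS Collection 2 Level-2 QA_PIXEL convention (bit 0 fill, bit 3 cloud, bit 4 cloud shadow, bit 5 snow), and the function and variable names are ours.

```python
# Sketch of the Landsat per-pixel usability test (assumed QA_PIXEL bit layout
# per USGS Collection 2 Level-2 documentation; helper names are illustrative).
QA_BITS = {"fill": 0, "cloud": 3, "cloud_shadow": 4, "snow": 5}

def bit_set(qa: int, bit: int) -> bool:
    """Return True if the given bit is set in the QA_PIXEL value."""
    return (qa >> bit) & 1 == 1

def landsat_usable(qa_pixel: int, sr_value: int) -> bool:
    """True for a cloud-, snow-, and shade-free pixel that is not fill.

    sr_value is the raw surface reflectance DN; 20,000 is the fill-in value
    noted in the text (scale factor 0.0001).
    """
    if sr_value == 20_000:
        return False
    return not any(bit_set(qa_pixel, b) for b in QA_BITS.values())
```

A pixel is kept only when no flagged bit is set and the reflectance value is not the fill value; in practice the same logic is applied as a raster mask rather than per scalar.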
Because Sentinel-2 Level-2A quality masks lack the desired scope and accuracy (Baetens et al., 2019; Coluzzi et al., 2018), we resorted to Level-1C scenes accompanied by the supporting Cloud Probability product. Furthermore, we employed a selection of conditions, including a threshold on Band 10 (SWIR-Cirrus), which is not available at Level‑2A. Overall, our Sentinel-2-specific cloud, shadow, and snow screening comprised:
- exclusion of all pixels flagged as clouds or cirrus in the inherent ‘QA60’ cloud mask band;
- exclusion of all pixels with cloud probability >50%, as defined in the corresponding Cloud Probability product available for each scene;
- exclusion of cirrus clouds (B10 reflectance >0.01);
- exclusion of clouds based on Cloud Displacement Analysis (CDI <‑0.5) (Frantz et al., 2018);
- exclusion of dark pixels (B8 reflectance <0.16) within cloud shadows modelled for each scene with scene‑specific sun parameters for the clouds identified in the previous steps, assuming a cloud height of 2,000 m;
- exclusion of pixels within a 40-m buffer (two pixels at 20-m resolution) around each identified cloud and cloud shadow object;
- exclusion of snow pixels identified with the snow mask branch of the Sen2Cor processor (Main-Knorn et al., 2017).
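Taken together, these conditions amount to a single per-pixel predicate. The sketch below encodes the thresholds listed above; it assumes the shadow, buffer, and snow memberships have already been computed upstream (as in the modelling steps described), and all names are illustrative rather than the authors' implementation.

```python
# Sketch of the combined Sentinel-2 screening (thresholds from the text;
# flag inputs are assumed to be precomputed upstream, names illustrative).
def s2_usable(qa60_cloud: bool, cloud_prob: float, b10: float, cdi: float,
              b8: float, in_modelled_shadow: bool, in_buffer: bool,
              sen2cor_snow: bool) -> bool:
    """True if a Sentinel-2 pixel passes every screening step."""
    if qa60_cloud:                        # QA60 cloud/cirrus flag set
        return False
    if cloud_prob > 50:                   # Cloud Probability product, percent
        return False
    if b10 > 0.01:                        # cirrus via Band 10 reflectance
        return False
    if cdi < -0.5:                        # Cloud Displacement Index (Frantz et al., 2018)
        return False
    if in_modelled_shadow and b8 < 0.16:  # dark pixel inside a modelled shadow
        return False
    if in_buffer:                         # within the 40-m cloud/shadow buffer
        return False
    if sen2cor_snow:                      # Sen2Cor snow mask branch
        return False
    return True
```

The ordering mirrors the list above, but because every condition is an independent exclusion, the checks could be evaluated in any order or vectorized over a raster.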
Through applying the data screening, we generated a collection of daily availability records for the Landsat and Sentinel-2 data archives. We next subsampled the resulting binary time series with a regular 0.18° x 0.18° point grid defined in the EPSG:4326 projection, obtaining 475,150 points located over land between 179.8867°W and 179.5733°E and between 83.50834°N and 59.05167°S. Owing to the substantial amount of data comprised in the Landsat and Sentinel-2 archives and the computationally demanding process of cloud, snow, and shade screening, we performed the subsampling in batches corresponding to a 4° x 4° regular grid and consolidated the final data in post-processing. We derived the pixel-specific growing season information from the 2001-2019 time series of the yearly 500‑m MODIS land cover dynamics product (MCD12Q2; Collection 6) available in Google Earth Engine. We used only the information on the start and the end of a growing season, excluding all pixels with quality below ‘best’. When a pixel went through more than one growing cycle per year, we approximated the growing season as the period between the beginning of the first growing cycle and the end of the last growing cycle. To fill in data gaps arising from low-quality data and insufficiently pronounced seasonality (Friedl et al., 2019), we applied a 5x5 mean moving window filter to ensure better spatial continuity of our growing season datasets. Following Lewińska et al. (2023), we defined the start of the season as the pixel-specific 25th percentile of the 2001-2019 distribution of start-of-season dates, and the end of the season as the pixel-specific 75th percentile of the 2001-2019 distribution of end-of-season dates. Finally, we subsampled the start and end of season datasets with the same regular 0.18° x 0.18° point grid defined in the EPSG:4326 projection.
References:
Baetens, L., Desjardins, C., & Hagolle, O. (2019). Validation of Copernicus Sentinel-2 Cloud Masks Obtained from MAJA, Sen2Cor, and FMask Processors Using Reference Cloud Masks Generated with a Supervised Active Learning Procedure. Remote Sensing, 11(4), 433. https://doi.org/10.3390/rs11040433
Coluzzi, R., Imbrenda, V., Lanfredi, M., & Simoniello, T. (2018). A first assessment of the Sentinel-2 Level 1-C cloud mask product to support informed surface analyses. Remote Sensing of Environment, 217, 426–443. https://doi.org/10.1016/j.rse.2018.08.009
Earth Resources Observation And Science (EROS) Center. (1982). Collection-2 Landsat 4-5 Thematic Mapper (TM) Level-1 Data Products [dataset]. U.S. Geological Survey. https://doi.org/10.5066/P918ROHC
Earth Resources Observation And Science (EROS) Center. (1999). Collection-2 Landsat 7 Enhanced Thematic Mapper Plus (ETM+) Level-1 Data Products [dataset]. U.S. Geological Survey. https://doi.org/10.5066/P9TU80IG
Earth Resources Observation And Science (EROS) Center. (2013). Collection-2 Landsat 8-9 OLI (Operational Land Imager) and TIRS (Thermal Infrared Sensor) Level-1 Data Products [dataset]. U.S. Geological Survey. https://doi.org/10.5066/P975CC9B
European Space Agency. (2015). Sentinel-2 MSI Level-1C TOA Reflectance [dataset]. European Space Agency. https://doi.org/10.5270/S2_-d8we2fl
European Space Agency. (2021). Sentinel-2 MSI Level-1C TOA Reflectance, Collection 0 [dataset]. European Space Agency. https://doi.org/10.5270/S2_-d8we2fl
European Space Agency. (2022). Sentinel-2 MSI Level-1C TOA Reflectance [dataset]. European Space Agency. https://doi.org/10.5270/S2_-742ikth
Foga, S., Scaramuzza, P. L., Guo, S., Zhu, Z., Dilley, R. D., Beckmann, T., Schmidt, G. L., Dwyer, J. L., Joseph Hughes, M., & Laue, B. (2017). Cloud detection algorithm comparison and validation for operational Landsat data products. Remote Sensing of Environment, 194, 379–390. https://doi.org/10.1016/j.rse.2017.03.026
Frantz, D., Haß, E., Uhl, A., Stoffels, J., & Hill, J. (2018). Improvement of the Fmask algorithm for Sentinel-2 images: Separating clouds from bright surfaces based on parallax effects. Remote Sensing of Environment, 215, 471–481. https://doi.org/10.1016/j.rse.2018.04.046
Friedl, M., Josh, G., & Sulla-Menashe, D. (2019). MCD12Q2 MODIS/Terra+Aqua Land Cover Dynamics Yearly L3 Global 500m SIN Grid V006 [dataset]. NASA EOSDIS Land Processes DAAC. https://doi.org/10.5067/MODIS/MCD12Q2.006
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031
Lewińska, K. E., Ernst, S., Frantz, D., Leser, U., & Hostert, P. (2024). Global Overview of Usable Landsat and Sentinel-2 Data for 1982–2023. Data in Brief, 57. https://doi.org/10.1016/j.dib.2024.111054
Main-Knorn, M., Pflug, B., Louis, J., Debaecker, V., Müller-Wilm, U., & Gascon, F. (2017). Sen2Cor for Sentinel-2. In L. Bruzzone, F. Bovolo,
The Google Reviews & Ratings Dataset provides businesses with structured insights into customer sentiment, satisfaction, and trends based on reviews from Google. Unlike broad review datasets, this product is location-specific: businesses provide the locations they want to track, and we retrieve as much historical data as possible, with daily updates moving forward.
This dataset enables businesses to monitor brand reputation, analyze consumer feedback, and enhance decision-making with real-world insights. For deeper analysis, optional AI-driven sentiment analysis and review summaries are available on a weekly, monthly, or yearly basis.
Dataset Highlights
Use Cases
Data Updates & Delivery
Data Fields Include:
Optional Add-Ons:
Ideal for
Why Choose This Dataset?
By leveraging Google Reviews & Ratings Data, businesses can gain valuable insights into customer sentiment, enhance reputation management, and stay ahead of the competition.