30 datasets found

Project Sunroof
console.cloud.google.com
Updated Aug 15, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Project%20Sunroof&inv=1&invt=AbzpTQ (2017). Project Sunroof [Dataset]. https://console.cloud.google.com/marketplace/product/project-sunroof/project-sunroof
Explore at:
Dataset updated
Aug 15, 2017
Dataset provided by
Googlehttp://google.com/
Description
As the price of installing solar has gotten less expensive, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive data in mapping and computing resources to use, helping calculate the best solar plan for you. How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don’t worry, Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof’s site . Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. You can see more details about how solar viability is determined by checking out methodology here. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
geo-openstreetmap
kaggle.com
zip
Updated Apr 17, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). geo-openstreetmap [Dataset]. https://www.kaggle.com/bigquery/geo-openstreetmap
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 17, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and more than two million registered users who can add data by manual survey, GPS devices, aerial photography, and other free sources.

To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets including OpenStreetMap data, free to access. To facilitate the Kaggle community to access the BigQuery dataset, it is onboarded to Kaggle platform which allows querying it without a linked GCP account. Please note that due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30-days.

Content

This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.

Tables: - history_* tables: full history of OSM objects. - planet_* tables: snapshot of current OSM objects as of Nov 2019.

The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection, additional processing.

Resources

You can read more about OSM elements on the OSM Wiki. This dataset uses BigQuery GEOGRAPHY datatype which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.
GDELT 2.0 Event Database
console.cloud.google.com
Updated Jul 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:The%20GDELT%20Project&hl=de&inv=1&invt=AbyoXQ (2023). GDELT 2.0 Event Database [Dataset]. https://console.cloud.google.com/marketplace/product/the-gdelt-project/gdelt-2-events?hl=de
Explore at:
Dataset updated
Jul 15, 2023
Dataset provided by
Googlehttp://google.com/
Description
The GDELT 2.0 Event Database is a global catalog of worldwide activities (“events”) in over 300 categories from protests and military attacks to peace appeals and diplomatic exchanges. Each event record details 58 fields capturing many different attributes of the event. The GDELT 2.0 Event Database currently runs from February 2015 to present, updated every 15 minutes and is comprised of 326 million mentions of 103 million distinct events as of February 19, 2016. This dataset uses machine translation coverage of all monitored content in 65 core languages, with a sample of an additional 35 languages hand translated. It also expands upon GDELT 1.0 by providing a separate MENTIONS table that records every mention of each event, along with the offset, context and confidence of each of those mentions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Influence of Continuous Integration on the Development Activity in GitHub...
zenodo.org
csv
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1140261
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

were active for one year before the first build with Travis CI (before_ci),

used Travis CI at least for one year (during_ci),

had commit or merge activity on the default branch in both of these phases, and

used the default branch to trigger builds.

To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

The dataset contains the following files:

tr_projects_sample_filtered.csv
A CSV file with information about the 321 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).
Bitcoin Blockchain Historical Data
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/bigquery/bitcoin-blockchain
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
Googlehttp://google.com/
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.

Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

Content

In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It’s updated it every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.

Method & Acknowledgements

Allen Day (Twitter | Medium), Google Cloud Developer Advocate & Colin Bookman, Google Cloud Customer Engineer retrieve data from the Bitcoin network using a custom client available on GitHub that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk to two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

Photo by Andre Francois on Unsplash.

Inspiration

How many bitcoins are sent each day?

How many addresses receive bitcoin each day?

Compare transaction volume to historical prices by joining with other available data sources
gnomAD
console.cloud.google.com
Updated Jun 23, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Broad%20Institute%20of%20MIT%20and%20Harvard&inv=1&invt=AbzpLw (2020). gnomAD [Dataset]. https://console.cloud.google.com/marketplace/product/broad-institute/gnomad
Explore at:
Dataset updated
Jun 23, 2020
Dataset provided by
Googlehttp://google.com/
Description
The Genome Aggregation Database (gnomAD) is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects. These public datasets are available in VCF format in Google Cloud Storage and in Google BigQuery as integer range partitioned tables . Each dataset is sharded by chromosome meaning variants are distributed across 24 tables (indicated with “_chr*” suffix). Utilizing the sharded tables reduces query costs significantly. Variant Transforms was used to process these VCF files and import them to BigQuery. VEP annotations were parsed into separate columns for easier analysis using Variant Transforms’ annotation support . These public datasets are included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. Find out more in our blog post, Providing open access to gnomAD on Google Cloud . Questions? Contact gcp-life-sciences-discuss@googlegroups.com.
T
Iowa Liquor Sales
arjunrana.com
datadiscoverystudio.org
+4more
Updated May 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Iowa Department of Revenue, Alcoholic Beverages (2025). Iowa Liquor Sales [Dataset]. https://arjunrana.com/projects/bigquery_ML/
Explore at:
csv, kml, application/geo+json, kmz, application/rssxml, tsv, xml, application/rdfxmlAvailable download formats
Dataset updated
May 1, 2025
Dataset authored and provided by
Iowa Department of Revenue, Alcoholic Beverages
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains the spirits purchase information of Iowa Class “E” liquor licensees by product and date of purchase from January 1, 2012 to current. The dataset can be used to analyze total spirits sales in Iowa of individual products at the store level.

Class E liquor license, for grocery stores, liquor stores, convenience stores, etc., allows commercial establishments to sell liquor for off-premises consumption in original unopened containers.
Reddit
redivis.com
application/jsonl +7
Updated Oct 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Redivis Demo Organization (2021). Reddit [Dataset]. https://redivis.com/datasets/prpw-49sqq9ehv
Explore at:
sas, stata, csv, avro, parquet, spss, application/jsonl, arrowAvailable download formats
Dataset updated
Oct 27, 2021
Dataset provided by
Redivis Inc.
Authors
Redivis Demo Organization
Description
Abstract

Reddit posts, 2019-01-01 thru 2019-08-01.

Documentation

Source: https://console.cloud.google.com/bigquery?p=fh-bigquery&page=project
The GDELT Project
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The GDELT Project (2019). The GDELT Project [Dataset]. https://www.kaggle.com/datasets/gdelt/gdelt
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset authored and provided by
The GDELT Project
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created. Just the 2015 data alone records nearly three quarters of a trillion emotional snapshots and more than 1.5 billion location references, while its total archives span more than 215 years, making it one of the largest open-access spatio-temporal datasets in existance and pushing the boundaries of "big data" study of global human society. Its Global Knowledge Graph connects the world's people, organizations, locations, themes, counts, images and emotions into a single holistic network over the entire planet. How can you query, explore, model, visualize, interact, and even forecast this vast archive of human society?

Content

GDELT 2.0 has a wealth of features in the event database which includes events reported in articles published in 65 live translated languages, measurements of 2,300 emotions and themes, high resolution views of the non-Western world, relevant imagery, videos, and social media embeds, quotes, names, amounts, and more.

You may find these code books helpful:
GDELT Global Knowledge Graph Codebook V2.1 (PDF)
GDELT Event Codebook V2.0 (PDF)

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. [Fork this kernel to get started][98] to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

You may redistribute, rehost, republish, and mirror any of the GDELT datasets in any form. However, any use or redistribution of the data must include a citation to the GDELT Project and a link to the website (https://www.gdeltproject.org/).
Google Ads Transparency Center
console.cloud.google.com
Updated Aug 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=ko&inv=1&invt=AbzpIw (2023). Google Ads Transparency Center [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-ads-transparency-center?hl=ko
Explore at:
Dataset updated
Aug 23, 2023
Dataset provided by
Googlehttp://google.com/
BigQueryhttps://cloud.google.com/bigquery
Description
This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic and whether the ad is funded by Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website. About BigQuery This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery . Download Dataset This public dataset is also hosted in Google Cloud Storage here and available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query. See here for options and instructions. Signed out users can download the full dataset by using the gCloud CLI. Follow the instructions here to download and install the gCloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True" To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R" This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
GitHub Repos
kaggle.com
zip
Updated Mar 20, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset provided by
GitHubhttps://github.com/
Authors
Github
Description
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

Inspiration

This is the perfect dataset for fighting language wars.

Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
Libraries.io Data
console.cloud.google.com
Updated Jul 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Libraries.io&hl=en_GB&inv=1&invt=Abyn4Q (2023). Libraries.io Data [Dataset]. https://console.cloud.google.com/marketplace/product/libraries-io/librariesio?hl=en_GB
Explore at:
Dataset updated
Jul 16, 2023
Dataset provided by
Libraries.iohttps://libraries.io/
Googlehttp://google.com/
Description
Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software. In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
PatCit: A Comprehensive Dataset of Patent Citations
zenodo.org
application/gzip, bin
Updated Dec 23, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise (2020). PatCit: A Comprehensive Dataset of Patent Citations [Dataset]. http://doi.org/10.5281/zenodo.3710994
Explore at:
application/gzip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3710994
Dataset updated
Dec 23, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]

Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non Patent Literature (NPL) citations and in the patent text itself.

Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long hidden information. That's the purpose of this project

IN PRACTICE

A detailed presentation of the current state of the project is available in our March 2020 presentation.

So far, we have:

classified the 40 million NPL citations reported in the DOCDB database in 9 distinct research oriented classes with a 90% accuracy rate.

parsed and consolidated the 27 million NPL citations classified as bibliographical references.

extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all time USPTO patents.

The latest version of the dataset is the v0.15. It is made of the v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.

Give it a try! The dataset is publicly available on Google Cloud BigQuery, just click here.

FEATURES

Open

The code is licensed under MIT-2 and the dataset is licensed under CC4. Two highly permissive licenses.

The project is thought to be dynamically improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.

Comprehensive

We address worldwide patents, as long as the data is available.

We address all classes of citations, not only bibliographical references.

We address front-page and in-text citations.

Highest standards

We use and implement state-of-the art machine learning solutions.

We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparsely, for both environmental sustainability and long term financial sustainability of the project.
NOAA GOES-16
kaggle.com
zip
Updated Aug 30, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NOAA (2019). NOAA GOES-16 [Dataset]. https://www.kaggle.com/noaa/goes16
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Aug 30, 2019
Dataset provided by
National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
Authors
NOAA
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

The Geostationary Operational Environmental Satellite-R Series (GOES-R) is the next generation of geostationary weather satellites. The GOES-R series will significantly improve the detection and observation of environmental phenomena that directly affect public safety, protection of property and our nation’s economic health and prosperity.

The GOES-16 satellite, known as GOES-R prior to launch, is the first satellite in the series. It will provide images of weather pattern and severe storms as frequently as every 30 seconds, which will contribute to more accurate and reliable weather forecasts and severe weather outlooks.

Content

The raw dataset includes a feed of the Advanced Baseline Imager (ABI) radiance data (Level 1b) and Cloud and Moisture Imager (CMI) products (Level 2) which are freely available through the NOAA Big Data Project.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgments

The NOAA Big Data Project (BDP) is an experimental collaboration between NOAA and infrastructure-as-a-service (IaaS) providers to explore methods of expand the accessibility of NOAA’s data in order to facilitate innovation and collaboration. The goal of this approach is to help form new lines of business and economic growth while making NOAA's data more discoverable for the American public. https://storage.googleapis.com/public-dataset-images/noaa-goes-16-sample.png" alt="Sample images">

Key metadata for this dataset has been extracted into convenient BigQuery tables (one each for L1b radiance, L2 CMIP, and L2 MCMIP). These tables can be used to query metadata in order to filter the data down to only a subset of raw netcdf4 files available in Google Cloud Storage.
OpenAQ
kaggle.com
zip
Updated Dec 1, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Open AQ (2017). OpenAQ [Dataset]. https://www.kaggle.com/datasets/open-aq/openaq
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Dec 1, 2017
Dataset authored and provided by
Open AQ
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
OpenAQ is an open-source project to surface live, real-time air quality data from around the world. Their “mission is to enable previously impossible science, impact policy and empower the public to fight air pollution.” The data includes air quality measurements from 5490 locations in 47 countries.

Scientists, researchers, developers, and citizens can use this data to understand the quality of air near them currently. The dataset only includes the most current measurement available for the location (no historical data).

Update Frequency: Weekly

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.openaq.[TABLENAME]. Fork this kernel to get started.

Acknowledgements

Dataset Source: openaq.org

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source and is provided "AS IS" without any warranty, express or implied.
OpenStreetMap Public Dataset
console.cloud.google.com
Updated Jul 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap&hl=en_GB&inv=1&invt=AbxNwQ (2023). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?hl=en_GB
Explore at:
Dataset updated
Jul 17, 2023
Dataset provided by
Googlehttp://google.com/
OpenStreetMap//www.openstreetmap.org/
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and more than two million registered users who can add data by manual survey, GPS devices, aerial photography, and other free sources. We've made available a number of tables (explained in detail below): history_* tables: full history of OSM objects planet_* tables: snapshot of current OSM objects as of Nov 2019 The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection, additional processing. Example analyses are given below. This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program . OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Z
SOTorrent Data Set 2017-07-25
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes (2020). SOTorrent Data Set 2017-07-25 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_834571
Explore at:
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Sebastian Baltes
Description
Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Beside possible licensing issues, maintenance issues may arise, because the snippets evolve on SO, but the developers who copied the code are not aware of these changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we build SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content on the level of whole posts and individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.
World Bank: Education Data
kaggle.com
zip
Updated Mar 20, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
World Bank (2019). World Bank: Education Data [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-bank-intl-education
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset authored and provided by
World Bankhttp://worldbank.org/
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank

Content

This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.

For more information, see the World Bank website.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population

http://data.worldbank.org/data-catalog/ed-stats

https://cloud.google.com/bigquery/public-data/world-bank-education

Citation: The World Bank: Education Statistics

Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @till_indeman from Unplash.

Inspiration

Of total government spending, what percentage is spent on education?
r
CMS Synthetic Patient Data OMOP
redivis.com
Updated Aug 7, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/workflows/y8de-d3fnwt33n
Explore at:
Dataset updated
Aug 7, 2020
Description
This is a synthetic patient dataset in the OMOP common data model, originally released by CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.
Eclipse Megamovie
console.cloud.google.com
Updated Jul 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Cloud%20Public%20Datasets%20Program&hl=fr&inv=1&invt=AbzpoA (2023). Eclipse Megamovie [Dataset]. https://console.cloud.google.com/marketplace/product/google-cloud-public-datasets/eclipse-megamovie?hl=fr
Explore at:
Dataset updated
Jul 20, 2023
Dataset provided by
Googlehttp://google.com/
Description
This is the full set of images submitted for the Eclipse Megamovie project, a citizen science project to capture images of the Sun’s corona during the August 21, 2017 total solar eclipse. These images were taken by volunteer photographers (as well as the general public) from across the country using consumer camera equipment. The Eclipse Megamovie project was a collaboration between UC Berkeley, Google, the Astronomical Society of the Pacific, and many more.* In addition to the dataset, the code used by the project to create the website and process individual movies can be found in GitHub For a full description of the data fields, see below. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery . *Additional partners: Center for Research on Lifelong STEM Learning, Oregon State University, Eclipse Across America, Foothill College, High Altitude Observatory of the National Center for Atmospheric Research, Ideum, Lick Observatory, Space Sciences Laboratory, University of California, Berkeley, University of Colorado at Boulder, Williams College and the IAU Working Group.

Facebook

Twitter

Click to copy link

Link copied

Cite

https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Project%20Sunroof&inv=1&invt=AbzpTQ (2017). Project Sunroof [Dataset]. https://console.cloud.google.com/marketplace/product/project-sunroof/project-sunroof

Project Sunroof

Explore at:

Dataset updated

Aug 15, 2017

Dataset provided by

Googlehttp://google.com/

Description

As the price of installing solar has gotten less expensive, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive data in mapping and computing resources to use, helping calculate the best solar plan for you. How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don’t worry, Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof’s site . Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. You can see more details about how solar viability is determined by checking out methodology here. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

Clear search

Close search

Google apps

Main menu

Project Sunroof

geo-openstreetmap

Context

Content

Resources

GDELT 2.0 Event Database

Influence of Continuous Integration on the Development Activity in GitHub...

Bitcoin Blockchain Historical Data

Context

Content

Querying BigQuery tables

Method & Acknowledgements

Inspiration

gnomAD

Iowa Liquor Sales

Reddit

Abstract

Documentation

The GDELT Project

Context

Content

Querying BigQuery tables

Acknowledgements

Google Ads Transparency Center

GitHub Repos

Querying BigQuery tables

Acknowledgements

Inspiration

Libraries.io Data

PatCit: A Comprehensive Dataset of Patent Citations

NOAA GOES-16

Overview

Content

Querying BigQuery tables

Acknowledgments

OpenAQ

Querying BigQuery tables

Acknowledgements

OpenStreetMap Public Dataset

SOTorrent Data Set 2017-07-25

World Bank: Education Data

Context

Content

Acknowledgements

Inspiration

CMS Synthetic Patient Data OMOP

Eclipse Megamovie

Project Sunroof