30 datasets found

Libraries.io Data
kaggle.com
zip
Updated Mar 20, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Libraries.io (2019). Libraries.io Data [Dataset]. https://www.kaggle.com/librariesdotio/libraries-io
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset provided by
Libraries.iohttps://libraries.io/
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.

Content

Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software.

https://libraries.io/data

Fork this kernel to get started with this dataset.

Acknowledgements

This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://libraries.io/data — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

https://libraries.io/data

https://bigquery.cloud.google.com/dataset/bigquery-public-data:libraries_io?_ga=2.42277601.-577194880.1523455401

https://console.cloud.google.com/marketplace/details/libraries-io/librariesio

Banner Photo by Caspar Rubin from Unplash.

Inspiration

What are the repositories, avg project size, and avg # of stars?

What are the top dependencies per platform?

What are the top unmaintained or deprecated projects?
GitHub Activity Data
console.cloud.google.com
Updated Jun 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:GitHub&inv=1&invt=Ab41nA (2022). GitHub Activity Data [Dataset]. https://console.cloud.google.com/marketplace/product/github/github-repos
Explore at:
Dataset updated
Jun 23, 2022
Dataset provided by
GitHubhttps://github.com/
Googlehttp://google.com/
Description
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
1000 Cannabis Genomes Project
kaggle.com
zip
Updated Feb 26, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). 1000 Cannabis Genomes Project [Dataset]. https://www.kaggle.com/bigquery/genomics-cannabis
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 26, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Cannabis is a genus of flowering plants in the family Cannabaceae.

Source: https://en.wikipedia.org/wiki/Cannabis

Content

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

https://medium.com/google-cloud/dna-sequencing-of-1000-cannabis-strains-publicly-available-in-google-bigquery-a33430d63998

These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.

All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

The following tables are included in the Cannabis Genomes Project dataset:

Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: http://opencannabisproject.org/ Category: Genomics Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Update frequency: As additional data are released to GenBank View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis

Banner Photo by Rick Proctor from Unplash.

Inspiration

Which Cannabis samples are included in the variants table?

Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

How many variants does each sample have at the THC Synthase gene (THCA1) locus?
Project Sunroof
console.cloud.google.com
Updated Aug 15, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Project%20Sunroof&inv=1&invt=Ab43Fg (2017). Project Sunroof [Dataset]. https://console.cloud.google.com/marketplace/product/project-sunroof/project-sunroof
Explore at:
Dataset updated
Aug 15, 2017
Dataset provided by
Googlehttp://google.com/
Description
As the price of installing solar has gotten less expensive, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive data in mapping and computing resources to use, helping calculate the best solar plan for you. How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don’t worry, Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof’s site . Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. You can see more details about how solar viability is determined by checking out methodology here. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Reddit
redivis.com
application/jsonl +7
Updated Oct 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Redivis Demo Organization (2021). Reddit [Dataset]. https://redivis.com/datasets/prpw-49sqq9ehv
Explore at:
sas, stata, csv, avro, parquet, spss, application/jsonl, arrowAvailable download formats
Dataset updated
Oct 27, 2021
Dataset provided by
Redivis Inc.
Authors
Redivis Demo Organization
Description
Abstract

Reddit posts, 2019-01-01 thru 2019-08-01.

Documentation

Source: https://console.cloud.google.com/bigquery?p=fh-bigquery&page=project
geo-openstreetmap
kaggle.com
zip
Updated Apr 17, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). geo-openstreetmap [Dataset]. https://www.kaggle.com/bigquery/geo-openstreetmap
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 17, 2020
Dataset provided by
Googlehttp://google.com/
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and more than two million registered users who can add data by manual survey, GPS devices, aerial photography, and other free sources.

To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets including OpenStreetMap data, free to access. To facilitate the Kaggle community to access the BigQuery dataset, it is onboarded to Kaggle platform which allows querying it without a linked GCP account. Please note that due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30-days.

Content

This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.

Tables: - history_* tables: full history of OSM objects. - planet_* tables: snapshot of current OSM objects as of Nov 2019.

The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection, additional processing.

Resources

You can read more about OSM elements on the OSM Wiki. This dataset uses BigQuery GEOGRAPHY datatype which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.
Influence of Continuous Integration on the Development Activity in GitHub...
zenodo.org
csv
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1140261
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

were active for one year before the first build with Travis CI (before_ci),

used Travis CI at least for one year (during_ci),

had commit or merge activity on the default branch in both of these phases, and

used the default branch to trigger builds.

To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

The dataset contains the following files:

tr_projects_sample_filtered.csv
A CSV file with information about the 321 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).
The GDELT Project
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The GDELT Project (2019). The GDELT Project [Dataset]. https://www.kaggle.com/datasets/gdelt/gdelt
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset authored and provided by
The GDELT Project
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created. Just the 2015 data alone records nearly three quarters of a trillion emotional snapshots and more than 1.5 billion location references, while its total archives span more than 215 years, making it one of the largest open-access spatio-temporal datasets in existance and pushing the boundaries of "big data" study of global human society. Its Global Knowledge Graph connects the world's people, organizations, locations, themes, counts, images and emotions into a single holistic network over the entire planet. How can you query, explore, model, visualize, interact, and even forecast this vast archive of human society?

Content

GDELT 2.0 has a wealth of features in the event database which includes events reported in articles published in 65 live translated languages, measurements of 2,300 emotions and themes, high resolution views of the non-Western world, relevant imagery, videos, and social media embeds, quotes, names, amounts, and more.

You may find these code books helpful:
GDELT Global Knowledge Graph Codebook V2.1 (PDF)
GDELT Event Codebook V2.0 (PDF)

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. [Fork this kernel to get started][98] to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

You may redistribute, rehost, republish, and mirror any of the GDELT datasets in any form. However, any use or redistribution of the data must include a citation to the GDELT Project and a link to the website (https://www.gdeltproject.org/).
Google Ads Transparency Center
console.cloud.google.com
Updated Aug 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=ko&inv=1&invt=Ab4DDA (2023). Google Ads Transparency Center [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-ads-transparency-center?hl=ko
Explore at:
Dataset updated
Aug 23, 2023
Dataset provided by
Googlehttp://google.com/
BigQueryhttps://cloud.google.com/bigquery
Description
This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic and whether the ad is funded by Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website. About BigQuery This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery . Download Dataset This public dataset is also hosted in Google Cloud Storage here and available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query. See here for options and instructions. Signed out users can download the full dataset by using the gCloud CLI. Follow the instructions here to download and install the gCloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True" To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R" This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
GitHub Repos
kaggle.com
zip
Updated Mar 20, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset provided by
GitHubhttps://github.com/
Authors
Github
Description
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

Inspiration

This is the perfect dataset for fighting language wars.

Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
OpenStreetMap Public Dataset
console.cloud.google.com
Updated Jan 16, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap&inv=1&invt=Ab3rZg (2020). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/details/openstreetmap/geo-openstreetmap
Explore at:
Dataset updated
Jan 16, 2020
Dataset provided by
OpenStreetMap//www.openstreetmap.org/
Googlehttp://google.com/
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and more than two million registered users who can add data by manual survey, GPS devices, aerial photography, and other free sources. We've made available a number of tables (explained in detail below): history_* tables: full history of OSM objects planet_* tables: snapshot of current OSM objects as of Nov 2019 The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection, additional processing. Example analyses are given below. This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program . OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Ethereum Blockchain
kaggle.com
zip
Updated Mar 4, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Ethereum Blockchain [Dataset]. https://www.kaggle.com/datasets/bigquery/ethereum-blockchain
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 4, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Bitcoin and other cryptocurrencies have captured the imagination of technologists, financiers, and economists. Digital currencies are only one application of the underlying blockchain technology. Like its predecessor, Bitcoin, the Ethereum blockchain can be described as an immutable distributed ledger. However, creator Vitalik Buterin also extended the set of capabilities by including a virtual machine that can execute arbitrary code stored on the blockchain as smart contracts.

Both Bitcoin and Ethereum are essentially OLTP databases, and provide little in the way of OLAP (analytics) functionality. However the Ethereum dataset is notably distinct from the Bitcoin dataset:

The Ethereum blockchain has as its primary unit of value Ether, while the Bitcoin blockchain has Bitcoin. However, the majority of value transfer on the Ethereum blockchain is composed of so-called tokens. Tokens are created and managed by smart contracts.

Ether value transfers are precise and direct, resembling accounting ledger debits and credits. This is in contrast to the Bitcoin value transfer mechanism, for which it can be difficult to determine the balance of a given wallet address.

Addresses can be not only wallets that hold balances, but can also contain smart contract bytecode that allows the programmatic creation of agreements and automatic triggering of their execution. An aggregate of coordinated smart contracts could be used to build a decentralized autonomous organization.

Content

The Ethereum blockchain data are now available for exploration with BigQuery. All historical data are in the ethereum_blockchain dataset, which updates daily.

Our hope is that by making the data on public blockchain systems more readily available it promotes technological innovation and increases societal benefits.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum.[TABLENAME]. Fork this kernel to get started.

Acknowledgements

Cover photo by Thought Catalog on Unsplash

Inspiration

What are the most popularly exchanged digital tokens, represented by ERC-721 and ERC-20 smart contracts?

Compare transaction volume and transaction networks over time

Compare transaction volume to historical prices by joining with other available data sources like Bitcoin Historical Data
NOAA GOES-16
kaggle.com
zip
Updated Aug 30, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NOAA (2019). NOAA GOES-16 [Dataset]. https://www.kaggle.com/noaa/goes16
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Aug 30, 2019
Dataset provided by
National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
Authors
NOAA
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

The Geostationary Operational Environmental Satellite-R Series (GOES-R) is the next generation of geostationary weather satellites. The GOES-R series will significantly improve the detection and observation of environmental phenomena that directly affect public safety, protection of property and our nation’s economic health and prosperity.

The GOES-16 satellite, known as GOES-R prior to launch, is the first satellite in the series. It will provide images of weather pattern and severe storms as frequently as every 30 seconds, which will contribute to more accurate and reliable weather forecasts and severe weather outlooks.

Content

The raw dataset includes a feed of the Advanced Baseline Imager (ABI) radiance data (Level 1b) and Cloud and Moisture Imager (CMI) products (Level 2) which are freely available through the NOAA Big Data Project.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgments

The NOAA Big Data Project (BDP) is an experimental collaboration between NOAA and infrastructure-as-a-service (IaaS) providers to explore methods of expand the accessibility of NOAA’s data in order to facilitate innovation and collaboration. The goal of this approach is to help form new lines of business and economic growth while making NOAA's data more discoverable for the American public. https://storage.googleapis.com/public-dataset-images/noaa-goes-16-sample.png" alt="Sample images">

Key metadata for this dataset has been extracted into convenient BigQuery tables (one each for L1b radiance, L2 CMIP, and L2 MCMIP). These tables can be used to query metadata in order to filter the data down to only a subset of raw netcdf4 files available in Google Cloud Storage.
PatCit: A Comprehensive Dataset of Patent Citations
zenodo.org
application/gzip, bin
Updated Dec 23, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise (2020). PatCit: A Comprehensive Dataset of Patent Citations [Dataset]. http://doi.org/10.5281/zenodo.3710994
Explore at:
application/gzip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3710994
Dataset updated
Dec 23, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]

Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non Patent Literature (NPL) citations and in the patent text itself.

Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long hidden information. That's the purpose of this project

IN PRACTICE

A detailed presentation of the current state of the project is available in our March 2020 presentation.

So far, we have:

classified the 40 million NPL citations reported in the DOCDB database in 9 distinct research oriented classes with a 90% accuracy rate.

parsed and consolidated the 27 million NPL citations classified as bibliographical references.

extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all time USPTO patents.

The latest version of the dataset is the v0.15. It is made of the v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.

Give it a try! The dataset is publicly available on Google Cloud BigQuery, just click here.

FEATURES

Open

The code is licensed under MIT-2 and the dataset is licensed under CC4. Two highly permissive licenses.

The project is thought to be dynamically improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.

Comprehensive

We address worldwide patents, as long as the data is available.

We address all classes of citations, not only bibliographical references.

We address front-page and in-text citations.

Highest standards

We use and implement state-of-the art machine learning solutions.

We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparsely, for both environmental sustainability and long term financial sustainability of the project.
Zip Code Tabulation Area (ZCTA)
console.cloud.google.com
Updated Aug 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:United%20States%20Census%20Bureau&hl=zh-CN&inv=1&invt=Ab4A3A (2023). Zip Code Tabulation Area (ZCTA) [Dataset]. https://console.cloud.google.com/marketplace/product/united-states-census-bureau/zcta?hl=zh-CN
Explore at:
Dataset updated
Aug 15, 2023
Dataset provided by
Googlehttp://google.com/
Description
These are the full-resolution boundary zip code tabular areas (ZCTA), derived from the US Census Bureau's TIGER/Line Shapefiles. The dataset contains polygons that roughly approximate each of the USPS 5-digit zip codes. It is one of many geography datasets available in BigQuery through the Google Cloud Public Dataset Program to support geospatial analysis. You can find more information on the other datasets at the US Geographic Boundaries Marketplace page . Though they do not continuously cover all land and water areas in the United States, ZCTAs are a great way to visualize geospatial data in an understandable format with excellent spatial resolution. This dataset gives the area of land and water within each zip code, as well as the corresponding city and state for each zip code. This makes the dataset an excellent candidate for JOINs to support geospatial queries with BigQuery’s GIS capabilities. Note: BQ-GIS is in public beta, so your GCP project will need to be whitelisted to try out these queries. You can sign up to request access here . This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Data from: SOTorrent Data Set
zenodo.org
application/gzip, bin
Updated Jan 7, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes; Sebastian Baltes; Lorik Dumani; Lorik Dumani (2021). SOTorrent Data Set [Dataset]. http://doi.org/10.5281/zenodo.1135263
Explore at:
bin, application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1135263
Dataset updated
Jan 7, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Lorik Dumani; Lorik Dumani
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Beside possible licensing issues, maintenance issues may arise, because the snippets evolve on SO, but the developers who copied the code are not aware of these changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we build SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content on the level of whole posts and individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.
SOTorrent Data Set 2017-07-25
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes; Sebastian Baltes (2020). SOTorrent Data Set 2017-07-25 [Dataset]. http://doi.org/10.5281/zenodo.834864
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.834864
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes
Description
Moved to this Zenodo record: https://zenodo.org/record/1135262

Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Beside possible licensing issues, maintenance issues may arise, because the snippets evolve on SO, but the developers who copied the code are not aware of these changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we build SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content on the level of whole posts and individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.
USPTO OCE Patent Litigation Docket Reports Data
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). USPTO OCE Patent Litigation Docket Reports Data [Dataset]. https://www.kaggle.com/bigquery/uspto-oce-litigation
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

OCE collected all of the data from the Public Access to Court Electronics Records (PACER) and RECAP, an independent project designed to serve as a repository for litigation data sourced from PACER. The final output datasets include information on the litigating parties involved and their attorneys; the cause of action; the court location; important dates in the litigation history; and descriptions of all documents submitted in a given case, which cover more than 5 million separate documents contained in the case docket reports.

Content

USPTO OCE Patent Litigation Docket Reports Data contains detailed patent litigation data on 74,623 unique district court cases filed during the period 1963-2015.

Acknowledgements

"USPTO OCE Patent Litigation Docket Reports Data" by the USPTO, for public use. Marco, A., A. Tesfayesus, A. Toole (2017). “Patent Litigation Data from US District Court Electronic Records (1963-2015).” USPTO Economic Working Paper No. 2017-06.

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_litigation

Banner photo by Samuel Zeller on Unsplash
r
CMS Synthetic Patient Data OMOP
redivis.com
Updated Aug 7, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/workflows/y8de-d3fnwt33n
Explore at:
Dataset updated
Aug 7, 2020
Description
This is a synthetic patient dataset in the OMOP common data model, originally released by CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.
Eclipse Megamovie
console.cloud.google.com
Updated Apr 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Cloud%20Public%20Datasets%20Program&hl=de&inv=1&invt=Ab3zCg (2023). Eclipse Megamovie [Dataset]. https://console.cloud.google.com/marketplace/product/google-cloud-public-datasets/eclipse-megamovie?hl=de
Explore at:
Dataset updated
Apr 22, 2023
Dataset provided by
Googlehttp://google.com/
Description
This is the full set of images submitted for the Eclipse Megamovie project, a citizen science project to capture images of the Sun’s corona during the August 21, 2017 total solar eclipse. These images were taken by volunteer photographers (as well as the general public) from across the country using consumer camera equipment. The Eclipse Megamovie project was a collaboration between UC Berkeley, Google, the Astronomical Society of the Pacific, and many more.* In addition to the dataset, the code used by the project to create the website and process individual movies can be found in GitHub For a full description of the data fields, see below. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery . *Additional partners: Center for Research on Lifelong STEM Learning, Oregon State University, Eclipse Across America, Foothill College, High Altitude Observatory of the National Center for Atmospheric Research, Ideum, Lick Observatory, Space Sciences Laboratory, University of California, Berkeley, University of Colorado at Boulder, Williams College and the IAU Working Group.

Facebook

Twitter

Click to copy link

Link copied

Cite

Libraries.io (2019). Libraries.io Data [Dataset]. https://www.kaggle.com/librariesdotio/libraries-io

Libraries.io Data

Dependency and usage metadata from 25m open source projects (BigQuery Dataset)

Explore at:

87 scholarly articles cite this dataset (View in Google Scholar)

zip(0 bytes)Available download formats

Dataset updated

Mar 20, 2019

Dataset provided by

Libraries.iohttps://libraries.io/

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.

Content

Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software.

https://libraries.io/data

Fork this kernel to get started with this dataset.

Acknowledgements

This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://libraries.io/data — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

https://libraries.io/data

https://bigquery.cloud.google.com/dataset/bigquery-public-data:libraries_io?_ga=2.42277601.-577194880.1523455401

https://console.cloud.google.com/marketplace/details/libraries-io/librariesio

Banner Photo by Caspar Rubin from Unplash.

Inspiration

What are the repositories, avg project size, and avg # of stars?

What are the top dependencies per platform?

What are the top unmaintained or deprecated projects?

Clear search

Close search

Google apps

Main menu

Libraries.io Data

Context

Content

Acknowledgements

Inspiration

GitHub Activity Data

1000 Cannabis Genomes Project

Context

Content

Acknowledgements

Inspiration

Project Sunroof

Reddit

Abstract

Documentation

geo-openstreetmap

Context

Content

Resources

Influence of Continuous Integration on the Development Activity in GitHub...

The GDELT Project

Context

Content

Querying BigQuery tables

Acknowledgements

Google Ads Transparency Center

GitHub Repos

Querying BigQuery tables

Acknowledgements

Inspiration

OpenStreetMap Public Dataset

Ethereum Blockchain

Context

Content

Querying BigQuery tables

Acknowledgements

Inspiration

NOAA GOES-16

Overview

Content

Querying BigQuery tables

Acknowledgments

PatCit: A Comprehensive Dataset of Patent Citations

Zip Code Tabulation Area (ZCTA)

Data from: SOTorrent Data Set

SOTorrent Data Set 2017-07-25

USPTO OCE Patent Litigation Docket Reports Data

Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

Content

Acknowledgements

CMS Synthetic Patient Data OMOP

Eclipse Megamovie

Libraries.io Data

Dependency and usage metadata from 25m open source projects (BigQuery Dataset)

Context

Content

Acknowledgements

Inspiration