7 datasets found
  1. geo-openstreetmap

    • kaggle.com
    zip
    Updated Apr 17, 2020
    Cite
    Google BigQuery (2020). geo-openstreetmap [Dataset]. https://www.kaggle.com/bigquery/geo-openstreetmap
    Explore at:
    23 scholarly articles cite this dataset (View in Google Scholar)
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 17, 2020
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and now has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

    To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To give the Kaggle community access to the BigQuery dataset, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that, due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.

    Content

    This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.

    Tables:
    - history_* tables: full history of OSM objects.
    - planet_* tables: snapshot of current OSM objects as of Nov 2019.

    The history_* and planet_* table groups are each composed of node, way, relation, and changeset tables. The first three are the primary OSM data types; the changeset table corresponds to OSM edits and is included for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type, so they can be operated on with the built-in geography functions for geometry and feature selection and additional processing.
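
    A minimal sketch of querying these tables with the geography functions (the dataset path and the all_tags and geometry column names below are assumptions based on this listing and may differ in the actual schema):

    ```SQL
    -- Count cafe nodes inside a bounding box around central London.
    SELECT COUNT(*) AS cafe_count
    FROM `bigquery-public-data.geo_openstreetmap.planet_nodes` AS nodes,
         UNNEST(nodes.all_tags) AS tag
    WHERE tag.key = 'amenity'
      AND tag.value = 'cafe'
      AND ST_WITHIN(
            nodes.geometry,
            ST_GEOGFROMTEXT('POLYGON((-0.2 51.45, 0.0 51.45, 0.0 51.55, -0.2 51.55, -0.2 51.45))'));
    ```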

    Resources

    You can read more about OSM elements on the OSM Wiki. This dataset uses the BigQuery GEOGRAPHY data type, which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.

  2. OpenStreetMap Public Dataset

    • console.cloud.google.com
    Updated Apr 23, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap&hl=de (2023). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?hl=de
    Explore at:
    Dataset updated
    Apr 23, 2023
    Dataset provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    Google (http://google.com/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and now has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

    We've made available a number of tables (explained in detail below):
    - history_* tables: full history of OSM objects
    - planet_* tables: snapshot of current OSM objects as of Nov 2019

    The history_* and planet_* table groups are each composed of node, way, relation, and changeset tables. The first three are the primary OSM data types; the changeset table corresponds to OSM edits and is included for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type, so they can be operated on with the built-in geography functions for geometry and feature selection and additional processing. Example analyses are given below.

    This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program. OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1 TB/mo of free tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
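
    As a sketch of the kind of analysis the description mentions (the table path and the all_tags column are assumptions based on this listing and may differ in the actual schema):

    ```SQL
    -- Ten most common highway types among OSM ways.
    SELECT tag.value AS highway_type,
           COUNT(*) AS way_count
    FROM `bigquery-public-data.geo_openstreetmap.planet_ways` AS ways,
         UNNEST(ways.all_tags) AS tag
    WHERE tag.key = 'highway'
    GROUP BY highway_type
    ORDER BY way_count DESC
    LIMIT 10;
    ```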

  3. ChEMBL EBI Small Molecules Database

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/bigquery/ebi-chembl
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

    Content

    ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

    Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

    Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

    Fork this notebook to get started: it shows how to access data in the BigQuery dataset using the BQhelper package to write SQL queries.
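
    For illustration, a query along these lines could be run against the BigQuery copy (the release-suffixed table name molecule_dictionary_23 is an assumption based on the ChEMBL 23 schema linked above; max_phase is the standard ChEMBL development-phase column):

    ```SQL
    -- Count molecules by maximum development phase reached (4 = approved drug).
    SELECT max_phase,
           COUNT(*) AS molecule_count
    FROM `patents-public-data.ebi_chembl.molecule_dictionary_23`
    GROUP BY max_phase
    ORDER BY max_phase DESC;
    ```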

    Acknowledgements

    “ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

    Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

    Banner photo by rawpixel on Unsplash

  4. Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a...

    • figshare.com
    pdf
    Updated Nov 19, 2025
    Cite
    Simon Porter; Ruth Whittam; Liz Allen; Veronique Kiermer (2025). Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a retrospective analysis of the diversity of contributions to published research output [Dataset]. http://doi.org/10.6084/m9.figshare.28816703.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Simon Porter; Ruth Whittam; Liz Allen; Veronique Kiermer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About this notebook

    This notebook was created using a helper script, Base.ipynb, which provides helper functions that push output directly to Datawrapper to generate the graphs included in the opinion piece. To run without the helper functions, using BigQuery alone, first install the client library:

    ```
    !pip install google-cloud-bigquery
    ```

    then add:

    ```
    from google.cloud import bigquery
    from google.cloud.bigquery import magics

    project_id = "your_project"  # update as needed
    magics.context.project = project_id
    bq_params = {}
    client = bigquery.Client(project=project_id)
    %load_ext google.cloud.bigquery
    ```

    and finally comment out the make_chart lines.

    About dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw

    dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is an internal table produced by running a process over the text of publications to identify trust-marker segments, including author contributions.

    The process aims to automatically segment research papers into their constituent sections. It operates by identifying headings within the text based on a pre-defined set of patterns and a rule-based system. The system first cleans and normalizes the input text. It then employs regular expressions to detect potential section headings. These potential headings are validated against a set of rules that consider factors such as capitalization, the context of surrounding words, and the typical order of sections within a research paper (e.g., certain sections not appearing after "References" or before "Abstract"). Specific rules also handle exceptions for particular heading types such as "Keywords" or "Appendices". Once valid headings are identified, the system extracts the corresponding textual content for each section. The output is a structured representation of the paper, categorizing text segments under their respective heading types; any text that doesn't fall under a recognized heading is identified as unlabeled content. The overall process aims to provide a structured understanding of the document's organization for subsequent analysis.

    Author Contributions segments are identified using the following regexes:

    ```
    "author_contributions": [
        "((credit|descript(ion(?:s)?|ive)| )*author(s|'s|ship|s')?( |contribution(?:s)?|statement(?:s)?|role(?:s)?){2,})",
        "contribution(?:s)"
    ]
    ```

    Access to dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is available to peer reviewers of the opinion piece. Datasets that allow external validation of the CRediT contribution-identification process have also been produced.

  5. Cyclistic Summary Data

    • kaggle.com
    zip
    Updated Oct 12, 2023
    Cite
    Sen Zhong (2023). Cyclistic Summary Data [Dataset]. https://www.kaggle.com/datasets/senzhong/cyclistic-summary-data
    Explore at:
    Available download formats: zip (83436633 bytes)
    Dataset updated
    Oct 12, 2023
    Authors
    Sen Zhong
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    -- Queries in SQL for the ETL process.

    ```SQL
    -- Creating the first target table to capture the entire year.
    SELECT
      TRI.usertype,
      ZIPSTART.zip_code AS zip_code_start,
      ZIPSTARTNAME.borough AS borough_start,
      ZIPSTARTNAME.neighborhood AS neighborhood_start,
      ZIPEND.zip_code AS zip_code_end,
      ZIPENDNAME.borough AS borough_end,
      ZIPENDNAME.neighborhood AS neighborhood_end,
      -- Since this is a fictional dashboard, we will add 6 years to make it look recent
      DATE_ADD(DATE(TRI.starttime), INTERVAL 6 YEAR) AS start_day,
      DATE_ADD(DATE(TRI.stoptime), INTERVAL 6 YEAR) AS stop_day,
      WEA.temp AS day_mean_temperature,    -- Mean temperature
      WEA.wdsp AS day_mean_wind_speed,     -- Mean wind speed
      WEA.prcp AS day_total_precipitation, -- Total precipitation
      -- Group trips into 10 minute intervals to reduce the number of rows
      ROUND(CAST(TRI.tripduration / 60 AS INT64), -1) AS trip_minutes,
      COUNT(TRI.bikeid) AS trip_count
    FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS TRI
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPSTART
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.start_station_longitude, TRI.start_station_latitude),
           ZIPSTART.zip_code_geom)
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPEND
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.end_station_longitude, TRI.end_station_latitude),
           ZIPEND.zip_code_geom)
    INNER JOIN `bigquery-public-data.noaa_gsod.gsod20*` AS WEA
      ON PARSE_DATE("%Y%m%d", CONCAT(WEA.year, WEA.mo, WEA.da)) = DATE(TRI.starttime)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPSTARTNAME
      ON ZIPSTART.zip_code = CAST(ZIPSTARTNAME.zip AS STRING)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPENDNAME
      ON ZIPEND.zip_code = CAST(ZIPENDNAME.zip AS STRING)
    WHERE
      -- This takes the weather data from New York Central Park, weather station id 94728
      WEA.wban = '94728'
      -- Use data from 2014 and 2015
      AND EXTRACT(YEAR FROM DATE(TRI.starttime)) BETWEEN 2014 AND 2015
    GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13;
    ```

    ```SQL
    -- Creating the second target table to capture summer seasons.
    -- We will define summer as June to August.
    SELECT
      TRI.usertype,
      TRI.start_station_longitude,
      TRI.start_station_latitude,
      TRI.end_station_longitude,
      TRI.end_station_latitude,
      ZIPSTART.zip_code AS zip_code_start,
      ZIPSTARTNAME.borough AS borough_start,
      ZIPSTARTNAME.neighborhood AS neighborhood_start,
      ZIPEND.zip_code AS zip_code_end,
      ZIPENDNAME.borough AS borough_end,
      ZIPENDNAME.neighborhood AS neighborhood_end,
      -- Since we're using trips from 2014 and 2015, we will add 6 years to make it look recent
      DATE_ADD(DATE(TRI.starttime), INTERVAL 6 YEAR) AS start_day,
      DATE_ADD(DATE(TRI.stoptime), INTERVAL 6 YEAR) AS stop_day,
      WEA.temp AS day_mean_temperature,    -- Mean temperature
      WEA.wdsp AS day_mean_wind_speed,     -- Mean wind speed
      WEA.prcp AS day_total_precipitation, -- Total precipitation
      -- We will group trips into 10 minute intervals, which also reduces the number of rows
      ROUND(CAST(TRI.tripduration / 60 AS INT64), -1) AS trip_minutes,
      TRI.bikeid
    FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS TRI
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPSTART
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.start_station_longitude, TRI.start_station_latitude),
           ZIPSTART.zip_code_geom)
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPEND
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.end_station_longitude, TRI.end_station_latitude),
           ZIPEND.zip_code_geom)
    INNER JOIN `bigquery-public-data.noaa_gsod.gsod20*` AS WEA
      ON PARSE_DATE("%Y%m%d", CONCAT(WEA.year, WEA.mo, WEA.da)) = DATE(TRI.starttime)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPSTARTNAME
      ON ZIPSTART.zip_code = CAST(ZIPSTARTNAME.zip AS STRING)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPENDNAME
      ON ZIPEND.zip_code = CAST(ZIPENDNAME.zip AS STRING)
    WHERE
      -- Take the weather from the same New York Central Park weather station, id 94728
      WEA.wban = '94728'
      -- Use data for the three summer months
      AND DATE(TRI.starttime) BETWEEN DATE('2015-06-01') AND DATE('2015-08-31');
    ```

  6. solar_panel_data

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Jessica K (2025). solar_panel_data [Dataset]. https://www.kaggle.com/datasets/klementine86/solar-cleaned
    Explore at:
    Available download formats: zip (27203308 bytes)
    Dataset updated
    Nov 5, 2025
    Authors
    Jessica K
    Description

    From the public dataset in BigQuery: bigquery-public-data.sunroof_solar.solar_potential_by_censustract. Duplicate entries were removed, along with entries containing null values.

    The data was cleaned using the following script:
    ```SQL
    WITH solar AS (
      SELECT *
      FROM (
        SELECT rn, region_name, count_qualified
        FROM (
          SELECT
            ROW_NUMBER() OVER (PARTITION BY region_name ORDER BY count_qualified DESC) AS rn,
            region_name,
            count_qualified,
            kw_total
          FROM `bigquery-public-data.sunroof_solar.solar_potential_by_censustract` AS solar
          ORDER BY region_name)
        WHERE kw_total IS NOT NULL)
      WHERE rn = 1)

    SELECT b.*
    FROM solar s
    JOIN `bigquery-public-data.sunroof_solar.solar_potential_by_censustract` b
      ON s.region_name = b.region_name
      AND s.count_qualified = b.count_qualified
    ORDER BY s.region_name
    ```

    To filter out regions in Alaska, Hawaii, and Puerto Rico, add:

    ```SQL
    WHERE SUBSTRING(region_name, 1, 2) NOT IN ('02', '15', '72')
    ```
    (dataset solar_contiguous)

  7. Methane Emissions Around The World (1990-2018)-BQ

    • kaggle.com
    zip
    Updated Jun 5, 2022
    Cite
    Mohamed Sofiene Kadri (2022). Methane Emissions Around The World (1990-2018)-BQ [Dataset]. https://www.kaggle.com/datasets/kadrisofiane/methane-emissions-around-the-world-19902018bq
    Explore at:
    Available download formats: zip (65585 bytes)
    Dataset updated
    Jun 5, 2022
    Authors
    Mohamed Sofiene Kadri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    This dataset is just a copy of the Methane Emissions Around The World (1990-2018) dataset provided by @koustubhk. The only difference is that I renamed the "2018" through "1990" columns by manually adding the word "year" to each year column name.

    This was done for the sole purpose of making the dataset "loadable" into BigQuery. BigQuery's schema auto-detection sets headers by detecting strings in the first row of a .csv file. Since the original .csv file has numerical values in the first row, schema detection produces an error: it either loads the dataset with generic column names such as "string_field_0", or fails if the schema was edited manually.

    The solution is to transform the header values into string values. With this copy, you can upload the dataset to BigQuery without error messages or generic column names.

