7 datasets found
  1. geo-openstreetmap

    • kaggle.com
    zip
    Updated Apr 17, 2020
    Cite
    Google BigQuery (2020). geo-openstreetmap [Dataset]. https://www.kaggle.com/bigquery/geo-openstreetmap
    Explore at:
    23 scholarly articles cite this dataset (View in Google Scholar)
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 17, 2020
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and now has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

    To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To give the Kaggle community access to the BigQuery dataset, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that, due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.

    Content

    This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.

    Tables:
    - history_* tables: full history of OSM objects.
    - planet_* tables: snapshot of current OSM objects as of Nov 2019.

    The history_* and planet_* table groups are each composed of node, way, relation, and changeset tables. The first three are the primary OSM data types; the changeset table corresponds to OSM edits and is included for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type, so they can be operated on with the built-in geography functions for geometry and feature selection and additional processing.
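
    A minimal sketch of querying these tables with the geography functions (the dataset path and the all_tags and geometry column names below are assumptions based on this listing and may differ in the actual schema):

    ```SQL
    -- Count cafe nodes inside a bounding box around central London.
    SELECT COUNT(*) AS cafe_count
    FROM `bigquery-public-data.geo_openstreetmap.planet_nodes` AS nodes,
         UNNEST(nodes.all_tags) AS tag
    WHERE tag.key = 'amenity'
      AND tag.value = 'cafe'
      AND ST_WITHIN(
            nodes.geometry,
            ST_GEOGFROMTEXT('POLYGON((-0.2 51.45, 0.0 51.45, 0.0 51.55, -0.2 51.55, -0.2 51.45))'));
    ```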

    Resources

    You can read more about OSM elements on the OSM Wiki. This dataset uses the BigQuery GEOGRAPHY data type, which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.

  2. OpenStreetMap Public Dataset

    • console.cloud.google.com
    Updated Apr 23, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap&hl=de (2023). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?hl=de
    Explore at:
    Dataset updated
    Apr 23, 2023
    Dataset provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    Google (http://google.com/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and now has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

    We've made available a number of tables (explained in detail below):
    - history_* tables: full history of OSM objects
    - planet_* tables: snapshot of current OSM objects as of Nov 2019

    The history_* and planet_* table groups are each composed of node, way, relation, and changeset tables. The first three are the primary OSM data types; the changeset table corresponds to OSM edits and is included for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type, so they can be operated on with the built-in geography functions for geometry and feature selection and additional processing. Example analyses are given below.

    This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program. OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1 TB/mo of free tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
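
    As a sketch of the kind of analysis the description mentions (the table path and the all_tags column are assumptions based on this listing and may differ in the actual schema):

    ```SQL
    -- Ten most common highway types among OSM ways.
    SELECT tag.value AS highway_type,
           COUNT(*) AS way_count
    FROM `bigquery-public-data.geo_openstreetmap.planet_ways` AS ways,
         UNNEST(ways.all_tags) AS tag
    WHERE tag.key = 'highway'
    GROUP BY highway_type
    ORDER BY way_count DESC
    LIMIT 10;
    ```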

  3. ChEMBL EBI Small Molecules Database

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/bigquery/ebi-chembl
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

    Content

    ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

    Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

    Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

    Fork this notebook to get started: it shows how to access data in the BigQuery dataset using the BQhelper package to write SQL queries.
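
    For illustration, a query along these lines could be run against the BigQuery copy (the release-suffixed table name molecule_dictionary_23 is an assumption based on the ChEMBL 23 schema linked above; max_phase is the standard ChEMBL development-phase column):

    ```SQL
    -- Count molecules by maximum development phase reached (4 = approved drug).
    SELECT max_phase,
           COUNT(*) AS molecule_count
    FROM `patents-public-data.ebi_chembl.molecule_dictionary_23`
    GROUP BY max_phase
    ORDER BY max_phase DESC;
    ```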

    Acknowledgements

    “ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

    Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

    Banner photo by rawpixel on Unsplash

  4. Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a...

    • figshare.com
    pdf
    Updated Nov 19, 2025
    Cite
    Simon Porter; Ruth Whittam; Liz Allen; Veronique Kiermer (2025). Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a retrospective analysis of the diversity of contributions to published research output [Dataset]. http://doi.org/10.6084/m9.figshare.28816703.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Simon Porter; Ruth Whittam; Liz Allen; Veronique Kiermer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About this notebook

    This notebook was created using a helper script, Base.ipynb, which provides helper functions that push output directly to Datawrapper to generate the graphs included in the opinion piece. To run without the helper functions, using BigQuery alone, first install the client library:

    ```
    !pip install google-cloud-bigquery
    ```

    then add:

    ```
    from google.cloud import bigquery
    from google.cloud.bigquery import magics

    project_id = "your_project"  # update as needed
    magics.context.project = project_id
    bq_params = {}
    client = bigquery.Client(project=project_id)
    %load_ext google.cloud.bigquery
    ```

    and finally comment out the make_chart lines.

    About dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw

    dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is an internal table produced by running a process over the text of publications to identify trust-marker segments, including author contributions.

    The process aims to automatically segment research papers into their constituent sections. It operates by identifying headings within the text based on a pre-defined set of patterns and a rule-based system. The system first cleans and normalizes the input text. It then employs regular expressions to detect potential section headings. These potential headings are validated against a set of rules that consider factors such as capitalization, the context of surrounding words, and the typical order of sections within a research paper (e.g., certain sections not appearing after "References" or before "Abstract"). Specific rules also handle exceptions for particular heading types such as "Keywords" or "Appendices". Once valid headings are identified, the system extracts the corresponding textual content for each section. The output is a structured representation of the paper, categorizing text segments under their respective heading types; any text that doesn't fall under a recognized heading is identified as unlabeled content. The overall process aims to provide a structured understanding of the document's organization for subsequent analysis.

    Author Contributions segments are identified using the following regexes:

    ```
    "author_contributions": [
        "((credit|descript(ion(?:s)?|ive)| )*author(s|'s|ship|s')?( |contribution(?:s)?|statement(?:s)?|role(?:s)?){2,})",
        "contribution(?:s)"
    ]
    ```

    Access to dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is available to peer reviewers of the opinion piece. Datasets that allow external validation of the CRediT contribution-identification process have also been produced.

  5. Cyclistic Summary Data

    • kaggle.com
    zip
    Updated Oct 12, 2023
    Cite
    Sen Zhong (2023). Cyclistic Summary Data [Dataset]. https://www.kaggle.com/datasets/senzhong/cyclistic-summary-data
    Explore at:
    Available download formats: zip (83436633 bytes)
    Dataset updated
    Oct 12, 2023
    Authors
    Sen Zhong
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    -- Queries in SQL for the ETL process.

    ```SQL
    -- Creating the first target table to capture the entire year.
    SELECT
      TRI.usertype,
      ZIPSTART.zip_code AS zip_code_start,
      ZIPSTARTNAME.borough AS borough_start,
      ZIPSTARTNAME.neighborhood AS neighborhood_start,
      ZIPEND.zip_code AS zip_code_end,
      ZIPENDNAME.borough AS borough_end,
      ZIPENDNAME.neighborhood AS neighborhood_end,
      -- Since this is a fictional dashboard, we will add 6 years to make it look recent
      DATE_ADD(DATE(TRI.starttime), INTERVAL 6 YEAR) AS start_day,
      DATE_ADD(DATE(TRI.stoptime), INTERVAL 6 YEAR) AS stop_day,
      WEA.temp AS day_mean_temperature,    -- Mean temperature
      WEA.wdsp AS day_mean_wind_speed,     -- Mean wind speed
      WEA.prcp AS day_total_precipitation, -- Total precipitation
      -- Group trips into 10 minute intervals to reduce the number of rows
      ROUND(CAST(TRI.tripduration / 60 AS INT64), -1) AS trip_minutes,
      COUNT(TRI.bikeid) AS trip_count
    FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS TRI
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPSTART
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.start_station_longitude, TRI.start_station_latitude),
           ZIPSTART.zip_code_geom)
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPEND
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.end_station_longitude, TRI.end_station_latitude),
           ZIPEND.zip_code_geom)
    INNER JOIN `bigquery-public-data.noaa_gsod.gsod20*` AS WEA
      ON PARSE_DATE("%Y%m%d", CONCAT(WEA.year, WEA.mo, WEA.da)) = DATE(TRI.starttime)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPSTARTNAME
      ON ZIPSTART.zip_code = CAST(ZIPSTARTNAME.zip AS STRING)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPENDNAME
      ON ZIPEND.zip_code = CAST(ZIPENDNAME.zip AS STRING)
    WHERE
      -- This takes the weather data from New York Central Park, weather station id 94728
      WEA.wban = '94728'
      -- Use data from 2014 and 2015
      AND EXTRACT(YEAR FROM DATE(TRI.starttime)) BETWEEN 2014 AND 2015
    GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13;
    ```

    ```SQL
    -- Creating the second target table to capture summer seasons.
    -- We will define summer as June to August.
    SELECT
      TRI.usertype,
      TRI.start_station_longitude,
      TRI.start_station_latitude,
      TRI.end_station_longitude,
      TRI.end_station_latitude,
      ZIPSTART.zip_code AS zip_code_start,
      ZIPSTARTNAME.borough AS borough_start,
      ZIPSTARTNAME.neighborhood AS neighborhood_start,
      ZIPEND.zip_code AS zip_code_end,
      ZIPENDNAME.borough AS borough_end,
      ZIPENDNAME.neighborhood AS neighborhood_end,
      -- Since we're using trips from 2014 and 2015, we will add 6 years to make it look recent
      DATE_ADD(DATE(TRI.starttime), INTERVAL 6 YEAR) AS start_day,
      DATE_ADD(DATE(TRI.stoptime), INTERVAL 6 YEAR) AS stop_day,
      WEA.temp AS day_mean_temperature,    -- Mean temperature
      WEA.wdsp AS day_mean_wind_speed,     -- Mean wind speed
      WEA.prcp AS day_total_precipitation, -- Total precipitation
      -- We will group trips into 10 minute intervals, which also reduces the number of rows
      ROUND(CAST(TRI.tripduration / 60 AS INT64), -1) AS trip_minutes,
      TRI.bikeid
    FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS TRI
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPSTART
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.start_station_longitude, TRI.start_station_latitude),
           ZIPSTART.zip_code_geom)
    INNER JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPEND
      ON ST_WITHIN(
           ST_GEOGPOINT(TRI.end_station_longitude, TRI.end_station_latitude),
           ZIPEND.zip_code_geom)
    INNER JOIN `bigquery-public-data.noaa_gsod.gsod20*` AS WEA
      ON PARSE_DATE("%Y%m%d", CONCAT(WEA.year, WEA.mo, WEA.da)) = DATE(TRI.starttime)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPSTARTNAME
      ON ZIPSTART.zip_code = CAST(ZIPSTARTNAME.zip AS STRING)
    INNER JOIN `my-project-for-da-cert-1.cyclistic.nyc_zips` AS ZIPENDNAME
      ON ZIPEND.zip_code = CAST(ZIPENDNAME.zip AS STRING)
    WHERE
      -- Take the weather from the same New York Central Park weather station, id 94728
      WEA.wban = '94728'
      -- Use data for the three summer months
      AND DATE(TRI.starttime) BETWEEN DATE('2015-06-01') AND DATE('2015-08-31');
    ```

  6. solar_panel_data

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Jessica K (2025). solar_panel_data [Dataset]. https://www.kaggle.com/datasets/klementine86/solar-cleaned
    Explore at:
    Available download formats: zip (27203308 bytes)
    Dataset updated
    Nov 5, 2025
    Authors
    Jessica K
    Description

    From the public dataset in BigQuery: bigquery-public-data.sunroof_solar.solar_potential_by_censustract. Duplicate entries were removed, along with entries containing null values.

    The data was cleaned using the following script:
    ```SQL
    WITH solar AS (
      SELECT *
      FROM (
        SELECT rn, region_name, count_qualified
        FROM (
          SELECT
            ROW_NUMBER() OVER (PARTITION BY region_name ORDER BY count_qualified DESC) AS rn,
            region_name,
            count_qualified,
            kw_total
          FROM `bigquery-public-data.sunroof_solar.solar_potential_by_censustract` AS solar
          ORDER BY region_name)
        WHERE kw_total IS NOT NULL)
      WHERE rn = 1)

    SELECT b.*
    FROM solar s
    JOIN `bigquery-public-data.sunroof_solar.solar_potential_by_censustract` b
      ON s.region_name = b.region_name
      AND s.count_qualified = b.count_qualified
    ORDER BY s.region_name
    ```

    To filter out regions in Alaska, Hawaii, and Puerto Rico, add:

    ```SQL
    WHERE SUBSTRING(region_name, 1, 2) NOT IN ('02', '15', '72')
    ```
    (dataset solar_contiguous)

  7. Methane Emissions Around The World (1990-2018)-BQ

    • kaggle.com
    zip
    Updated Jun 5, 2022
    Cite
    Mohamed Sofiene Kadri (2022). Methane Emissions Around The World (1990-2018)-BQ [Dataset]. https://www.kaggle.com/datasets/kadrisofiane/methane-emissions-around-the-world-19902018bq
    Explore at:
    Available download formats: zip (65585 bytes)
    Dataset updated
    Jun 5, 2022
    Authors
    Mohamed Sofiene Kadri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    This dataset is just a copy of the Methane Emissions Around The World (1990-2018) dataset provided by @koustubhk. The only difference is that I renamed the "2018" through "1990" columns by manually adding the word "year" to each year column name.

    This was done for the sole purpose of making the dataset "loadable" into BigQuery. BigQuery's schema auto-detection sets headers by detecting strings in the first row of a .csv file. Since the original .csv file has numerical values in the first row, schema detection produces an error: it either loads the dataset with generic column names such as "string_field_0", or fails if the schema was edited manually.

    The solution is to transform the header values into string values. With this copy, you can upload the dataset to BigQuery without error messages or generic column names.

