26 datasets found
  1. Daily Global Historical Climatology Network

    • kaggle.com
    zip
    Updated Aug 30, 2019
    Cite
    NOAA (2019). Daily Global Historical Climatology Network [Dataset]. https://www.kaggle.com/noaa/ghcn-d
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Aug 30, 2019
    Dataset provided by
    National Oceanic and Atmospheric Administration: http://www.noaa.gov/
    Authors
    NOAA
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Weather is the state of the atmosphere, describing for example the degree to which it is hot or cold, wet or dry, calm or stormy, clear or cloudy. Source: https://en.wikipedia.org/wiki/Weather

    Content

    NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. Two GHCN datasets are available in BigQuery, the GHCN-D (daily) and the GHCN-M (monthly). The data included in the GHCN datasets are obtained from more than 20 sources, including some data from every year since 1763.

    For a complete description of data variables available in this dataset, see NOAA’s readme.txt: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt

    Update Frequency: daily

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:ghcn_d

    https://cloud.google.com/bigquery/public-data/noaa-ghcn

    Dataset Source: NOAA. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by Max LaRochelle from Unsplash.

    Inspiration

    Find weather stations close to a specific location?

    Daily rainfall amounts at specific station?

    Pulling daily min/max temperature (in Celsius) and rainfall (in mm) for the past 14 days?
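
    A minimal sketch of the last question, using the BigQuery Python client library. The per-year table layout (ghcnd_2019), column names, and the TMIN/TMAX/PRCP element codes follow NOAA's readme.txt; the station ID is a placeholder, and raw values are stored in tenths of degrees Celsius and tenths of millimeters.

        # Daily min/max temperature (C) and rainfall (mm) for the last 14 days
        # covered by the 2019 table, for one (placeholder) station.
        from google.cloud import bigquery

        client = bigquery.Client()  # needs a GCP project with BigQuery enabled

        query = """
        SELECT
          date,
          MAX(IF(element = 'TMIN', value / 10.0, NULL)) AS tmin_c,
          MAX(IF(element = 'TMAX', value / 10.0, NULL)) AS tmax_c,
          MAX(IF(element = 'PRCP', value / 10.0, NULL)) AS rain_mm
        FROM `bigquery-public-data.ghcn_d.ghcnd_2019`
        WHERE id = 'USW00094846'  -- placeholder station id
          AND date >= DATE_SUB(DATE '2019-08-30', INTERVAL 14 DAY)
        GROUP BY date
        ORDER BY date
        """
        for row in client.query(query).result():
            print(row.date, row.tmin_c, row.tmax_c, row.rain_mm)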

  2. COKI Language Dataset

    • zenodo.org
    application/gzip, csv
    Updated Jun 16, 2022
    Cite
    James P. Diprose; Cameron Neylon (2022). COKI Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.6636625
    Explore at:
    Available download formats: application/gzip, csv
    Dataset updated
    Jun 16, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    James P. Diprose; Cameron Neylon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The COKI Language Dataset contains predictions for 122 million academic publications. The dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

    Methodology
    A subset of the COKI Academic Observatory Dataset, which is produced by the Academic Observatory Workflows codebase [1], was extracted, converted to CSV with BigQuery, and downloaded to a virtual machine. The subset consists of all publications with DOIs in our dataset, including each publication’s title and abstract from both Crossref Metadata and Microsoft Academic Graph (MAG). The CSV files were then processed with a Python script. The titles and abstracts for each record were pre-processed, concatenated together and analysed with fastText. The titles and abstracts from Crossref Metadata were used first, with the MAG titles and abstracts serving as a fallback when the Crossref Metadata information was empty. Language was predicted for each publication using the fastText lid.176.bin language identification model [2]. fastText was chosen because of its high accuracy and fast runtime speed [3]. The final output dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.
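
    A minimal sketch of that prediction step, assuming the lid.176.bin model file has been downloaded from the fastText site ([2]); the title and abstract strings are placeholders.

        # Language identification with fastText, as described above.
        import fasttext

        model = fasttext.load_model("lid.176.bin")  # download from fasttext.cc

        title = "A large-scale study of research publications"   # placeholder
        abstract = "We analyse the language of 122m articles."   # placeholder
        text = f"{title} {abstract}".replace("\n", " ")  # fastText expects one line

        labels, probs = model.predict(text, k=1)
        iso_code = labels[0].replace("__label__", "")  # e.g. 'en'
        print(iso_code, float(probs[0]))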

    Query or Download
    The data is publicly accessible in BigQuery in the following two tables:

    When you make queries on these tables, make sure that you are in your own Google Cloud project, otherwise the queries will fail.

    See the COKI Language Detection README for instructions on how to download the data from Zenodo and load it into BigQuery.

    Code
    The code that generated this dataset, the BigQuery schemas and instructions for loading the data into BigQuery can be found here: https://github.com/The-Academic-Observatory/coki-language

    License
    COKI Language Dataset © 2022 by Curtin University is licensed under CC BY 4.0.

    Attributions
    This work contains information from:

    References
    [1] https://doi.org/10.5281/zenodo.6366695
    [2] https://fasttext.cc/docs/en/language-identification.html
    [3] https://modelpredict.com/language-identification-survey

  3. Chicago Taxi Trips

    • kaggle.com
    • data.cityofchicago.org
    zip
    Updated Apr 18, 2018
    Cite
    City of Chicago (2018). Chicago Taxi Trips [Dataset]. https://www.kaggle.com/datasets/chicago/chicago-taxi-trips-bq
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 18, 2018
    Dataset authored and provided by
    City of Chicago
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Context

    Taxicabs in Chicago, Illinois, are operated by private companies and licensed by the city. There are about seven thousand licensed cabs operating within the city limits. Licenses are obtained through the purchase or lease of a taxi medallion which is then affixed to the top right hood of the car. Source: https://en.wikipedia.org/wiki/Taxicabs_of_the_United_States#Chicago

    Content

    This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. See http://digital.cityofchicago.org/index.php/chicago-taxi-data-released for more information about this dataset and how it was created.

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips

    https://cloud.google.com/bigquery/public-data/chicago-taxi

    Dataset Source: City of Chicago

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://data.cityofchicago.org — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by Ferdinand Stohr from Unsplash.

    Inspiration

    What are the maximum, minimum and average fares for rides lasting 10 minutes or more? Which drop-off areas have the highest average tip? How does trip duration affect fare rates for trips lasting less than 90 minutes?
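
    A minimal sketch of the first question, via the BigQuery Python client; the taxi_trips table name and its trip_seconds/fare columns are assumptions about the public BigQuery copy of this data.

        # Max / min / average fares for rides lasting 10 minutes or more.
        from google.cloud import bigquery

        client = bigquery.Client()
        query = """
        SELECT
          MAX(fare) AS max_fare,
          MIN(fare) AS min_fare,
          ROUND(AVG(fare), 2) AS avg_fare
        FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
        WHERE trip_seconds >= 10 * 60
          AND fare IS NOT NULL
        """
        for row in client.query(query).result():
            print(row.max_fare, row.min_fare, row.avg_fare)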

    Chart: taxi fares by trip duration (https://cloud.google.com/bigquery/images/chicago-taxi-fares-by-duration.png)

  4. Dimensions.ai: Comprehensive Dataset for Research & Innovation

    • console.cloud.google.com
    Updated Nov 12, 2020
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Digital%20Science%20%26%20Research%20Solutions%20Inc (2020). Dimensions.ai: Comprehensive Dataset for Research & Innovation [Dataset]. https://console.cloud.google.com/marketplace/product/digitalscience-public/dimensions-ai
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    Google: http://google.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dimensions is the largest database of research insight in the world. It represents the most comprehensive collection of linked data related to the global research and innovation ecosystem available in a single platform. Because Dimensions maps the entire research lifecycle, you can follow academic and industry research from early-stage funding, through to output, and on to social and economic impact. Businesses, governments, universities, investors, funders and researchers around the world use Dimensions to inform their research strategy and make evidence-based decisions on the R&D and innovation landscape. With Dimensions on Google BigQuery, you can seamlessly combine Dimensions data with your own private and external datasets; integrate with Business Intelligence and data visualization tools; and analyze billions of data points in seconds to create the actionable insights your organization needs.

    Examples of usage:

    - Competitive intelligence
    - Horizon-scanning & emerging trends
    - Innovation landscape mapping
    - Academic & industry partnerships and collaboration networks
    - Key Opinion Leader (KOL) identification
    - Recruitment & talent
    - Performance & benchmarking
    - Tracking funding dollar flows and citation patterns
    - Literature gap analysis
    - Marketing and communication strategy
    - Social and economic impact of research

    About the data: Dimensions is updated daily and constantly growing. It contains over 112m linked research publications, 1.3bn+ citations, 5.6m+ grants worth $1.7 trillion+ in funding, 41m+ patents, 600k+ clinical trials, 100k+ organizations, 65m+ disambiguated researchers and more. The data is normalized, linked, and ready for analysis.

    Dimensions is available as a subscription offering. For more information, please visit www.dimensions.ai/bigquery and a member of our team will be in touch shortly. If you would like to try our data for free, select "try sample" to see our openly available Covid-19 data. Learn more

  5. geo-openstreetmap

    • kaggle.com
    zip
    Updated Apr 17, 2020
    Cite
    Google BigQuery (2020). geo-openstreetmap [Dataset]. https://www.kaggle.com/bigquery/geo-openstreetmap
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 17, 2020
    Dataset provided by
    BigQuery: https://cloud.google.com/bigquery
    Google: http://google.com/
    Authors
    Google BigQuery
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

    To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To make it easier for the Kaggle community to use the BigQuery dataset, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.

    Content

    This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.

    Tables:

    - history_* tables: full history of OSM objects.
    - planet_* tables: snapshot of current OSM objects as of Nov 2019.

    The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types, plus a changeset table corresponding to OSM edits, for convenient access. The objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated on with the built-in geography functions to perform geometry and feature selection and additional processing.

    Resources

    You can read more about OSM elements on the OSM Wiki. This dataset uses BigQuery GEOGRAPHY datatype which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.
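
    A minimal sketch of one such geography query, run through the BigQuery Python client; the planet_nodes table and its all_tags/geometry columns are assumptions based on the table groups described above.

        # Count cafe nodes within 1 km of a point using GEOGRAPHY functions.
        from google.cloud import bigquery

        client = bigquery.Client()
        query = """
        SELECT COUNT(*) AS cafes
        FROM `bigquery-public-data.geo_openstreetmap.planet_nodes`
        WHERE ST_DWITHIN(geometry, ST_GEOGPOINT(-87.6298, 41.8781), 1000)
          AND EXISTS (
            SELECT 1 FROM UNNEST(all_tags) AS tag
            WHERE tag.key = 'amenity' AND tag.value = 'cafe'
          )
        """
        print(list(client.query(query).result()))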

  6. GitHub Activity Data

    • console.cloud.google.com
    Updated Jun 23, 2022
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:GitHub&inv=1&invt=Ab41nA (2022). GitHub Activity Data [Dataset]. https://console.cloud.google.com/marketplace/product/github/github-repos
    Explore at:
    Dataset updated
    Jun 23, 2022
    Dataset provided by
    Google: http://google.com/
    GitHub: https://github.com/
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories, including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. (What is BigQuery?)
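
    A minimal sketch of a regular-expression search over file contents, using the smaller sample_files/sample_contents tables (assumed table and column names) to limit the bytes scanned against the free tier.

        # Top repositories by count of Python files importing numpy.
        from google.cloud import bigquery

        client = bigquery.Client()
        query = r"""
        SELECT f.repo_name, COUNT(*) AS hits
        FROM `bigquery-public-data.github_repos.sample_files` AS f
        JOIN `bigquery-public-data.github_repos.sample_contents` AS c
          ON f.id = c.id
        WHERE f.path LIKE '%.py'
          AND REGEXP_CONTAINS(c.content, r'import\s+numpy')
        GROUP BY f.repo_name
        ORDER BY hits DESC
        LIMIT 10
        """
        for row in client.query(query).result():
            print(row.repo_name, row.hits)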

  7. Google Analytics Sample

    • console.cloud.google.com
    Updated Jul 15, 2017
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data
    Explore at:
    Dataset updated
    Jul 15, 2017
    Dataset provided by
    Google: http://google.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data. Learn more about the data.

    The data is typical of what an ecommerce website would see and includes the following information:

    Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic.
    Content data: information about the behavior of users on the site, such as URLs of pages that visitors look at and how they interact with content.
    Transactional data: information about the transactions on the Google Merchandise Store website.

    Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated (such as fullVisitorId) or removed (such as clientId, adWordsClickInfo and geoNetwork). “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying fields containing no data.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. (What is BigQuery?)
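
    A minimal sketch of a query over the documented ga_sessions_* wildcard tables, via the BigQuery Python client:

        # Sessions and transactions by traffic source for August 2016.
        from google.cloud import bigquery

        client = bigquery.Client()
        query = """
        SELECT
          trafficSource.source AS source,
          SUM(totals.visits) AS sessions,
          SUM(totals.transactions) AS transactions
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
        GROUP BY source
        ORDER BY sessions DESC
        LIMIT 10
        """
        for row in client.query(query).result():
            print(row.source, row.sessions, row.transactions)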

  8. Libraries.io Data

    • console.cloud.google.com
    Updated Aug 29, 2017
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Libraries.io (2017). Libraries.io Data [Dataset]. https://console.cloud.google.com/marketplace/product/libraries-io/librariesio
    Explore at:
    Dataset updated
    Aug 29, 2017
    Dataset provided by
    Libraries.io: https://libraries.io/
    Google: http://google.com/
    Description

    Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software. In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. (What is BigQuery?)
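
    A minimal sketch of one query over the BigQuery copy of this data; the libraries_io.projects table and its platform column are assumptions.

        # Project counts per package manager.
        from google.cloud import bigquery

        client = bigquery.Client()
        query = """
        SELECT platform, COUNT(*) AS projects
        FROM `bigquery-public-data.libraries_io.projects`
        GROUP BY platform
        ORDER BY projects DESC
        """
        for row in client.query(query).result():
            print(row.platform, row.projects)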

  9. (No) Influence of Continuous Integration on the Development Activity in...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    Cite
    Sebastian Baltes; Jascha Knack (2020). (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset [Dataset]. http://doi.org/10.5281/zenodo.1183244
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sebastian Baltes; Jascha Knack
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    1. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
    2. were active for at least one year (365 days) before the first build with Travis CI (before_ci),
    3. used Travis CI at least for one year (during_ci),
    4. had commit or merge activity on the default branch in both of these phases, and
    5. used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent BigQuery data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

    We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:

    1. have Java or Ruby as their project language
    2. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)
    3. have commit activity for at least two years (730 days)
    4. are engineered software projects (at least 10 watchers)
    5. were not in the TravisTorrent dataset

    In total, 8,046 projects satisfied those constraints. We drew a random sample of 100 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample.

    This dataset contains the following files:

    tr_projects_sample_filtered_2.csv
    A CSV file with information about the 113 selected projects.

    tr_sample_commits_default_branch_before_ci.csv
    tr_sample_commits_default_branch_during_ci.csv

    One CSV file each, with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The branch to which the commit was made.
    hash_value: The SHA1 hash value of the commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.

    tr_sample_merges_default_branch_before_ci.csv
    tr_sample_merges_default_branch_during_ci.csv

    One CSV file each, with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The destination branch of the merge.
    hash_value: The SHA1 hash value of the merge commit.
    merged_commits: Unique hash value prefixes of the commits merged with this commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
    pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    source_branch: Source branch of the pull request (extracted from log message).

    comparison_project_sample_100.csv
    A CSV file with information about the 100 projects in the comparison sample.

    commits_default_branch_before_mid.csv
    commits_default_branch_after_mid.csv

    One CSV file each, with information about all commits to the default branch before and after the midpoint of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.

    merges_default_branch_before_mid.csv
    merges_default_branch_after_mid.csv

    One CSV file each, with information about all merges into the default branch before and after the midpoint of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
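
    A minimal sketch of loading two of the commit files listed above with pandas and comparing per-project activity before and during CI; the aggregation choice is illustrative, not part of the original study.

        import pandas as pd

        before = pd.read_csv("tr_sample_commits_default_branch_before_ci.csv")
        during = pd.read_csv("tr_sample_commits_default_branch_during_ci.csv")

        for phase, df in [("before_ci", before), ("during_ci", during)]:
            # Per-project commit counts and churn, using the documented columns.
            per_project = df.groupby("project").agg(
                commits=("hash_value", "count"),
                lines_added=("lines_added", "sum"),
                lines_deleted=("lines_deleted", "sum"),
            )
            print(phase)
            print(per_project.head())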

  10. notional-python

    • huggingface.co
    Updated Dec 24, 2021
    Cite
    Notional Project (2021). notional-python [Dataset]. https://huggingface.co/datasets/notional/notional-python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 24, 2021
    Authors
    Notional Project
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for notional-python

      Dataset Summary
    

    The Notional-python dataset contains Python code files from 100 well-known repositories gathered from the Google BigQuery GitHub dataset. The dataset was created to test the ability of programming language models. Follow our repo to do the model evaluation using the notional-python dataset.

      Languages
    

    Python

      Dataset Creation

      Curation Rationale

    Notional-python was built to provide a dataset for… See the full description on the dataset page: https://huggingface.co/datasets/notional/notional-python.

  11. stackexchange

    • huggingface.co
    Cite
    Albert Gong, stackexchange [Dataset]. https://huggingface.co/datasets/ag2435/stackexchange
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Albert Gong
    Description

    StackExchange Dataset

    Working doc: https://docs.google.com/document/d/1h585bH5sYcQW4pkHzqWyQqA4ape2Bq6o1Cya0TkMOQc/edit?usp=sharing

    BigQuery query (see so_bigquery.ipynb):

        CREATE TEMP TABLE answers AS
        SELECT *
        FROM `bigquery-public-data.stackoverflow.posts_answers`
        WHERE LOWER(body) LIKE '%arxiv%';

        CREATE TEMP TABLE questions AS
        SELECT *
        FROM `bigquery-public-data.stackoverflow.posts_questions`;

        SELECT *
        FROM answers
        JOIN questions
          ON questions.id = answers.parent_id;

    NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ag2435/stackexchange.

  12. cms-medicare

    • kaggle.com
    zip
    Updated Apr 21, 2020
    Cite
    Google BigQuery (2020). cms-medicare [Dataset]. https://www.kaggle.com/datasets/bigquery/cms-medicare
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 21, 2020
    Dataset provided by
    BigQuery: https://cloud.google.com/bigquery
    Authors
    Google BigQuery
    Description

    Context

    This dataset contains Hospital General Information from the U.S. Department of Health & Human Services, hosted as part of the BigQuery public datasets program. The data contains a list of all hospitals that have been registered with Medicare, including addresses, phone numbers, hospital types and quality-of-care information. The quality-of-care data is provided for over 4,000 Medicare-certified hospitals, including over 130 Veterans Administration (VA) medical centers, across the country. You can use this data to find hospitals and compare the quality of their care.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.cms_medicare.hospital_general_info.
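
    A minimal sketch of such a query with the BigQuery Python client library; the column names are assumptions about the hospital_general_info table.

        from google.cloud import bigquery

        client = bigquery.Client()
        query = """
        SELECT hospital_name, city, state, hospital_overall_rating
        FROM `bigquery-public-data.cms_medicare.hospital_general_info`
        LIMIT 5
        """
        df = client.query(query).to_dataframe()  # requires pandas
        print(df)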

    Sample Query

    How do the hospitals in Mountain View, CA compare to the average hospital in the US? With the hospital compare data you can quickly understand how hospitals in one geographic location compare to another location. In this example query we compare Google’s home in Mountain View, California, to the average hospital in the United States. You can also modify the query to learn how the hospitals in your city compare to the US national average.

        #standardSQL
        SELECT MTV_AVG_HOSPITAL_RATING, US_AVG_HOSPITAL_RATING
        FROM (
          SELECT ROUND(AVG(CAST(hospital_overall_rating AS int64)), 2) AS MTV_AVG_HOSPITAL_RATING
          FROM `bigquery-public-data.cms_medicare.hospital_general_info`
          WHERE city = 'MOUNTAIN VIEW'
            AND state = 'CA'
            AND hospital_overall_rating <> 'Not Available'
        ) MTV
        JOIN (
          SELECT ROUND(AVG(CAST(hospital_overall_rating AS int64)), 2) AS US_AVG_HOSPITAL_RATING
          FROM `bigquery-public-data.cms_medicare.hospital_general_info`
          WHERE hospital_overall_rating <> 'Not Available'
        ) US
          ON 1 = 1

    What are the most common diseases treated at hospitals that do well in the category of patient readmissions? For hospitals that achieved “Above the national average” in the category of patient readmissions, it might be interesting to review the types of diagnoses that are treated at those inpatient facilities. While this query won’t provide the granular detail that went into the readmission calculation, it gives us a quick glimpse into the top diagnosis related groups (DRG), or classifications of inpatient stays, that are found at those hospitals. By joining the general hospital information to the inpatient charge data, also provided by CMS, you could quickly identify DRGs that may warrant additional research. You can also modify the query to review the top diagnosis related groups for hospital metrics you might be interested in.

        #standardSQL
        SELECT
          drg_definition,
          SUM(total_discharges) AS total_discharge_per_drg
        FROM `bigquery-public-data.cms_medicare.hospital_general_info` gi
        INNER JOIN `bigquery-public-data.cms_medicare.inpatient_charges_2015` ic
          ON gi.provider_id = ic.provider_id
        WHERE readmission_national_comparison = 'Above the national average'
        GROUP BY drg_definition
        ORDER BY total_discharge_per_drg DESC
        LIMIT 10;

  13. Google Ads Transparency Center

    • console.cloud.google.com
    Updated Oct 5, 2020
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&inv=1&invt=Ab6ARQ (2020). Google Ads Transparency Center [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-ads-transparency-center
    Explore at:
    Dataset updated
    Oct 5, 2020
    Dataset provided by
    BigQuery: https://cloud.google.com/bigquery
    Google: http://google.com/
    Description

    This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

    About BigQuery: This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing, so each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. (What is BigQuery?)

    Download Dataset: This public dataset is also hosted in Google Cloud Storage here and is available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query; see here for options and instructions.

    Signed-out users can download the full dataset using the gcloud CLI. Follow the instructions here to download and install the gcloud CLI.

    To remove the login requirement, run:

        $ gcloud config set auth/disable_credentials True

    To download the dataset, run:

        $ gcloud storage cp gs://ads-transparency-center/* . -R

  14. Medicare and Medicaid Services

    • kaggle.com
    zip
    Updated Apr 22, 2020
    Cite
    Google BigQuery (2020). Medicare and Medicaid Services [Dataset]. https://www.kaggle.com/datasets/bigquery/sdoh-hrsa-shortage-areas
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    BigQuery: https://cloud.google.com/bigquery
    Authors
    Google BigQuery
    Description

    Context

    This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarize counts of enrollees who are dually eligible for both the Medicare and Medicaid programs, including those in Medicare Savings Programs. “Duals” represent 20 percent of all Medicare beneficiaries, yet they account for 34 percent of all spending by the program, according to the Commonwealth Fund. As a representation of this high-needs, high-cost population, these data offer a view of regions ripe for more intensive care coordination that can address complex social and clinical needs. In addition to the high cost-savings opportunity of delivering upstream clinical interventions, this population represents the county-by-county volume of patients who are eligible for both state-level (Medicaid) and federal-level (Medicare) reimbursements and potential funding streams to address unmet social needs across various programs, waivers, and other projects. The dataset includes eligibility type and enrollment by quarter, at both the state and county level. These data represent monthly snapshots submitted by states to the CMS, which are inherently lower than ever-enrolled counts (which include persons enrolled at any time during a calendar year). For more information on dually eligible beneficiaries

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.sdoh_cms_dual_eligible_enrollment.

    Sample Query

    In what counties in Michigan has the number of dual-eligible individuals increased the most from 2015 to 2018? Find the counties in Michigan which have experienced the largest increase in dual-enrollment households:

        -- Note: the opening WITH keyword and the duals_Jan_2018 CTE were missing
        -- from this listing; they are reconstructed here from the columns and
        -- aliases the rest of the query references.
        WITH duals_Jan_2018 AS (
          SELECT Public_Total AS duals_2018, County_Name, FIPS
          FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
          WHERE State_Abbr = 'MI' AND Date = '2018-12-01'
        ),

        duals_Jan_2015 AS (
          SELECT Public_Total AS duals_2015, County_Name, FIPS
          FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
          WHERE State_Abbr = 'MI' AND Date = '2015-12-01'
        ),

        duals_increase AS (
          SELECT d18.FIPS, d18.County_Name, d15.duals_2015, d18.duals_2018,
                 (d18.duals_2018 - d15.duals_2015) AS total_duals_diff
          FROM duals_Jan_2018 d18
          JOIN duals_Jan_2015 d15
            ON d18.FIPS = d15.FIPS
        )

        SELECT *
        FROM duals_increase
        WHERE total_duals_diff IS NOT NULL
        ORDER BY total_duals_diff DESC

  15. OpenStreetMap Public Dataset

    • console.cloud.google.com
    Updated Jan 16, 2020
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap (2020). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap
    Explore at:
    Dataset updated
    Jan 16, 2020
    Dataset provided by
    OpenStreetMap: https://www.openstreetmap.org/
    Google: http://google.com/
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

    We've made available a number of tables (explained in detail below):

    - history_* tables: full history of OSM objects.
    - planet_* tables: snapshot of current OSM objects as of Nov 2019.

    The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types, plus a changeset table corresponding to OSM edits, for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated on with the built-in geography functions to perform geometry and feature selection and additional processing. Example analyses are given below.

    This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program. OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. (What is BigQuery?)

  16. github-jupyter-parsed

    • huggingface.co
    Cite
    CodeParrot, github-jupyter-parsed [Dataset]. https://huggingface.co/datasets/codeparrot/github-jupyter-parsed
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Jupyter Dataset

      Dataset Description
    

    This is a parsed and preprocessed version of the GitHub-Jupyter Dataset, a dataset extracted from Jupyter Notebooks on BigQuery. We only keep markdown and Python cells and convert the markdown to text. Some heuristics are also applied to filter out notebooks with little data and very long or very short cells.
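
    A minimal sketch of the kind of cell filtering described above (not the dataset's actual pipeline): keep markdown and Python code cells from one .ipynb file and drop very short or very long cells. The length thresholds are illustrative.

        import json

        def keep_markdown_and_python(path, min_len=10, max_len=10_000):
            """Return (cell_type, text) pairs for markdown and code cells."""
            with open(path, encoding="utf-8") as f:
                nb = json.load(f)
            cells = []
            for cell in nb.get("cells", []):
                if cell.get("cell_type") not in ("markdown", "code"):
                    continue
                text = "".join(cell.get("source", []))
                if min_len <= len(text) <= max_len:  # illustrative heuristic
                    cells.append((cell["cell_type"], text))
            return cells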

      Licenses
    

    Each example has the license of its associated repository. There are in total 15 licenses: [ 'mit'… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/github-jupyter-parsed.

  17. CMS Synthetic Patient Data OMOP

    • redivis.com
    application/jsonl +7
    Updated Aug 19, 2020
    Cite
    Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
    Explore at:
    Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 1, 2008 - Dec 31, 2010
    Description

    Abstract

    This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

    Methodology

    This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies. See the diagram below from the Observational Health Data Sciences and Informatics (OHDSI) webpage.

    Diagram: Why-CDM.png (https://redivis.com/fileUploads/d1a95a4e-074a-44d1-92e5-9adfd2f4068a)

    Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHSDI OMOP site.

    Usage

    - For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF

    - For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page: https://github.com/OHDSI/ETL-CMS

    - For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM and that this dataset works with the 5.2 version.

  18. Ethereum Blockchain

    • kaggle.com
    Updated Jul 3, 2021
    Cite
    Metis A.I. (2021). Ethereum Blockchain [Dataset]. http://doi.org/10.34740/kaggle/dsv/2390341
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 3, 2021
    Dataset provided by
    Kaggle
    Authors
    Metis A.I.
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Using Ethereum public data to look for trading signals has become a trend, but Google BigQuery is way too costly for it. This forever-free public dataset is created and updated so that the public can avoid the charges from GCP.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  19. Project Sunroof

    • console.cloud.google.com
    Updated Aug 15, 2017
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Project%20Sunroof (2017). Project Sunroof [Dataset]. https://console.cloud.google.com/marketplace/product/project-sunroof/project-sunroof
    Explore at:
    Dataset updated
    Aug 15, 2017
    Dataset provided by
    Google: http://google.com/
    Description

    As the price of installing solar has gotten less expensive, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive data in mapping and computing resources to use, helping calculate the best solar plan for you.

    How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don’t worry, Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof’s site.

    Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. You can see more details about how solar viability is determined by checking out the methodology here.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. (What is BigQuery?)

  20. About COVID-19 Public Datasets

    • console.cloud.google.com
    Updated Jun 19, 2022
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program&hl=ES&inv=1&invt=Ab5SgA (2022). About COVID-19 Public Datasets [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program?hl=ES
    Explore at:
    Dataset updated
    Jun 19, 2022
    Dataset provided by
    BigQuery: https://cloud.google.com/bigquery
    Google: http://google.com/
    Description

    In an effort to help combat COVID-19, we created a COVID-19 Public Datasets program to make data more accessible to researchers, data scientists and analysts. The program will host a repository of public datasets that relate to the COVID-19 crisis and make them free to access and analyze. These include datasets from the New York Times, European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, and OpenStreetMap.

    Free hosting and queries of COVID datasets: As with all data in the Google Cloud Public Datasets Program, Google pays for storage of datasets in the program. BigQuery also provides free queries over certain COVID-related datasets to support the response to COVID-19. Queries on COVID datasets will not count against the BigQuery sandbox free tier, where you can query up to 1TB free each month.

    Limitations and duration: Queries of COVID data are free. If, during your analysis, you join COVID datasets with non-COVID datasets, the bytes processed in the non-COVID datasets will be counted against the free tier, then charged accordingly, to prevent abuse. Queries of COVID datasets will remain free until Sept 15, 2021.

    The contents of these datasets are provided to the public strictly for educational and research purposes only. We are not onboarding or managing PHI or PII data as part of the COVID-19 Public Dataset Program. Google has practices & policies in place to ensure that data is handled in accordance with widely recognized patient privacy and data security policies. See the list of all datasets included in the program.
