https://creativecommons.org/publicdomain/zero/1.0/
Weather is the state of the atmosphere, describing for example the degree to which it is hot or cold, wet or dry, calm or stormy, clear or cloudy. Source: https://en.wikipedia.org/wiki/Weather
NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. Two GHCN datasets are available in BigQuery, the GHCN-D (daily) and the GHCN-M (monthly). The data included in the GHCN datasets are obtained from more than 20 sources, including some data from every year since 1763.
For a complete description of data variables available in this dataset, see NOAA’s readme.txt: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt
Update Frequency: daily
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:ghcn_d
https://cloud.google.com/bigquery/public-data/noaa-ghcn
Dataset Source: NOAA. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by Max LaRochelle from Unsplash.
Find weather stations close to a specific location?
Daily rainfall amounts at specific station?
Pulling daily min/max temperature (in Celsius) and rainfall (in mm) for the past 14 days?
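Answering the temperature and rainfall question above depends on the units GHCN-D uses. According to NOAA's readme.txt, TMAX and TMIN are recorded in tenths of degrees Celsius and PRCP in tenths of millimetres. The sketch below shows the conversion; the function name `to_standard_units` and the sample readings are illustrative assumptions, and it assumes the BigQuery tables carry the same raw values as the NOAA files.

```python
# Elements stored in tenths of a unit, per NOAA's readme.txt:
# TMAX/TMIN are tenths of degrees Celsius, PRCP is tenths of millimetres.
TENTHS_ELEMENTS = {"TMAX", "TMIN", "PRCP"}


def to_standard_units(element: str, raw_value: float) -> float:
    """Convert a raw GHCN-D value to degrees Celsius or millimetres.

    Elements not listed in TENTHS_ELEMENTS are returned unchanged
    (an assumption of this sketch; check the readme for other elements).
    """
    if element in TENTHS_ELEMENTS:
        return raw_value / 10.0
    return raw_value


if __name__ == "__main__":
    # Hypothetical raw readings for one station-day.
    for element, raw in [("TMAX", 256), ("TMIN", 131), ("PRCP", 48)]:
        print(element, to_standard_units(element, raw))
```

In a query, the same scaling would typically be applied by dividing the raw value column by 10 for these elements.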
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COKI Language Dataset contains predictions for 122 million academic publications. The dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.
Methodology
A subset of the COKI Academic Observatory Dataset, which is produced by the Academic Observatory Workflows codebase [1], was extracted and converted to CSV with BigQuery and downloaded to a virtual machine. The subset consists of all publications with DOIs in our dataset, including each publication’s title and abstract from both Crossref Metadata and Microsoft Academic Graph (MAG). The CSV files were then processed with a Python script. The titles and abstracts for each record were pre-processed, concatenated together and analysed with fastText. The titles and abstracts from Crossref Metadata were used first, with the MAG titles and abstracts serving as a fallback when the Crossref Metadata information was empty. Language was predicted for each publication using the fastText lid.176.bin language identification model [2]. fastText was chosen because of its high accuracy and fast runtime speed [3]. The final output dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.
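The fallback and concatenation step described above can be sketched as follows. This is not the project's actual script: the function names are hypothetical, the pre-processing is reduced to whitespace normalisation, and the fallback is assumed to apply per field (title and abstract separately). The fastText call itself is left out so the sketch runs without the model file.

```python
import re
from typing import Optional


def clean(text: Optional[str]) -> str:
    """Collapse runs of whitespace (including newlines); None becomes ''."""
    return re.sub(r"\s+", " ", text or "").strip()


def build_input_text(
    crossref_title: Optional[str],
    crossref_abstract: Optional[str],
    mag_title: Optional[str],
    mag_abstract: Optional[str],
) -> str:
    """Prefer Crossref fields; fall back to MAG when a Crossref field is empty."""
    title = clean(crossref_title) or clean(mag_title)
    abstract = clean(crossref_abstract) or clean(mag_abstract)
    # Concatenate whichever parts are non-empty into one single-line string.
    return " ".join(part for part in (title, abstract) if part)


if __name__ == "__main__":
    # Hypothetical record: Crossref has a title but no abstract.
    print(build_input_text("A  study of\nrivers", None, "Ignored", "Un résumé"))
```

The resulting single-line string is what would be handed to the language identification model.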
Query or Download
The data is publicly accessible in BigQuery in the following two tables:
When you query these tables, make sure that you are in your own Google Cloud project; otherwise the queries will fail.
See the COKI Language Detection README for instructions on how to download the data from Zenodo and load it into BigQuery.
Code
The code that generated this dataset, the BigQuery schemas and instructions for loading the data into BigQuery can be found here: https://github.com/The-Academic-Observatory/coki-language
License
COKI Language Dataset © 2022 by Curtin University is licenced under CC BY 4.0.
Attributions
This work contains information from:
References
[1] https://doi.org/10.5281/zenodo.6366695
[2] https://fasttext.cc/docs/en/language-identification.html
[3] https://modelpredict.com/language-identification-survey
https://creativecommons.org/publicdomain/zero/1.0/
Taxicabs in Chicago, Illinois, are operated by private companies and licensed by the city. There are about seven thousand licensed cabs operating within the city limits. Licenses are obtained through the purchase or lease of a taxi medallion which is then affixed to the top right hood of the car. Source: https://en.wikipedia.org/wiki/Taxicabs_of_the_United_States#Chicago
This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. See http://digital.cityofchicago.org/index.php/chicago-taxi-data-released for more information about this dataset and how it was created.
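To illustrate the time rounding described above, here is a minimal Python sketch that rounds a timestamp to the nearest 15 minutes. The function name is hypothetical, and the tie-breaking rule (exact midpoints round up) is an assumption; the City's actual procedure is not documented in this description.

```python
from datetime import datetime, timedelta

INTERVAL_SECONDS = 15 * 60  # 15-minute buckets


def round_to_quarter_hour(ts: datetime) -> datetime:
    """Round a timestamp to the nearest 15 minutes (midpoints round up)."""
    seconds_past_hour = ts.minute * 60 + ts.second
    rounded = (
        (seconds_past_hour + INTERVAL_SECONDS // 2) // INTERVAL_SECONDS
    ) * INTERVAL_SECONDS
    # Start from the top of the hour, then add the rounded offset; the
    # timedelta handles roll-over into the next hour or day.
    top_of_hour = ts.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(seconds=rounded)


if __name__ == "__main__":
    print(round_to_quarter_hour(datetime(2016, 1, 1, 10, 7, 29)))
    print(round_to_quarter_hour(datetime(2016, 1, 1, 10, 7, 30)))
```

Because of this rounding, trip durations derived from start and end timestamps are only accurate to within the bucket size.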
Fork this kernel to get started.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips
https://cloud.google.com/bigquery/public-data/chicago-taxi
Dataset Source: City of Chicago
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source —https://data.cityofchicago.org — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by Ferdinand Stohr from Unsplash.
What are the maximum, minimum and average fares for rides lasting 10 minutes or more?
Which drop-off areas have the highest average tip?
How does trip duration affect fare rates for trips lasting less than 90 minutes?
https://cloud.google.com/bigquery/images/chicago-taxi-fares-by-duration.png
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dimensions is the largest database of research insight in the world. It represents the most comprehensive collection of linked data related to the global research and innovation ecosystem available in a single platform. Because Dimensions maps the entire research lifecycle, you can follow academic and industry research from early stage funding, through to output and on to social and economic impact. Businesses, governments, universities, investors, funders and researchers around the world use Dimensions to inform their research strategy and make evidence-based decisions on the R&D and innovation landscape. With Dimensions on Google BigQuery, you can seamlessly combine Dimensions data with your own private and external datasets; integrate with Business Intelligence and data visualization tools; and analyze billions of data points in seconds to create the actionable insights your organization needs.
Examples of usage:
- Competitive intelligence
- Horizon-scanning & emerging trends
- Innovation landscape mapping
- Academic & industry partnerships and collaboration networks
- Key Opinion Leader (KOL) identification
- Recruitment & talent
- Performance & benchmarking
- Tracking funding dollar flows and citation patterns
- Literature gap analysis
- Marketing and communication strategy
- Social and economic impact of research
About the data: Dimensions is updated daily and constantly growing. It contains over 112m linked research publications, 1.3bn+ citations, 5.6m+ grants worth $1.7 trillion+ in funding, 41m+ patents, 600k+ clinical trials, 100k+ organizations, 65m+ disambiguated researchers and more. The data is normalized, linked, and ready for analysis.
Dimensions is available as a subscription offering. For more information, please visit www.dimensions.ai/bigquery and a member of our team will be in touch shortly. If you would like to try our data for free, please select "try sample" to see our openly available Covid-19 data.
http://opendatacommons.org/licenses/dbcl/1.0/
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.
To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To make the BigQuery dataset easier for the Kaggle community to access, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.
This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.
Tables:
- history_* tables: full history of OSM objects.
- planet_* tables: snapshot of current OSM objects as of Nov 2019.
The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection and additional processing.
You can read more about OSM elements on the OSM Wiki. This dataset uses BigQuery GEOGRAPHY datatype which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data.
The data is typical of what an ecommerce website would see and includes the following information:
- Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic.
- Content data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
- Transactional data: information about the transactions on the Google Merchandise Store website.
Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated, such as fullVisitorId, or removed, such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software. In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.
We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:
To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.
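The one-year window around the first build can be expressed as a small classification function. This is a sketch under stated assumptions, not the authors' code: "one year" is approximated as 365 days, the labels mirror the `before_ci` and `during_ci` suffixes of the file names in this dataset, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

ONE_YEAR = timedelta(days=365)  # assumption: one year approximated as 365 days


def ci_phase(commit_time: datetime, first_build_time: datetime) -> Optional[str]:
    """Classify a commit relative to the project's first CI build.

    Returns 'before_ci' or 'during_ci' for commits within one year of the
    first build, and None for commits outside the analysis window.
    """
    if first_build_time - ONE_YEAR <= commit_time < first_build_time:
        return "before_ci"
    if first_build_time <= commit_time <= first_build_time + ONE_YEAR:
        return "during_ci"
    return None


if __name__ == "__main__":
    first_build = datetime(2015, 6, 1)  # hypothetical first build date
    print(ci_phase(datetime(2015, 1, 15), first_build))
    print(ci_phase(datetime(2015, 9, 1), first_build))
    print(ci_phase(datetime(2013, 1, 1), first_build))
```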
We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:
In total, 8,046 projects satisfied those constraints. We drew a random sample of 100 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample.
This dataset contains the following files:
tr_projects_sample_filtered_2.csv
A CSV file with information about the 113 selected projects.
tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:
project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
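As an example of working with the commit columns listed above, the following sketch sums commits and changed lines per project using Python's standard csv module. The two sample rows are invented for illustration, and the sketch assumes comma-separated files whose header row uses exactly the column names above; the timestamp format shown is also an assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout of the commit CSV files.
SAMPLE = """project,branch,hash_value,author_name,author_email,author_date,commit_name,commit_email,commit_date,log_message_length,file_count,lines_added,lines_deleted,file_extensions
owner_repo,master,abc123,Alice,alice@example.org,2016-01-01 10:00:00,Alice,alice@example.org,2016-01-01 10:00:00,42,2,10,3,java
owner_repo,master,def456,Bob,bob@example.org,2016-01-02 11:00:00,Bob,bob@example.org,2016-01-02 11:00:00,17,1,5,0,java
"""


def churn_per_project(csv_file) -> dict:
    """Sum commits, lines added and lines deleted for each project."""
    totals = defaultdict(
        lambda: {"commits": 0, "lines_added": 0, "lines_deleted": 0}
    )
    for row in csv.DictReader(csv_file):
        entry = totals[row["project"]]
        entry["commits"] += 1
        entry["lines_added"] += int(row["lines_added"])
        entry["lines_deleted"] += int(row["lines_deleted"])
    return dict(totals)


if __name__ == "__main__":
    print(churn_per_project(io.StringIO(SAMPLE)))
```

To run it on a real file, pass an open file handle (opened with `newline=""`) instead of the in-memory sample.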
tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:
project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch: Source branch of the pull request (extracted from log message).
comparison_project_sample_100.csv
A CSV file with information about the 100 projects in the comparison sample.
commits_default_branch_before_mid.csv
commits_default_branch_after_mid.csv
One CSV file with information about all commits to the default branch before and after the medium date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.
merges_default_branch_before_mid.csv
merges_default_branch_after_mid.csv
One CSV file with information about all merges into the default branch before and after the medium date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
https://choosealicense.com/licenses/unknown/
Dataset Card for notional-python
Dataset Summary
The Notional-python dataset contains Python code files from 100 well-known repositories gathered from the Google BigQuery GitHub Dataset. The dataset was created to test the ability of programming language models. Follow our repo to run the model evaluation using the notional-python dataset.
Languages
Python
Dataset Creation
Curation Rationale
Notional-python was built to provide a dataset for… See the full description on the dataset page: https://huggingface.co/datasets/notional/notional-python.
StackExchange Dataset
Working doc: https://docs.google.com/document/d/1h585bH5sYcQW4pkHzqWyQqA4ape2Bq6o1Cya0TkMOQc/edit?usp=sharing
BigQuery query (see so_bigquery.ipynb):
CREATE TEMP TABLE answers AS
SELECT * FROM `bigquery-public-data.stackoverflow.posts_answers` WHERE LOWER(Body) LIKE '%arxiv%';
CREATE TEMPORARY TABLE questions AS
SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`;
SELECT * FROM answers JOIN questions ON questions.id = answers.parent_id;
NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ag2435/stackexchange.
This dataset contains Hospital General Information from the U.S. Department of Health & Human Services. It is part of the BigQuery COVID-19 public dataset program. The data contains a list of all hospitals that have been registered with Medicare. This list includes addresses, phone numbers, hospital types and quality of care information. The quality of care data is provided for over 4,000 Medicare-certified hospitals, including over 130 Veterans Administration (VA) medical centers, across the country. You can use this data to find hospitals and compare the quality of their care.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.cms_medicare.hospital_general_info.
How do the hospitals in Mountain View, CA compare to the average hospital in the US? With the hospital compare data you can quickly understand how hospitals in one geographic location compare to another location. In this example query we compare Google’s home in Mountain View, California, to the average hospital in the United States. You can also modify the query to learn how the hospitals in your city compare to the US national average.
#standardSQL
SELECT
  MTV_AVG_HOSPITAL_RATING,
  US_AVG_HOSPITAL_RATING
FROM (
  -- Average rating for hospitals in Mountain View, CA
  SELECT
    ROUND(AVG(CAST(hospital_overall_rating AS int64)),2) AS MTV_AVG_HOSPITAL_RATING
  FROM
    `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE
    city = 'MOUNTAIN VIEW'
    AND state = 'CA'
    AND hospital_overall_rating <> 'Not Available') MTV
JOIN (
  -- Average rating across all US hospitals with a rating
  SELECT
    ROUND(AVG(CAST(hospital_overall_rating AS int64)),2) AS US_AVG_HOSPITAL_RATING
  FROM
    `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE
    hospital_overall_rating <> 'Not Available')
ON
  1 = 1
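The core of the query above is an average over ratings stored as strings, skipping the 'Not Available' placeholder. The same logic can be sketched in plain Python, which may be useful for checking results after loading query output in a Kernel. The function name and the sample ratings are hypothetical.

```python
from typing import Iterable, Optional


def average_rating(ratings: Iterable[str]) -> Optional[float]:
    """Average string ratings, ignoring 'Not Available', rounded to 2 places.

    Mirrors ROUND(AVG(CAST(hospital_overall_rating AS int64)), 2) with the
    'Not Available' filter. Note that Python's round() uses banker's
    rounding, so results can differ from BigQuery's ROUND on exact ties.
    """
    numeric = [int(r) for r in ratings if r != "Not Available"]
    if not numeric:
        return None  # AVG over zero rows yields NULL in SQL
    return round(sum(numeric) / len(numeric), 2)


if __name__ == "__main__":
    # Hypothetical ratings for a handful of hospitals.
    print(average_rating(["4", "Not Available", "3", "3"]))
```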
What are the most common diseases treated at hospitals that do well in the category of patient readmissions?
For hospitals that achieved “Above the national average” in the category of patient readmissions, it might be interesting to review the types of diagnoses that are treated at those inpatient facilities. While this query won’t provide the granular detail that went into the readmission calculation, it gives us a quick glimpse into the top diagnosis related groups (DRGs), or classifications of inpatient stays, that are found at those hospitals. By joining the general hospital information to the inpatient charge data, also provided by CMS, you could quickly identify DRGs that may warrant additional research. You can also modify the query to review the top diagnosis related groups for other hospital metrics you might be interested in.
#standardSQL
SELECT
  drg_definition,
  SUM(total_discharges) AS total_discharge_per_drg
FROM
  `bigquery-public-data.cms_medicare.hospital_general_info` gi
INNER JOIN
  `bigquery-public-data.cms_medicare.inpatient_charges_2015` ic
ON
  gi.provider_id = ic.provider_id
WHERE
  readmission_national_comparison = 'Above the national average'
GROUP BY
  drg_definition
ORDER BY
  total_discharge_per_drg DESC
LIMIT
  10;
This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.
About BigQuery
This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
Download Dataset
This public dataset is also hosted in Google Cloud Storage and is free to use. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service is included with the dataset. You can also download the results from a custom query. Signed out users can download the full dataset by using the gcloud CLI, once it is downloaded and installed.
To remove the login requirement, run:
$ gcloud config set auth/disable_credentials True
To download the dataset, run:
$ gcloud storage cp gs://ads-transparency-center/* . -R
This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarize counts of enrollees who are dually eligible for both the Medicare and Medicaid programs, including those in Medicare Savings Programs. “Duals” represent 20 percent of all Medicare beneficiaries, yet they account for 34 percent of all spending by the program, according to the Commonwealth Fund. As a representation of this high-needs, high-cost population, these data offer a view of regions ripe for more intensive care coordination that can address complex social and clinical needs. In addition to the high cost savings opportunity to deliver upstream clinical interventions, this population represents the county-by-county volume of patients who are eligible for both state level (Medicaid) and federal level (Medicare) reimbursements and potential funding streams to address unmet social needs across various programs, waivers, and other projects. The dataset includes eligibility type and enrollment by quarter, at both the state and county level. These data represent monthly snapshots submitted by states to the CMS, which are inherently lower than ever-enrolled counts (which include persons enrolled at any time during a calendar year). For more information on dually eligible beneficiaries
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.sdoh_cms_dual_eligible_enrollment.
In what counties in Michigan has the number of dual-eligible individuals increased the most from 2015 to 2018? Find the counties in Michigan which have experienced the largest increase of dual enrollment households
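WITH
duals_Jan_2015 AS (
  SELECT Public_Total AS duals_2015, County_Name, FIPS
  FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
  WHERE State_Abbr = "MI" AND Date = '2015-12-01'
),
-- Reconstructed: duals_Jan_2018 is referenced below but missing from the source text.
-- It is assumed to mirror the 2015 CTE with a 2018 snapshot date.
duals_Jan_2018 AS (
  SELECT Public_Total AS duals_2018, County_Name, FIPS
  FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
  WHERE State_Abbr = "MI" AND Date = '2018-12-01'
),
duals_increase AS (
  SELECT d18.FIPS, d18.County_Name, d15.duals_2015, d18.duals_2018, (d18.duals_2018 - d15.duals_2015) AS total_duals_diff
  FROM duals_Jan_2018 d18 JOIN duals_Jan_2015 d15 ON d18.FIPS = d15.FIPS
)
SELECT * FROM duals_increase WHERE total_duals_diff IS NOT NULL ORDER BY total_duals_diff DESC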
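The join-and-rank step of the query above can also be sketched in Python, for example to post-process results already fetched into a Kernel. The function name is hypothetical and the county counts below are invented for illustration; only the structure (join on FIPS, subtract, drop nulls, sort descending) follows the query.

```python
from typing import Dict, List, Optional, Tuple


def duals_increase(
    duals_2015: Dict[str, Optional[int]],
    duals_2018: Dict[str, Optional[int]],
) -> List[Tuple[str, int]]:
    """Inner-join two {FIPS: count} mappings and rank counties by increase."""
    rows = [
        (fips, duals_2018[fips] - duals_2015[fips])
        for fips in duals_2018
        # Keep only counties present in both years with non-null counts,
        # matching the JOIN and the IS NOT NULL filter.
        if fips in duals_2015
        and duals_2015[fips] is not None
        and duals_2018[fips] is not None
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical enrollment counts keyed by county FIPS code.
    counts_2015 = {"26001": 100, "26003": 50}
    counts_2018 = {"26001": 130, "26003": 95, "26005": 10}
    print(duals_increase(counts_2015, counts_2018))
```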
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.
We've made available a number of tables:
- history_* tables: full history of OSM objects
- planet_* tables: snapshot of current OSM objects as of Nov 2019
The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection and additional processing.
This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program. OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality.
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
https://choosealicense.com/licenses/other/
GitHub Jupyter Dataset
Dataset Description
This is a parsed and preprocessed version of GitHub-Jupyter Dataset, a dataset extracted from Jupyter Notebooks on BigQuery. We only keep markdown and python cells and convert the markdown to text. Some heuristics are also applied to filter notebooks with little data and very long or very short cells.
Licenses
Each example has the license of its associated repository. There are in total 15 licenses: [ 'mit'… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/github-jupyter-parsed.
This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.
This dataset follows the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). The purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies. See the diagram below from the Observational Health Data Sciences and Informatics (OHDSI) webpage.
[Figure: Why-CDM.png, https://redivis.com/fileUploads/d1a95a4e-074a-44d1-92e5-9adfd2f4068a]
Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHDSI OMOP site.
- For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF
- For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page: https://github.com/OHDSI/ETL-CMS
- For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM and that this dataset works with the 5.2 version.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Using public Ethereum data to look for trading signals has become a trend, but querying it through Google BigQuery is too costly for many users. This free public dataset is created and kept updated so that the public can avoid being overcharged by GCP.
As the cost of installing solar has fallen, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive data in mapping and computing resources to use, helping calculate the best solar plan for you.
How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don’t worry, Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof’s site. Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. More details about how solar viability is determined are available in the methodology.
This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier of 1TB of processing per month. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
In an effort to help combat COVID-19, we created a COVID-19 Public Datasets program to make data more accessible to researchers, data scientists and analysts. The program will host a repository of public datasets that relate to the COVID-19 crisis and make them free to access and analyze. These include datasets from the New York Times, European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, and OpenStreetMap.
Free hosting and queries of COVID datasets
As with all data in the Google Cloud Public Datasets Program, Google pays for storage of datasets in the program. BigQuery also provides free queries over certain COVID-related datasets to support the response to COVID-19. Queries on COVID datasets will not count against the BigQuery sandbox free tier, where you can query up to 1TB free each month.
Limitations and duration
Queries of COVID data are free. If, during your analysis, you join COVID datasets with non-COVID datasets, the bytes processed in the non-COVID datasets will be counted against the free tier, then charged accordingly, to prevent abuse. Queries of COVID datasets will remain free until Sept 15, 2021.
The contents of these datasets are provided to the public strictly for educational and research purposes only. We are not onboarding or managing PHI or PII data as part of the COVID-19 Public Dataset Program. Google has practices & policies in place to ensure that data is handled in accordance with widely recognized patient privacy and data security policies. See the list of all datasets included in the program.
https://creativecommons.org/publicdomain/zero/1.0/
Weather is the state of the atmosphere, describing for example the degree to which it is hot or cold, wet or dry, calm or stormy, clear or cloudy. Source: https://en.wikipedia.org/wiki/Weather
NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. Two GHCN datasets are available in BigQuery, the GHCN-D (daily) and the GHCN-M (monthly). The data included in the GHCN datasets are obtained from more than 20 sources, including some data from every year since 1763.
For a complete description of data variables available in this dataset, see NOAA’s readme.txt: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt
Update Frequency: daily
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:ghcn_d
https://cloud.google.com/bigquery/public-data/noaa-ghcn
Dataset Source: NOAA. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by Max LaRochelle from Unsplash.
Which weather stations are close to a specific location?
What are the daily rainfall amounts at a specific station?
What were the daily min/max temperatures (in Celsius) and rainfall (in mm) for the past 14 days?
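The questions above can be prototyped offline before running any query. The following is a minimal Python sketch, not the dataset's own tooling: it ranks stations by great-circle distance and converts raw GHCN-D readings to physical units. The station rows and the names `STATIONS`, `haversine_km`, `nearest_stations` and `to_physical` are hypothetical, introduced only for illustration. The scaling assumes the convention described in NOAA's readme.txt, where PRCP is stored in tenths of a millimetre and TMAX/TMIN in tenths of a degree Celsius; check the readme before relying on it for other elements.

```python
import math

# Hypothetical sample rows shaped like (station id, name, latitude, longitude).
# These are illustrative values, not rows copied from the dataset.
STATIONS = [
    ("STN_A", "Downtown", 41.88, -87.63),
    ("STN_B", "Airport", 41.98, -87.90),
    ("STN_C", "Far away", 34.05, -118.24),
]

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_stations(lat, lon, stations, limit=5):
    """Return up to `limit` stations, closest first."""
    ranked = sorted(
        stations, key=lambda s: haversine_km(lat, lon, s[2], s[3])
    )
    return ranked[:limit]


# Elements assumed to be stored as integers in tenths of the unit
# (per NOAA's readme.txt): PRCP in mm, TMAX/TMIN in degrees Celsius.
TENTHS_ELEMENTS = {"PRCP": "mm", "TMAX": "degC", "TMIN": "degC"}


def to_physical(element, raw_value):
    """Convert a raw GHCN-D reading to mm or degrees Celsius."""
    if element not in TENTHS_ELEMENTS:
        raise ValueError(f"no conversion defined for element {element!r}")
    return raw_value / 10.0


if __name__ == "__main__":
    for station in nearest_stations(41.90, -87.65, STATIONS, limit=2):
        print(station[0], station[1])
    print(to_physical("PRCP", 127), TENTHS_ELEMENTS["PRCP"])
```

The same ranking and scaling could instead be pushed into the query itself; doing it in Python here keeps the sketch independent of any table or column names.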