License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google-branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:
Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.
Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions that occur on the Google Merchandise Store website.
Fork this kernel to get started.
Banner Photo by Edho Pratama from Unsplash.
What is the total number of transactions generated per device browser in July 2017?
The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?
What was the average number of product pageviews for users who made a purchase in July 2017?
What was the average number of product pageviews for users who did not make a purchase in July 2017?
What was the average number of transactions per user who made a purchase in July 2017?
What is the average amount of money spent per session in July 2017?
What is the sequence of pages viewed?
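The "real bounce rate" definition in the questions above (the percentage of visits with a single pageview) can be illustrated with a minimal Python sketch. The toy sessions and the function name are hypothetical and are not taken from the dataset; against the real data the same aggregation would be written as a query.

```python
from collections import defaultdict


def bounce_rate_by_source(sessions):
    """Return {source: percentage of visits that had exactly one pageview}.

    `sessions` is an iterable of (traffic_source, pageviews) pairs.
    """
    visits = defaultdict(int)
    bounces = defaultdict(int)
    for source, pageviews in sessions:
        visits[source] += 1
        if pageviews == 1:  # a single-pageview visit counts as a bounce
            bounces[source] += 1
    return {source: 100.0 * bounces[source] / visits[source] for source in visits}


# Hypothetical toy data: (traffic source, pageviews in the visit)
toy_sessions = [
    ("google", 1), ("google", 4), ("google", 3), ("google", 2),
    ("(direct)", 1), ("(direct)", 3),
]
rates = bounce_rate_by_source(toy_sessions)
```

Here one of the four "google" visits and one of the two "(direct)" visits have a single pageview, so the two sources get different rates even though each has one bounce.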
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.
gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.
Fork this kernel to get started.
Data Source: https://cloud.google.com/bigquery/sample-tables
Banner Photo by Mervyn Chan from Unsplash.
How many babies were born in New York City on Christmas Day?
How many words are in the play Hamlet?
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data.
The data is typical of what an ecommerce website would see and includes the following information:
Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic.
Content data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions on the Google Merchandise Store website.
Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports, but you cannot complete administrative tasks. Data for some fields is obfuscated, such as fullVisitorId, or removed, such as clientId, adWordsClickInfo, and geoNetwork. “Not available in demo dataset” is returned for STRING values and “null” is returned for INTEGER values when querying fields that contain no data.
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
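When query results are post-processed, the placeholder values described under Limitations are usually easier to handle once they are mapped to a single "missing" marker. The following is a minimal sketch, assuming rows arrive as plain Python dictionaries; the helper names and the example field names are hypothetical, and the comparison is case-insensitive on the assumption that the placeholder's capitalization may vary.

```python
# Placeholder text quoted in the dataset description, compared case-insensitively.
PLACEHOLDER = "not available in demo dataset"


def clean_value(value):
    """Map the demo-dataset placeholder string to None; pass other values through."""
    if isinstance(value, str) and value.strip().lower() == PLACEHOLDER:
        return None
    return value


def clean_row(row):
    """Apply clean_value to every field of a result row (a dict)."""
    return {name: clean_value(value) for name, value in row.items()}


# Hypothetical example row; the field names are illustrative only.
example = clean_row({
    "city": "Not available in demo dataset",  # STRING field with no data
    "visits": None,                           # INTEGER field with no data
    "browser": "Chrome",
})
```

After cleaning, both kinds of empty field read as None, so downstream code only has to check for one value.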
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Cannabis is a genus of flowering plants in the family Cannabaceae.
Source: https://en.wikipedia.org/wiki/Cannabis
In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.
These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.
All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.
The following tables are included in the Cannabis Genomes Project dataset:
Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.
MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.
MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.
MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset of transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.
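Because every table name carries a [BUILD_DATE] suffix, it can help to build the full table reference in one place. The sketch below assumes the suffix is always six digits, as in the _201703 example, and that the dataset is addressed in standard SQL as bigquery-public-data.genomics_cannabis; the helper name is hypothetical.

```python
import re

PROJECT = "bigquery-public-data"
DATASET = "genomics_cannabis"


def cannabis_table(base_name, build_date):
    """Return a backtick-quoted table reference such as `...MNPR01_201703`.

    Assumes build dates are six digits (for example "201703").
    """
    if not re.fullmatch(r"\d{6}", build_date):
        raise ValueError(f"build_date must be six digits, got {build_date!r}")
    return f"`{PROJECT}.{DATASET}.{base_name}_{build_date}`"


variants_table = cannabis_table("MNPR01", "201703")
reference_table = cannabis_table("MNPR01_reference", "201703")
```

The backticks are included so the result can be dropped directly into a FROM clause, since the project name contains hyphens.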
Fork this kernel to get started with this dataset.
Dataset Source: http://opencannabisproject.org/
Category: Genomics
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Update frequency: As additional data are released to GenBank
View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis
View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis
Banner Photo by Rick Proctor from Unsplash.
Which Cannabis samples are included in the variants table?
Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?
How many variants does each sample have at the THC Synthase gene (THCA1) locus?
You can check the field descriptions in the documentation: current Keyword database: https://docs.dataforseo.com/v3/databases/google/keywords/?bash; historical Keyword database: https://docs.dataforseo.com/v3/databases/google/history/keywords/?bash. You don’t have to download fresh data dumps in JSON or CSV; we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.
This dataset contains Hospital General Information from the U.S. Department of Health & Human Services. This is the BigQuery COVID-19 public dataset. The data contains a list of all hospitals that have been registered with Medicare. The list includes addresses, phone numbers, hospital types, and quality of care information. The quality of care data is provided for over 4,000 Medicare-certified hospitals, including over 130 Veterans Administration (VA) medical centers, across the country. You can use this data to find hospitals and compare the quality of their care.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.cms_medicare.hospital_general_info.
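As a rough illustration of querying this table from Python, the sketch below takes an already constructed client object and runs one aggregate query. It assumes the client follows the google-cloud-bigquery interface, where `client.query(sql).result()` yields rows that can be converted with `dict(row)`; the function name and the per-state count are hypothetical examples, and `state` is the only column used.

```python
HOSPITAL_TABLE = "bigquery-public-data.cms_medicare.hospital_general_info"


def count_hospitals_by_state(client, limit=5):
    """Return the states with the most listed hospitals as a list of dicts.

    `client` is expected to behave like google.cloud.bigquery.Client.
    """
    sql = f"""
        SELECT state, COUNT(*) AS hospitals
        FROM `{HOSPITAL_TABLE}`
        GROUP BY state
        ORDER BY hospitals DESC
        LIMIT {int(limit)}
    """
    return [dict(row) for row in client.query(sql).result()]


# Usage (requires the google-cloud-bigquery package and credentials):
#   from google.cloud import bigquery
#   rows = count_hospitals_by_state(bigquery.Client())
```

Passing the client in as a parameter keeps the function easy to try with a stub object before spending any query quota.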
How do the hospitals in Mountain View, CA compare to the average hospital in the US?
With the hospital compare data you can quickly understand how hospitals in one geographic location compare to those in another location. In this example query we compare Google’s home in Mountain View, California, to the average hospital in the United States. You can also modify the query to learn how the hospitals in your city compare to the US national average.
#standardSQL
SELECT
  MTV_AVG_HOSPITAL_RATING,
  US_AVG_HOSPITAL_RATING
FROM (
  SELECT
    ROUND(AVG(CAST(hospital_overall_rating AS INT64)), 2) AS MTV_AVG_HOSPITAL_RATING
  FROM
    `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE
    city = 'MOUNTAIN VIEW'
    AND state = 'CA'
    AND hospital_overall_rating <> 'Not Available') MTV
JOIN (
  SELECT
    ROUND(AVG(CAST(hospital_overall_rating AS INT64)), 2) AS US_AVG_HOSPITAL_RATING
  FROM
    `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE
    hospital_overall_rating <> 'Not Available')
ON
  1 = 1
What are the most common diseases treated at hospitals that do well in the category of patient readmissions?
For hospitals that achieved “Above the national average” in the category of patient readmissions, it might be interesting to review the types of diagnoses that are treated at those inpatient facilities. While this query won’t provide the granular detail that went into the readmission calculation, it gives us a quick glimpse into the top diagnosis related groups (DRGs), or classifications of inpatient stays, that are found at those hospitals. By joining the general hospital information to the inpatient charge data, also provided by CMS, you could quickly identify DRGs that may warrant additional research. You can also modify the query to review the top diagnosis related groups for hospital metrics you might be interested in.
#standardSQL
SELECT
  drg_definition,
  SUM(total_discharges) AS total_discharge_per_drg
FROM
  `bigquery-public-data.cms_medicare.hospital_general_info` gi
INNER JOIN
  `bigquery-public-data.cms_medicare.inpatient_charges_2015` ic
ON
  gi.provider_id = ic.provider_id
WHERE
  readmission_national_comparison = 'Above the national average'
GROUP BY
  drg_definition
ORDER BY
  total_discharge_per_drg DESC
LIMIT
  10;
OnPoint Weather is a global weather dataset for business, available for any lat/lon point and for geographic areas such as ZIP codes. OnPoint Weather provides a continuum of hourly and daily weather from the year 2000 to the current time and a forward forecast of 45 days. OnPoint Climatology provides hourly and daily weather statistics, such as means, standard deviations, and frequency of occurrence, which can be used to determine ‘departures from normal’ and to provide climatological guidance on expected weather for any location at any point in time.
Weather has a significant impact on businesses and accounts for hundreds of billions in lost revenue annually. OnPoint Weather allows businesses to quantify weather impacts and develop strategies to optimize for weather to improve business performance.
Examples of Usage
Quantify the impact of weather on sales across diverse locations and times of the year
Understand how supply chains are impacted by weather
Understand how employee attendance and performance are impacted by weather
Understand how weather influences foot traffic at malls, stores, and restaurants
OnPoint Weather is available through Google Cloud Platform’s Commercial Dataset Program and can be easily integrated with other Google Cloud Platform services to quickly reveal and quantify weather impacts on business. Weather Source provides a full range of support services, from answering quick questions to consulting and building custom solutions.
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
The United States Census Bureau’s international dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the dataset includes midyear population figures broken down by age and gender assignment at birth. Additionally, time-series data is provided for attributes including fertility rates, birth rates, death rates, and migration rates.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.census_bureau_international.
What countries have the longest life expectancy?
In this query, 2016 census information is retrieved by joining the mortality_life_expectancy and country_names_area tables for countries larger than 25,000 km². Without the size constraint, Monaco is the top result with an average life expectancy of over 89 years!
SELECT
  age.country_name,
  age.life_expectancy,
  size.country_area
FROM (
  SELECT
    country_name,
    life_expectancy
  FROM
    `bigquery-public-data.census_bureau_international.mortality_life_expectancy`
  WHERE
    year = 2016) age
INNER JOIN (
  SELECT
    country_name,
    country_area
  FROM
    `bigquery-public-data.census_bureau_international.country_names_area`
  WHERE
    country_area > 25000) size
ON
  age.country_name = size.country_name
ORDER BY
  2 DESC
/* Remove the LIMIT for Data Studio visualization */
LIMIT
  10
Which countries have the largest proportion of their population under 25?
Over 40% of the world’s population is under 25 and more than 50% of the world’s population is under 30! This query retrieves the countries with the largest proportion of young people by joining the age-specific population table with the midyear (total) population table.
SELECT
  age.country_name,
  SUM(age.population) AS under_25,
  pop.midyear_population AS total,
  ROUND((SUM(age.population) / pop.midyear_population) * 100, 2) AS pct_under_25
FROM (
  SELECT
    country_name,
    population,
    country_code
  FROM
    `bigquery-public-data.census_bureau_international.midyear_population_agespecific`
  WHERE
    year = 2017
    AND age < 25) age
INNER JOIN (
  SELECT
    midyear_population,
    country_code
  FROM
    `bigquery-public-data.census_bureau_international.midyear_population`
  WHERE
    year = 2017) pop
ON
  age.country_code = pop.country_code
GROUP BY
  1,
  3
ORDER BY
  4 DESC
/* Remove the LIMIT for visualization */
LIMIT
  10
The International Census dataset contains growth information in the form of birth rates, death rates, and migration rates. Net migration is the net number of migrants per 1,000 population, an important component of total population and one that often drives the work of the United Nations Refugee Agency. This query joins the growth rate table with the area table to retrieve 2017 data for countries larger than 500 km².
SELECT
  growth.country_name,
  growth.net_migration,
  CAST(area.country_area AS INT64) AS country_area
FROM (
  SELECT
    country_name,
    net_migration,
    country_code
  FROM
    `bigquery-public-data.census_bureau_international.birth_death_growth_rates`
  WHERE
    year = 2017) growth
INNER JOIN (
  SELECT
    country_area,
    country_code
  FROM
    `bigquery-public-data.census_bureau_international.country_names_area`
  WHERE
    country_area > 500) area
ON
  growth.country_code = area.country_code
Update frequency: Historic (none)
Dataset Source: United States Census Bureau
Terms of use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/international-census-data
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
From https://research.google/tools/datasets/google-cluster-workload-traces-2019/
This is a trace of the workloads running on eight Google Borg compute clusters for the month of May 2019. The trace describes every job submission, scheduling decision, and resource usage data for the jobs that ran in those clusters.
It builds on the May 2011 trace of one cluster, which has enabled a wide range of research on advancing the state-of-the-art for cluster schedulers and cloud computing, and has been used to generate hundreds of analyses and studies.
Since 2011, machines and software have evolved, workloads have changed, and the importance of workload variance has become even clearer. The new trace allows researchers to explore these changes. The new dataset includes additional data, including:
CPU usage information histograms for each 5 minute period, not just a point sample;
information about alloc sets (shared resource reservations used by jobs); and
job-parent information for master/worker relationships such as MapReduce jobs.
Just like the last trace, these new ones focus on resource requests and usage, and contain no information about end users, their data, or access patterns to storage systems and other services.
The trace data is being made available via Google BigQuery so that sophisticated analyses can be performed without requiring local resources. This site provides access instructions and a detailed description of what the traces contain.
https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view
This dataset contains a randomized sample of roughly one quarter of all stories and comments from Hacker News from its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".
Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.
Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.
This dataset was kindly made publicly available by Hacker News under the MIT license.
Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?
Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?
Is the amount of coverage by Hacker News predictive of a startup’s success?
You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data in BigQuery, too: https://cloud.google.com/bigquery/public-data/hacker-news
The BigQuery version of this dataset has roughly four times as many articles.