20 datasets found

BigQuery Sample Tables
kaggle.com
zip
Updated Sep 4, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2018). BigQuery Sample Tables [Dataset]. https://www.kaggle.com/bigquery/samples
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Sep 4, 2018
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.

Content

gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.

github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.

github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.

natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.

shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.

trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.

wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.

Fork this kernel to get started.

Acknowledgements

Data Source: https://cloud.google.com/bigquery/sample-tables

Banner Photo by Mervyn Chan from Unplash.

Inspiration

How many babies were born in New York City on Christmas Day?

How many words are in the play Hamlet?
Google Analytics Sample
kaggle.com
zip
Updated Sep 19, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The citation is currently not available for this dataset.
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Sep 19, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

Content

The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

Fork this kernel to get started.

Acknowledgements

Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

Banner Photo by Edho Pratama from Unsplash.

Inspiration

What is the total number of transactions generated per device browser in July 2017?

The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

What was the average number of product pageviews for users who made a purchase in July 2017?

What was the average number of product pageviews for users who did not make a purchase in July 2017?

What was the average total transactions per user that made a purchase in July 2017?

What is the average amount of money spent per session in July 2017?

What is the sequence of pages viewed?
BigQuery Sample File
kaggle.com
zip
Updated Jun 26, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ro Kar (2019). BigQuery Sample File [Dataset]. https://www.kaggle.com/datasets/rokar91/bigquery-sample-file
Explore at:
zip(6059 bytes)Available download formats
Dataset updated
Jun 26, 2019
Authors
Ro Kar
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Ro Kar

Released under CC0: Public Domain

Contents
Google Analytics Sample
console.cloud.google.com
Updated Jul 15, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data&hl=en_GB (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?hl=en_GB
Explore at:
Dataset updated
Jul 15, 2017
Dataset provided by
Googlehttp://google.com/
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data Learn more about the data The data includes The data is typical of what an ecommerce website would see and includes the following information:Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display trafficContent data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions on the Google Merchandise Store website.Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery
BigQuery sample Data set
kaggle.com
zip
Updated Nov 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ritu Barik (2024). BigQuery sample Data set [Dataset]. https://www.kaggle.com/ritubarik/bigquery-sample-data-set
Explore at:
zip(565 bytes)Available download formats
Dataset updated
Nov 11, 2024
Authors
Ritu Barik
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Ritu Barik

Released under Apache 2.0

Contents
1000 Cannabis Genomes Project
kaggle.com
zip
Updated Feb 26, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). 1000 Cannabis Genomes Project [Dataset]. https://www.kaggle.com/bigquery/genomics-cannabis
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 26, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Cannabis is a genus of flowering plants in the family Cannabaceae.

Source: https://en.wikipedia.org/wiki/Cannabis

Content

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

https://medium.com/google-cloud/dna-sequencing-of-1000-cannabis-strains-publicly-available-in-google-bigquery-a33430d63998

These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.

All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

The following tables are included in the Cannabis Genomes Project dataset:

Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: http://opencannabisproject.org/ Category: Genomics Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Update frequency: As additional data are released to GenBank View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis

Banner Photo by Rick Proctor from Unplash.

Inspiration

Which Cannabis samples are included in the variants table?

Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

How many variants does each sample have at the THC Synthase gene (THCA1) locus?
d
DataForSEO Google Keyword Database, historical and current
datarade.ai
.json, .csv
Updated Mar 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DataForSEO (2023). DataForSEO Google Keyword Database, historical and current [Dataset]. https://datarade.ai/data-products/dataforseo-google-keyword-database-historical-and-current-dataforseo
Explore at:
.json, .csvAvailable download formats
Dataset updated
Mar 14, 2023
Dataset authored and provided by
DataForSEO
Area covered
Canada, Cyprus, Uruguay, Spain, Bangladesh, Bolivia (Plurinational State of), El Salvador, Bahrain, Singapore, Turkey
Description
You can check the fields description in the documentation: current Keyword database: https://docs.dataforseo.com/v3/databases/google/keywords/?bash; Historical Keyword database: https://docs.dataforseo.com/v3/databases/google/history/keywords/?bash. You don’t have to download fresh data dumps in JSON or CSV – we can deliver data straight to your storage or database. We send terrabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Eleasticsearch, and Google Big Query. Let us know if you’d like to get your data to any other storage or database.
BigQuery Sample File
kaggle.com
zip
Updated Sep 28, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Srijan Singh (2019). BigQuery Sample File [Dataset]. https://www.kaggle.com/srijansingh53/bigquery-sample-file
Explore at:
zip(5605 bytes)Available download formats
Dataset updated
Sep 28, 2019
Authors
Srijan Singh
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Dataset

This dataset was created by Srijan Singh

Released under GPL 2

Contents
a
Limite de Bairros
hub.arcgis.com
data.rio
+1more
Updated Apr 16, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prefeitura da Cidade do Rio de Janeiro (2020). Limite de Bairros [Dataset]. https://hub.arcgis.com/maps/PCRJ::limite-de-bairros/about
Explore at:
Dataset updated
Apr 16, 2020
Dataset authored and provided by
Prefeitura da Cidade do Rio de Janeiro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
Base de dados geográfica dos limites de Bairros da Cidade do Rio de Janeiro. Como acessar através do DatalakeBigQuerySELECT * FROM datario.dados_mestres.bairro LIMIT 1000Clique aqui para ir diretamente a essa tabela no BigQuery. Caso não tenha experiência com BigQuery, acesse nosso tutorial para entender como acessar os dados.Pythonimport basedosdados as bd# Para carregar o dado direto no pandasdf = bd.read_sql ( "SELECT * FROM datario.dados_mestres.bairro LIMIT 1000" , billing_project_id = "
GitHub Repo Sample Data
kaggle.com
zip
Updated Dec 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mayur Kr. Garg (2021). GitHub Repo Sample Data [Dataset]. https://www.kaggle.com/mayur7garg/github-repo-sample-data
Explore at:
zip(301265354 bytes)Available download formats
Dataset updated
Dec 28, 2021
Authors
Mayur Kr. Garg
Description
About

This dataset consists of samples of non binary files, their contents and extensions from BigQuery's GitHub public sample repo data.

File info

This dataset consists of two CSV files: - filenames_with_ext.csv - This CSV lists all filenames with extensions from BigQuery's GitHub public sample repo data. Files with no extensions have been excluded. - filecontent_with_top_ext.csv - This CSV has samples of non binary files, their contents and extensions from BigQuery's GitHub public sample repo data with subject to some constraints.

Data extraction

To understand how this data was extracted and what constraints were used, refer to the following notebook: GitHub Repo Data - mayur7garg
cms-medicare
kaggle.com
zip
Updated Apr 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). cms-medicare [Dataset]. https://www.kaggle.com/datasets/bigquery/cms-medicare
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 21, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
Description
Context

This dataset contains Hospital General Information from the U.S. Department of Health & Human Services. This is the BigQuery COVID-19 public dataset. This data contains a list of all hospitals that have been registered with Medicare. This list includes addresses, phone numbers, hospital types and quality of care information. The quality of care data is provided for over 4,000 Medicare-certified hospitals, including over 130 Veterans Administration (VA) medical centers, across the country. You can use this data to find hospitals and compare the quality of their care

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.cms_medicare.hospital_general_info.

Sample Query

How do the hospitals in Mountain View, CA compare to the average hospital in the US? With the hospital compare data you can quickly understand how hospitals in one geographic location compare to another location. In this example query we compare Google’s home in Mountain View, California, to the average hospital in the United States. You can also modify the query to learn how the hospitals in your city compare to the US national average.

“#standardSQL SELECT MTV_AVG_HOSPITAL_RATING, US_AVG_HOSPITAL_RATING FROM ( SELECT ROUND(AVG(CAST(hospital_overall_rating AS int64)),2) AS MTV_AVG_HOSPITAL_RATING FROM bigquery-public-data.cms_medicare.hospital_general_info WHERE city = 'MOUNTAIN VIEW' AND state = 'CA' AND hospital_overall_rating <> 'Not Available') MTV JOIN ( SELECT ROUND(AVG(CAST(hospital_overall_rating AS int64)),2) AS US_AVG_HOSPITAL_RATING FROM bigquery-public-data.cms_medicare.hospital_general_info WHERE hospital_overall_rating <> 'Not Available') ON 1 = 1”

What are the most common diseases treated at hospitals that do well in the category of patient readmissions? For hospitals that achieved “Above the national average” in the category of patient readmissions, it might be interesting to review the types of diagnoses that are treated at those inpatient facilities. While this query won’t provide the granular detail that went into the readmission calculation, it gives us a quick glimpse into the top disease related groups (DRG)
, or classification of inpatient stays that are found at those hospitals. By joining the general hospital information to the inpatient charge data, also provided by CMS, you could quickly identify DRGs that may warrant additional research. You can also modify the query to review the top diagnosis related groups for hospital metrics you might be interested in. “#standardSQL SELECT drg_definition, SUM(total_discharges) total_discharge_per_drg FROM bigquery-public-data.cms_medicare.hospital_general_info gi INNER JOIN bigquery-public-data.cms_medicare.inpatient_charges_2015 ic ON gi.provider_id = ic.provider_id WHERE readmission_national_comparison = 'Above the national average' GROUP BY drg_definition ORDER BY total_discharge_per_drg DESC LIMIT 10;”
OnPoint Weather - Past Weather and Climatology Data Sample
console.cloud.google.com
Updated May 13, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Weather%20Source&hl=zh-tw (2023). OnPoint Weather - Past Weather and Climatology Data Sample [Dataset]. https://console.cloud.google.com/marketplace/product/weathersource-com/weather-past-climatology?hl=zh-tw
Explore at:
Dataset updated
May 13, 2023
Dataset provided by
Googlehttp://google.com/
Description
OnPoint Weather is a global weather dataset for business available for any lat/lon point and geographic area such as ZIP codes. OnPoint Weather provides a continuum of hourly and daily weather from the year 2000 to current time and a forward forecast of 45 days. OnPoint Climatology provides hourly and daily weather statistics which can be used to determine ‘departures from normal’ and to provide climatological guidance of expected weather for any location at any point in time. The OnPoint Climatology provides weather statistics such as means, standard deviations and frequency of occurrence. Weather has a significant impact on businesses and accounts for hundreds of billions in lost revenue annually. OnPoint Weather allows businesses to quantify weather impacts and develop strategies to optimize for weather to improve business performance. Examples of Usage Quantify the impact of weather on sales across diverse locations and times of the year Understand how supply chains are impacted by weather Understand how employee’s attendance and performance are impacted by weather Understand how weather influences foot traffic at malls, stores and restaurants OnPoint Weather is available through Google Cloud Platform’s Commercial Dataset Program and can be easily integrated with other Google Cloud Platform Services to quickly reveal and quantify weather impacts on business. Weather Source provides a full range of support services from answering quick questions to consulting and building custom solutions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery 瞭解詳情
census-bureau-international
kaggle.com
zip
Updated May 6, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). census-bureau-international [Dataset]. https://www.kaggle.com/bigquery/census-bureau-international
Explore at:
zip(0 bytes)Available download formats
Dataset updated
May 6, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
Description
Context

The United States Census Bureau’s international dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the dataset includes midyear population figures broken down by age and gender assignment at birth. Additionally, time-series data is provided for attributes including fertility rates, birth rates, death rates, and migration rates.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.census_bureau_international.

Sample Query 1

What countries have the longest life expectancy? In this query, 2016 census information is retrieved by joining the mortality_life_expectancy and country_names_area tables for countries larger than 25,000 km2. Without the size constraint, Monaco is the top result with an average life expectancy of over 89 years!

standardSQL

SELECT age.country_name, age.life_expectancy, size.country_area FROM ( SELECT country_name, life_expectancy FROM bigquery-public-data.census_bureau_international.mortality_life_expectancy WHERE year = 2016) age INNER JOIN ( SELECT country_name, country_area FROM bigquery-public-data.census_bureau_international.country_names_area where country_area > 25000) size ON age.country_name = size.country_name ORDER BY 2 DESC /* Limit removed for Data Studio Visualization */ LIMIT 10

Sample Query 2

Which countries have the largest proportion of their population under 25? Over 40% of the world’s population is under 25 and greater than 50% of the world’s population is under 30! This query retrieves the countries with the largest proportion of young people by joining the age-specific population table with the midyear (total) population table.

standardSQL

SELECT age.country_name, SUM(age.population) AS under_25, pop.midyear_population AS total, ROUND((SUM(age.population) / pop.midyear_population) * 100,2) AS pct_under_25 FROM ( SELECT country_name, population, country_code FROM bigquery-public-data.census_bureau_international.midyear_population_agespecific WHERE year =2017 AND age < 25) age INNER JOIN ( SELECT midyear_population, country_code FROM bigquery-public-data.census_bureau_international.midyear_population WHERE year = 2017) pop ON age.country_code = pop.country_code GROUP BY 1, 3 ORDER BY 4 DESC /* Remove limit for visualization*/ LIMIT 10

Sample Query 3

The International Census dataset contains growth information in the form of birth rates, death rates, and migration rates. Net migration is the net number of migrants per 1,000 population, an important component of total population and one that often drives the work of the United Nations Refugee Agency. This query joins the growth rate table with the area table to retrieve 2017 data for countries greater than 500 km2.

SELECT growth.country_name, growth.net_migration, CAST(area.country_area AS INT64) AS country_area FROM ( SELECT country_name, net_migration, country_code FROM bigquery-public-data.census_bureau_international.birth_death_growth_rates WHERE year = 2017) growth INNER JOIN ( SELECT country_area, country_code FROM bigquery-public-data.census_bureau_international.country_names_area

Update frequency

Historic (none)

Dataset source

United States Census Bureau

Terms of use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/international-census-data
census-bureau-usa
kaggle.com
zip
Updated May 18, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). census-bureau-usa [Dataset]. https://www.kaggle.com/datasets/bigquery/census-bureau-usa
Explore at:
zip(0 bytes)Available download formats
Dataset updated
May 18, 2020
Dataset authored and provided by
Google BigQuery
Area covered
United States
Description
Context :

The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident of the US. The census occurs every 10 years and is conducted by the United States Census Bureau. Census data is publicly available through the census website, but much of the data is available in summarized data and graphs. The raw data is often difficult to obtain, is typically divided by region, and it must be processed and combined to provide information about the nation as a whole. Update frequency: Historic (none)

Dataset source

United States Census Bureau

Sample Query

SELECT zipcode, population FROM bigquery-public-data.census_bureau_usa.population_by_zip_2010 WHERE gender = '' ORDER BY population DESC LIMIT 10

Terms of use

This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/us-census-data
Hacker News Corpus
kaggle.com
zip
Updated Jun 29, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hacker News (2017). Hacker News Corpus [Dataset]. https://www.kaggle.com/hacker-news/hacker-news-corpus
Explore at:
zip(642956855 bytes)Available download formats
Dataset updated
Jun 29, 2017
Dataset authored and provided by
Hacker News
Description
Context

This dataset contains a randomized sample of roughly one quarter of all stories and comments from Hacker News from its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

Content

Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.

Acknowledgements

This dataset was kindly made publicly available by Hacker News under the MIT license.

Inspiration

Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?

Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?

Is the amount of coverage by Hacker News predictive of a startup’s success?

Use this dataset with BigQuery

You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data in BigQuery, too: https://cloud.google.com/bigquery/public-data/hacker-news

The BigQuery version of this dataset has roughly four times as many articles.
Global Health
kaggle.com
zip
Updated May 18, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). Global Health [Dataset]. https://www.kaggle.com/bigquery/world-bank-health-population
Explore at:
zip(0 bytes)Available download formats
Dataset updated
May 18, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
Description
Context

This dataset combines key health statistics from a variety of sources to provide a look at global health and population trends. It includes information on nutrition, reproductive health, education, immunization, and diseases from over 200 countries

Sample Query

What’s the average age of first marriages for females around the world? This query retrieves the average age of first marriages for females by country. Females are used because there is a larger age spread in first marriages for females

SELECT country_name, ROUND(AVG(value),2) AS average FROM bigquery-public-data.world_bank_health_population.health_nutrition_population WHERE indicator_code = "SP.DYN.SMAM.FE" AND year > 2000 GROUP BY country_name ORDER BY average
BigQuery_Sample_File
kaggle.com
zip
Updated Jun 27, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muskan Goel (2019). BigQuery_Sample_File [Dataset]. https://www.kaggle.com/bt18gcs188/bigquery-sample-file
Explore at:
zip(1375 bytes)Available download formats
Dataset updated
Jun 27, 2019
Authors
Muskan Goel
Description
Dataset

This dataset was created by Muskan Goel

Contents
Chicago Crime
kaggle.com
zip
Updated Nov 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ashkan Ranjbar (2025). Chicago Crime [Dataset]. https://www.kaggle.com/ashkanranjbar/chicago-crime
Explore at:
zip(10641044 bytes)Available download formats
Dataset updated
Nov 19, 2025
Authors
Ashkan Ranjbar
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Area covered
Chicago
Description
This dataset has gained popularity over time and is widely known. While Kaggle courses teach how to use Google BigQuery to extract a sample from it, this notebook provides a HOW-TO guide to access the dataset directly within your own notebook. Instead of uploading the entire dataset here, which is quite large, I offer several alternatives to work with a smaller portion of it. My main focus was to demonstrate various techniques to make the dataset more manageable on your own laptop, ensuring smoother operations. Additionally, I've included some interesting insights on basic descriptive statistics and even a modeling example, which can be further explored based on your preferences. I intend to revisit and refine it in the near future to enhance its rigor. Meanwhile, I welcome any suggestions to improve the notebook!

Here are the columns that I have chosen to include (after carefully eliminating a few others):

Date: This column represents the timestamp of the incident. From this column, I have extracted the Month, Day, and Hour information. We can also add additional time-based columns such as Week and Day of the Week, among others.

Block: This column provides a partially redacted address where the incident occurred, indicating the same block as the actual address.

IUCR: The acronym stands for Illinois Uniform Crime Reporting. This code is directly linked to the Primary Type and Description. You can find more information about it in this link.

Primary Type: This column describes the primary category of the IUCR code mentioned above.

Description: This column provides a secondary description of the IUCR code, serving as a subcategory of the primary description.

Location Description: Here, you can find the description of the location where the incident took place.

Arrest: This column indicates whether an arrest was made in relation to the incident.

Domestic: It shows whether the incident was domestic-related, as defined by the Illinois Domestic Violence Act.

Beat: The beat refers to the smallest police geographic area, with each beat having a dedicated territory. You can find more information about it in this link.

District: This column represents the police district where the incident occurred.

Ward: It refers to the number that labels the City Council district where the incident took place.

Community Areas: This column indicates the community area where the incident occurred. Chicago has a total of 77 community areas.

FBI Code: The crime classification outlined in the FBI's National Incident-Based Reporting System (NIBRS).

X-Coordinate, Y-Coordinate, Latitude, Longitude, Location: These columns provide information about the geographical coordinates of the incident location, including latitude and longitude. The "Location" column contains just the latitude and longitude coordinates.

Year, Updated On: These columns represent the year of the incident and the date on which the dataset was last updated.

Feel free to explore the notebook and provide any suggestions for improvement. Your feedback is highly appreciated!
Customer Activity
kaggle.com
zip
Updated Nov 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NW Analytics (2022). Customer Activity [Dataset]. https://www.kaggle.com/datasets/nwanalytics/customer-activity/code
Explore at:
zip(72684 bytes)Available download formats
Dataset updated
Nov 12, 2022
Authors
NW Analytics
Description
Context

Assume you are a data analyst in an EdTech company. The company’s customer success team works with an objective to help customers get the maximum value from their product by doing deeper dives into the customer's needs, wants and expectations from the product and helping them reach their goals.

The customer success team is aiming to achieve sustainable growth by focusing on retaining the existing users.

Therefore, your team wants to analyze the activity of your existing users and understand their performance, behaviours, and patterns to gain meaningful insights, that help your customer success team take data-informed decisions.

Expected Outcome

Brainstorm and identify the right metrics and frame proper questions for analysis. Your analysis should help your customer success team to understand.

How is the current retention of the users

How are they engaging with the content

How efficiently are their discussions being resolved

In case you identify any outliers in the data set, make a note of them and exclude them from your analysis.

Build the best suitable dashboard presenting your insights.

Your recommendations must be backed by meaningful insights and professional visualizations which will help your customer success team design road maps, strategies, and action items to achieve the goal.

Tools to use:

Google Data Studio (preferred), Tableau, Power Bi or any other visualization tool

You can use BigQuery SQL if you wish, not mandatory

Overview of the Dataset

The dataset contains the basic details of the enrolled users, their learning resource completion percentages, activities on the platform and the structure of learning resources available on the platform

1.**users_basic_details**: Contains basic details of the enrolled users.

2.**day_wise_user_activity**: Contains the details of the day-wise learning activity of the users. - A user shall have one entry for a lesson in a day.

3.**learning_resource_details**: Contains the details of learning resources offered to the enrolled users - Content is stored in a hierarchical structure: Track → Course →Topic → Lesson. A lesson can be a video, practice, exam, etc. - Example: Tech Foundations → Developer Foundations → Topic 1 → lesson 1

4.**feedback_details**: Contains the feedback details/rating given by the user to a particular lesson. - Feedback rating is given on a scale of 1 to 5, 5 being the highest. - A user can give feedback to the same lesson multiple times.

5.**discussion_details**: Contains the details of the discussions created by the user for a particular lesson.

6.**discussion_comment_details**: Contains the details of the comments posted for the discussions created by the user. - Comments may be posted by mentors or users themselves. - The role of mentors is to guide and help the users by resolving the doubts and issues faced by them related to their learning activity. - A discussion can have multiple comments.

Tables Description

users_basic_details:

user_id: unique id of the user [string]

gender: gender of the enrolled user [string]

current_city: city of residence of the user [string]

batch_start_datetime: start datetime of the batch, for which the user is enrolled [datetime]

referral_source: referral channel of the user [string]

highest_qualification: highest qualification (education details) of the enrolled user [string]

day_wise_user_activity:

activity_datetime: date and time of learning of the user [datetime]

user_id: unique id of the user [string]

lesson_id: unique id of the lesson [string]

lesson_type: type of the lesson. It can be "SESSION", "PRACTICE", "EXAM" or "PROJECT" [string]

day_completion_percentage: percent of the lesson completed by the user on a particular day (out of 100%) [float]

The completion percentage is calculated by the formula = learnt duration of a lesson on a day/total duration * 100

overall_completion_percentage: overall completion percentage of the lesson till date by the user (out of 100%) [float]

Example: If a user, who started a lesson on Jan 1, ’22 completes the lesson by learning it in parts (10%, 35%, 37%, 18% each day) on 4 different days, Then

Jan 1, ‘22 - day_completion_percentage - 10%, overall_completion_percentage - 10%

Jan 3, ‘22 - day_completion_percentage - 35%, overall_completion_percentage - 45%

Jan 4, ‘22 - day_completion_percentage - 37%, overall_completion_percentage - 82%

Jan 6, ‘22 - day_completion_percentage - 18%, overall_completion_percentage - 100%

learning_resource_details:

track_id: unique id of the track [string]

track_title: name of the track [string]

course_id: unique id of the course [string]

**`...
Google 2019 Cluster sample
kaggle.com
zip
Updated Feb 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Derrick Mwiti (2022). Google 2019 Cluster sample [Dataset]. https://www.kaggle.com/datasets/derrickmwiti/google-2019-cluster-sample/data
Explore at:
zip(101383815 bytes)Available download formats
Dataset updated
Feb 4, 2022
Authors
Derrick Mwiti
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
From https://research.google/tools/datasets/google-cluster-workload-traces-2019/

his is a trace of the workloads running on eight Google Borg compute clusters for the month of May 2019. The trace describes every job submission, scheduling decision, and resource usage data for the jobs that ran in those clusters.

It builds on the May 2011 trace of one cluster, which has enabled a wide range of research on advancing the state-of-the-art for cluster schedulers and cloud computing, and has been used to generate hundreds of analyses and studies.

Since 2011, machines and software have evolved, workloads have changed, and the importance of workload variance has become even clearer. The new trace allows researchers to explore these changes. The new dataset includes additional data, including:

CPU usage information histograms for each 5 minute period, not just a point sample; information about alloc sets (shared resource reservations used by jobs); and job-parent information for master/worker relationships such as MapReduce jobs. Just like the last trace, these new ones focus on resource requests and usage, and contain no information about end users, their data, or access patterns to storage systems and other services.

The trace data is being made available via Google BigQuery so that sophisticated analyses can be performed without requiring local resources. This site provides access instructions and a detailed description of what the traces contain.

https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Google BigQuery (2018). BigQuery Sample Tables [Dataset]. https://www.kaggle.com/bigquery/samples

BigQuery Sample Tables

Sample Tables for Tutorials and Learning (BigQuery)

Explore at:

zip(0 bytes)Available download formats

Dataset updated

Sep 4, 2018

Dataset provided by

BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/

Authors

Google BigQuery

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.

Content

gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.

Fork this kernel to get started.

Acknowledgements

Data Source: https://cloud.google.com/bigquery/sample-tables

Banner Photo by Mervyn Chan from Unplash.

Inspiration

How many babies were born in New York City on Christmas Day?

How many words are in the play Hamlet?

Clear search

Close search

Google apps

Main menu

BigQuery Sample Tables

Context

Content

Acknowledgements

Inspiration

Google Analytics Sample

Context

Content

Acknowledgements

Inspiration

BigQuery Sample File

Dataset

Contents

Google Analytics Sample

BigQuery sample Data set

Dataset

Contents

1000 Cannabis Genomes Project

Context

Content

Acknowledgements

Inspiration

DataForSEO Google Keyword Database, historical and current

BigQuery Sample File

Dataset

Contents

Limite de Bairros

GitHub Repo Sample Data

About

File info

Data extraction

cms-medicare

Context

Querying BigQuery tables

Sample Query

OnPoint Weather - Past Weather and Climatology Data Sample

census-bureau-international

Context

Querying BigQuery tables

Sample Query 1

standardSQL

Sample Query 2

standardSQL

Sample Query 3

Update frequency

Dataset source

census-bureau-usa

Context :

Dataset source

Sample Query

Terms of use

Hacker News Corpus

Context

Content

Acknowledgements

Inspiration

Use this dataset with BigQuery

Global Health

Context

Sample Query

BigQuery_Sample_File

Dataset

Contents

Chicago Crime

Customer Activity

Context

Expected Outcome

Tools to use:

Overview of the Dataset

Tables Description

Google 2019 Cluster sample

BigQuery Sample Tables

Sample Tables for Tutorials and Learning (BigQuery)

Context

Content

Acknowledgements

Inspiration