47 datasets found
  1. BigQuery Sample Tables

    • kaggle.com
    zip
    Updated Sep 4, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google BigQuery (2018). BigQuery Sample Tables [Dataset]. https://www.kaggle.com/bigquery/samples
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Sep 4, 2018
    Dataset provided by
    BigQueryhttps://cloud.google.com/bigquery
    Googlehttp://google.com/
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.

    Content

    • gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.

    • github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.

    • github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.

    • natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.

    • shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.

    • trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.

    • wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.

    Fork this kernel to get started.

    Acknowledgements

    Data Source: https://cloud.google.com/bigquery/sample-tables

    Banner Photo by Mervyn Chan from Unplash.

    Inspiration

    How many babies were born in New York City on Christmas Day?

    How many words are in the play Hamlet?

  2. theLook eCommerce

    • console.cloud.google.com
    Updated Nov 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&inv=1&invt=Ab2Y8Q (2022). theLook eCommerce [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce
    Explore at:
    Dataset updated
    Nov 28, 2022
    Dataset provided by
    BigQueryhttps://cloud.google.com/bigquery
    Googlehttp://google.com/
    Description

    TheLook is a fictitious eCommerce clothing site developed by the Looker team. The dataset contains information about customers, products, orders, logistics, web events and digital marketing campaigns. The contents of this dataset are synthetic, and are provided to industry practitioners for the purpose of product discovery, testing, and evaluation. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.What is BigQuery .

  3. Google Analytics Sample

    • kaggle.com
    zip
    Updated Sep 19, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/bigquery/google-analytics-sample
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Sep 19, 2019
    Dataset provided by
    BigQueryhttps://cloud.google.com/bigquery
    Googlehttp://google.com/
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

    Content

    The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

    Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

    Fork this kernel to get started.

    Acknowledgements

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

    Banner Photo by Edho Pratama from Unsplash.

    Inspiration

    What is the total number of transactions generated per device browser in July 2017?

    The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

    What was the average number of product pageviews for users who made a purchase in July 2017?

    What was the average number of product pageviews for users who did not make a purchase in July 2017?

    What was the average total transactions per user that made a purchase in July 2017?

    What is the average amount of money spent per session in July 2017?

    What is the sequence of pages viewed?

  4. Google Analytics Sample

    • console.cloud.google.com
    Updated Jul 15, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data&hl=de&inv=1&invt=Ab2fng (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?hl=de
    Explore at:
    Dataset updated
    Jul 15, 2017
    Dataset provided by
    Googlehttp://google.com/
    Description

    The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data Learn more about the data The data includes The data is typical of what an ecommerce website would see and includes the following information:Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display trafficContent data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions on the Google Merchandise Store website.Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery

  5. 1000 Cannabis Genomes Project

    • kaggle.com
    zip
    Updated Feb 26, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google BigQuery (2019). 1000 Cannabis Genomes Project [Dataset]. https://www.kaggle.com/bigquery/genomics-cannabis
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Feb 26, 2019
    Dataset provided by
    Googlehttp://google.com/
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Cannabis is a genus of flowering plants in the family Cannabaceae.

    Source: https://en.wikipedia.org/wiki/Cannabis

    Content

    In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

    https://medium.com/google-cloud/dna-sequencing-of-1000-cannabis-strains-publicly-available-in-google-bigquery-a33430d63998

    These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.

    All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

    The following tables are included in the Cannabis Genomes Project dataset:

    Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

    MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

    MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

    MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    Dataset Source: http://opencannabisproject.org/ Category: Genomics Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Update frequency: As additional data are released to GenBank View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis

    Banner Photo by Rick Proctor from Unplash.

    Inspiration

    Which Cannabis samples are included in the variants table?

    Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

    How many variants does each sample have at the THC Synthase gene (THCA1) locus?

  6. Data from: Hacker News

    • console.cloud.google.com
    Updated Jul 21, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Y%20Combinator&inv=1&invt=Ab2YeQ (2018). Hacker News [Dataset]. https://console.cloud.google.com/marketplace/product/y-combinator/hacker-news
    Explore at:
    Dataset updated
    Jul 21, 2018
    Dataset provided by
    Googlehttp://google.com/
    Description

    This dataset contains all stories and comments from Hacker News from its launch in 2006 to present. Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

  7. Google Trends - International

    • console.cloud.google.com
    Updated Jul 22, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program&inv=1&invt=Ab2hhQ (2018). Google Trends - International [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/google-trends-intl
    Explore at:
    Dataset updated
    Jul 22, 2018
    Dataset provided by
    Google Searchhttp://google.com/
    BigQueryhttps://cloud.google.com/bigquery
    Googlehttp://google.com/
    Description

    The International Google Trends dataset will provide critical signals that individual users and businesses alike can leverage to make better data-driven decisions. This dataset simplifies the manual interaction with the existing Google Trends UI by automating and exposing anonymized, aggregated, and indexed search data in BigQuery. This dataset includes the Top 25 stories and Top 25 Rising queries from Google Trends. It will be made available as two separate BigQuery tables, with a set of new top terms appended daily. Each set of Top 25 and Top 25 rising expires after 30 days, and will be accompanied by a rolling five-year window of historical data for each country and region across the globe, where data is available. This Google dataset is hosted in Google BigQuery as part of Google Cloud's Datasets solution and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery

  8. Google Analytics 4 sample data

    • kaggle.com
    Updated Sep 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    preeti deshmukh (2023). Google Analytics 4 sample data [Dataset]. https://www.kaggle.com/datasets/pdaasha/ga4-obfuscated-sample-ecommerce-jan2021/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 16, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    preeti deshmukh
    Description

    Context Google Analytics 4 is Google analytics latest service that enables you to measure traffic and engagement across your website as well as app.

    Content This is a sample dataset of GA4 for Google Merchendise store for the month of Jan 2021.

    Acknowledgements This information is exported from the google bigquery public data set.

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:ga4_obfuscated_sample_ecommerce.events_*

    Inspiration To study google analytics it is really very difficult to get sample data so here's making it easy for some in need of one.

  9. OpenStreetMap Public Dataset

    • console.cloud.google.com
    Updated Jan 16, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap&inv=1&invt=Ab3rZg (2020). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/details/openstreetmap/geo-openstreetmap
    Explore at:
    Dataset updated
    Jan 16, 2020
    Dataset provided by
    OpenStreetMap//www.openstreetmap.org/
    Googlehttp://google.com/
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and more than two million registered users who can add data by manual survey, GPS devices, aerial photography, and other free sources. We've made available a number of tables (explained in detail below): history_* tables: full history of OSM objects planet_* tables: snapshot of current OSM objects as of Nov 2019 The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions to perform geometry and feature selection, additional processing. Example analyses are given below. This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program . OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality. Interested in learning more about how these data were brought into BigQuery and how you can use them? Check out the sample queries below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

  10. American Community Survey (ACS)

    • console.cloud.google.com
    Updated Jul 16, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:United%20States%20Census%20Bureau&inv=1&invt=Abyneg (2018). American Community Survey (ACS) [Dataset]. https://console.cloud.google.com/marketplace/product/united-states-census-bureau/acs
    Explore at:
    Dataset updated
    Jul 16, 2018
    Dataset provided by
    Googlehttp://google.com/
    Description

    The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people by contacting over 3.5 million households across the country. The resulting data provides incredibly detailed demographic information across the US aggregated at various geographic levels which helps determine how more than $675 billion in federal and state funding are distributed each year. Businesses use ACS data to inform strategic decision-making. ACS data can be used as a component of market research, provide information about concentrations of potential employees with a specific education or occupation, and which communities could be good places to build offices or facilities. For example, someone scouting a new location for an assisted-living center might look for an area with a large proportion of seniors and a large proportion of people employed in nursing occupations. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics. Public officials, planners, and entrepreneurs use this information to assess the past and plan the future. For more information, see the Census Bureau's ACS Information Guide . This public dataset is hosted in Google BigQuery as part of the Google Cloud Public Datasets Program , with Carto providing cleaning and onboarding support. It is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

  11. USA Name Data

    • kaggle.com
    Updated Feb 12, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datasets/datagov/usa-names
    Explore at:
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Data.govhttps://data.gov/
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

    Content

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

    All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

    https://cloud.google.com/bigquery/public-data/usa-names

    Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @dcp from Unplash.

    Inspiration

    What are the most common names?

    What are the most common female names?

    Are there more female or male names?

    Female names by a wide margin?

  12. Multilingual Spoken Words Corpus - MLCommons Association

    • console.cloud.google.com
    Updated Jul 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&inv=1&invt=Ab3iWA (2022). Multilingual Spoken Words Corpus - MLCommons Association [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/mswc
    Explore at:
    Dataset updated
    Jul 10, 2022
    Dataset provided by
    BigQueryhttps://cloud.google.com/bigquery
    Googlehttp://google.com/
    Description

    The Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. It was generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset. Please see the paper for a detailed analysis of the contents of the data and methods for detecting potential outliers, along with baseline accuracy metrics on keyword spotting models trained from the dataset compared to models trained on a manually-recorded keyword dataset. The dataset was released by the MLCommons Association; latest information at mlcommons.org/words. This public dataset is hosted in Google Cloud Storage and available free to use. Use this quick start guide to learn how to access public datasets on Google Cloud Storage.

  13. Daily Global Historical Climatology Network

    • kaggle.com
    zip
    Updated Aug 30, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    NOAA (2019). Daily Global Historical Climatology Network [Dataset]. https://www.kaggle.com/noaa/ghcn-d
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Aug 30, 2019
    Dataset provided by
    National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
    Authors
    NOAA
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Weather is the state of the atmosphere, describing for example the degree to which it is hot or cold, wet or dry, calm or stormy, clear or cloudy. Source: https://en.wikipedia.org/wiki/Weather

    Content

    NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. Two GHCN datasets are available in BigQuery, the GHCN-D (daily) and the GHCN-M (monthly). The data included in the GHCN datasets are obtained from more than 20 sources, including some data from every year since 1763.

    For a complete description of data variables available in this dataset, see NOAA’s readme.txt: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt

    Update Frequency: daily

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:ghcn_d

    https://cloud.google.com/bigquery/public-data/noaa-ghcn

    Dataset Source: NOAA. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by Max LaRochelle from Unplash.

    Inspiration

    Find weather stations close to a specific location?

    Daily rainfall amounts at specific station?

    Pulling daily min/max temperature (in Celsius) and rainfall (in mm) for the past 14 days?

  14. Influence of Continuous Integration on the Development Activity in GitHub...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    1. were active for one year before the first build with Travis CI (before_ci),
    2. used Travis CI at least for one year (during_ci),
    3. had commit or merge activity on the default branch in both of these phases, and
    4. used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

    The dataset contains the following files:

    tr_projects_sample_filtered.csv
    A CSV file with information about the 321 selected projects.

    tr_sample_commits_default_branch_before_ci.csv
    tr_sample_commits_default_branch_during_ci.csv

    One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The branch to which the commit was made.
    hash_value: The SHA1 hash value of the commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.

    tr_sample_merges_default_branch_before_ci.csv
    tr_sample_merges_default_branch_during_ci.csv

    One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The destination branch of the merge.
    hash_value: The SHA1 hash value of the merge commit.
    merged_commits: Unique hash value prefixes of the commits merged with this commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
    pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    source_branch : Source branch of the pull request (extracted from log message).

  15. USA Names

    • console.cloud.google.com
    Updated Aug 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:U.S.%20Social%20Security%20Administration&hl=pt-BR&inv=1&invt=Ab4Asw (2023). USA Names [Dataset]. https://console.cloud.google.com/marketplace/product/social-security-administration/us-names?hl=pt-BR
    Explore at:
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Googlehttp://google.com/
    Area covered
    United States
    Description

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data. All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

  16. Ethereum Classic Cryptocurrency Dataset - Dataset - CryptoData Hub

    • cryptodata.center
    Updated Dec 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    cryptodata.center (2024). Ethereum Classic Cryptocurrency Dataset - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/ethereum-classic-cryptocurrency-dataset
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    CryptoDATA
    Description

    Ethereum Classic is a cryptocurrency with shared history with the Ethereum cryptocurrency. On technical merits, the two cryptocurrencies are nearly identical, differing only in programming language features supported by the Ethereum Virtual machine which is used to write smart contracts. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system. Interested in learning more about how Cloud Public Data is working to make data from blockchains and cryptocurrencies more accessible? Check out our blog post on the Google Cloud Big Data Blog and try the sample query below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

  17. Medicare and Medicaid Services

    • kaggle.com
    zip
    Updated Apr 22, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google BigQuery (2020). Medicare and Medicaid Services [Dataset]. https://www.kaggle.com/datasets/bigquery/sdoh-hrsa-shortage-areas
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    Googlehttp://google.com/
    Authors
    Google BigQuery
    Description

    Context

    This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarize counts of enrollees who are dually-eligible for both Medicare and Medicaid program, including those in Medicare Savings Programs. “Duals” represent 20 percent of all Medicare beneficiaries, yet they account for 34 percent of all spending by the program, according to the Commonwealth Fund . As a representation of this high-needs, high-cost population, these data offer a view of regions ripe for more intensive care coordination that can address complex social and clinical needs. In addition to the high cost savings opportunity to deliver upstream clinical interventions, this population represents the county-by-county volume of patients who are eligible for both state level (Medicaid) and federal level (Medicare) reimbursements and potential funding streams to address unmet social needs across various programs, waivers, and other projects. The dataset includes eligibility type and enrollment by quarter, at both the state and county level. These data represent monthly snapshots submitted by states to the CMS, which are inherently lower than ever-enrolled counts (which include persons enrolled at any time during a calendar year.) For more information on dually eligible beneficiaries

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.sdoh_cms_dual_eligible_enrollment.

    Sample Query

    In what counties in Michigan has the number of dual-eligible individuals increased the most from 2015 to 2018? Find the counties in Michigan which have experienced the largest increase of dual enrollment households

    duals_Jan_2015 AS ( SELECT Public_Total AS duals_2015, County_Name, FIPS FROM bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program WHERE State_Abbr = "MI" AND Date = '2015-12-01' ),

    duals_increase AS ( SELECT d18.FIPS, d18.County_Name, d15.duals_2015, d18.duals_2018, (d18.duals_2018 - d15.duals_2015) AS total_duals_diff FROM duals_Jan_2018 d18 JOIN duals_Jan_2015 d15 ON d18.FIPS = d15.FIPS )

    SELECT * FROM duals_increase WHERE total_duals_diff IS NOT NULL ORDER BY total_duals_diff DESC

  18. Chicago Taxi Trips

    • kaggle.com
    • data.cityofchicago.org
    zip
    Updated Apr 18, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    City of Chicago (2018). Chicago Taxi Trips [Dataset]. https://www.kaggle.com/datasets/chicago/chicago-taxi-trips-bq
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Apr 18, 2018
    Dataset authored and provided by
    City of Chicago
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Context

    Taxicabs in Chicago, Illinois, are operated by private companies and licensed by the city. There are about seven thousand licensed cabs operating within the city limits. Licenses are obtained through the purchase or lease of a taxi medallion which is then affixed to the top right hood of the car. Source: https://en.wikipedia.org/wiki/Taxicabs_of_the_United_States#Chicago

    Content

    This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. See http://digital.cityofchicago.org/index.php/chicago-taxi-data-released for more information about this dataset and how it was created.

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips

    https://cloud.google.com/bigquery/public-data/chicago-taxi

    Dataset Source: City of Chicago

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source —https://data.cityofchicago.org — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by Ferdinand Stohr from Unplash.

    Inspiration

    What are the maximum, minimum and average fares for rides lasting 10 minutes or more? Which drop-off areas have the highest average tip? How does trip duration affect fare rates for trips lasting less than 90 minutes?

    https://cloud.google.com/bigquery/images/chicago-taxi-fares-by-duration.png" alt=""> https://cloud.google.com/bigquery/images/chicago-taxi-fares-by-duration.png

  19. g

    Iowa Liquor Retail Sales

    • console.cloud.google.com
    Updated Jul 31, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Iowa Department of Commerce (2020). Iowa Liquor Retail Sales [Dataset]. https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales
    Explore at:
    Dataset updated
    Jul 31, 2020
    Dataset authored and provided by
    Iowa Department of Commerce
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Iowa
    Description

    This dataset contains every wholesale purchase of liquor in the State of Iowa by retailers for sale to individuals since January 1, 2012. The State of Iowa controls the wholesale distribution of liquor intended for retail sale, which means this dataset offers a complete view of retail liquor sales in the entire state. The dataset contains every wholesale order of liquor by all grocery stores, liquor stores, convenience stores, etc., with details about the store and location, the exact liquor brand and size, and the number of bottles ordered. In addition to being an excellent dataset for analyzing liquor sales, this is a large and clean public dataset of retail sales data. It can be used to explore problems like stockout prediction, retail demand forecasting, and other retail supply chain problems. The data dictionary is available from the State of Iowa's Alcoholic Beverages Division , within the Iowa Department of Commerce . There are some minor discrepancies in the data, discussed in the web view of the data . This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery.

  20. r

    gdelt_knowledge_graph_2020_sample

    • redivis.com
    Updated Aug 4, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Columbia Data Platform Demo (2021). gdelt_knowledge_graph_2020_sample [Dataset]. https://redivis.com/datasets/9nnp-dwj34k01c
    Explore at:
    Dataset updated
    Aug 4, 2021
    Dataset authored and provided by
    Columbia Data Platform Demo
    Description

    1 million rows from the gkg_partitioned table in the gdelt-bq.gdeltv2 dataset on big query. Only the year 2020 was queried.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Google BigQuery (2018). BigQuery Sample Tables [Dataset]. https://www.kaggle.com/bigquery/samples
Organization logoOrganization logo

BigQuery Sample Tables

Sample Tables for Tutorials and Learning (BigQuery)

Explore at:
zip(0 bytes)Available download formats
Dataset updated
Sep 4, 2018
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.

Content

  • gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.

  • github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.

  • github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.

  • natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.

  • shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.

  • trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.

  • wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.

Fork this kernel to get started.

Acknowledgements

Data Source: https://cloud.google.com/bigquery/sample-tables

Banner Photo by Mervyn Chan from Unplash.

Inspiration

How many babies were born in New York City on Christmas Day?

How many words are in the play Hamlet?

Search
Clear search
Close search
Google apps
Main menu