Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Labeled datasets are useful in machine learning research.
This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.
Tables: 1) annotations_bbox 2) dict 3) images 4) labels
Update Frequency: Quarterly
Fork this kernel to get started.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images
https://cloud.google.com/bigquery/public-data/openimages
APA-style citation: Google Research (2016). The Open Images dataset [Image URLs and labels]. Available from GitHub: https://github.com/openimages/dataset.
Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.
The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.
Banner Photo by Mattias Diesel from Unsplash.
Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?
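As a hedged illustration of the second question, a query like the following could be run with the BigQuery Python client; the dict table's column names (label_name, label_display_name) are assumptions and should be checked against the live schema.
from google.cloud import bigquery

client = bigquery.Client()

# Which labels have "bus" in their display names? (column names assumed)
query = """
SELECT label_name, label_display_name
FROM `bigquery-public-data.open_images.dict`
WHERE LOWER(label_display_name) LIKE '%bus%'
"""
for row in client.query(query).result():
    print(row.label_name, row.label_display_name)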
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analysis of large BigQuery datasets.
This dataset was made available per GitHub's terms of service. It is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
https://choosealicense.com/licenses/other/
GitHub R repositories dataset
R source files from GitHub.
This dataset has been created using the public GitHub datasets from Google BigQuery.
This is the actual query that was used to export the data:
EXPORT DATA
OPTIONS (
  uri = 'gs://your-bucket/gh-r/*.parquet',
  format = 'PARQUET') as
(
  select
    f.id, f.repo_name, f.path,
    c.content, c.size
  from (
    SELECT distinct
      id, repo_name, path
    FROM bigquery-public-data.github_repos.files
    where ends_with(path…
See the full description on the dataset page: https://huggingface.co/datasets/dfalbel/github-r-repos.
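A minimal sketch for loading the exported files from the Hugging Face Hub; the split name and the column names (repo_name, path) are assumptions based on the export query above.
from datasets import load_dataset

# Load the R source files exported by the query above (split name assumed)
ds = load_dataset("dfalbel/github-r-repos", split="train")
print(ds[0]["repo_name"], ds[0]["path"])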
https://creativecommons.org/publicdomain/zero/1.0/
Querying BigQuery tables You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME].
If you're using Python, you can start with this code:
import pandas as pd
from bq_helper import BigQueryHelper

# Point the helper at the public GitHub repos dataset described above
bq_assistant = BigQueryHelper("bigquery-public-data", "github_repos")
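A hedged usage sketch building on the helper above: list the dataset's tables, estimate a query's scan size, and only run it if it stays under a small budget. The sample_files table is one of the public github_repos tables, but verify it with list_tables() first.
# List the tables available in the github_repos dataset
print(bq_assistant.list_tables())

query = """
SELECT repo_name, COUNT(*) AS n_files
FROM `bigquery-public-data.github_repos.sample_files`
GROUP BY repo_name
ORDER BY n_files DESC
LIMIT 10
"""

# Estimate the scan size (in GB) before running the full query
print(bq_assistant.estimate_query_size(query))

# Run the query only if it scans less than 1 GB
df = bq_assistant.query_to_pandas_safe(query, max_gb_scanned=1)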
https://choosealicense.com/licenses/osl-3.0/
Process to Generate DuckDB Dataset
1. Load Repository Metadata
Read repo_metadata.json from GitHub Public Repository Metadata. Normalize the JSON into three lists: Repositories → general metadata (stars, forks, license, etc.); Languages → repo-language mappings with size; Topics → repo-topic mappings. (A sketch of this step follows the process description.)
Convert lists into Pandas DataFrames: df_repos, df_languages, df_topics.
2. Enhance with BigQuery Data
Create a temporary BigQuery table (repo_list)…
See the full description on the dataset page: https://huggingface.co/datasets/deepgit/github_meta.
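A rough sketch of step 1, assuming repo_metadata.json is a list of repository objects with nested languages and topics arrays; the field names (nameWithOwner, name, size) are assumptions and should be adapted to the actual metadata schema.
import json
import pandas as pd

with open("repo_metadata.json") as f:
    repos = json.load(f)

# Repositories -> general metadata (stars, forks, license, ...)
df_repos = pd.json_normalize(repos)

# Languages -> one row per (repo, language, size); field names are assumptions
df_languages = pd.DataFrame(
    [
        {"repo": r.get("nameWithOwner"), "language": lang.get("name"), "size": lang.get("size")}
        for r in repos
        for lang in r.get("languages", [])
    ]
)

# Topics -> one row per (repo, topic)
df_topics = pd.DataFrame(
    [
        {"repo": r.get("nameWithOwner"), "topic": t}
        for r in repos
        for t in r.get("topics", [])
    ]
)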
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two files.
1) A Python pickle file (github_dataset.zip) that contains GitHub repositories with datasets. Specifically, we used Google's public dataset copy of GitHub and the BigQuery service to build a list of repositories that have a CSV, XLSX, or XLS file (a query of this kind is sketched after this list), and then used the GitHub API to collect information about each repository in this list. The resulting dataset consists of 87936 repositories that contain at least a CSV, XLSX, or XLS file, along with information about their features (e.g. number of open and closed issues and license) from GitHub. This corpus had more than two million data files. We then excluded those files with fewer than ten rows, which was the case for 65537 repositories with a total of 1,467,240 data files.
2) A Python pickle file (processed_dataset.zip) containing the feature information necessary to train a machine learning model to predict reuse on these GitHub datasets.
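The query referenced in 1) lives in the authors' repository linked below; what follows is only a hedged sketch of that kind of query against the public github_repos tables, not the authors' exact query.
from google.cloud import bigquery

client = bigquery.Client()

# List repositories containing at least one CSV, XLSX, or XLS file
query = """
SELECT DISTINCT repo_name
FROM `bigquery-public-data.github_repos.files`
WHERE ENDS_WITH(LOWER(path), '.csv')
   OR ENDS_WITH(LOWER(path), '.xlsx')
   OR ENDS_WITH(LOWER(path), '.xls')
"""
repos_with_data_files = [row.repo_name for row in client.query(query).result()]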
Source code can be found at: https://github.com/laurakoesten/Dataset-Reuse-Indicators
For a full description of the content see:
Koesten, Laura and Vougiouklis, Pavlos and Simperl, Elena and Groth, Paul, Dataset Reuse: Translating Principles to Practice. Available at SSRN: https://ssrn.com/abstract=3589836 or http://dx.doi.org/10.2139/ssrn.3589836
https://creativecommons.org/publicdomain/zero/1.0/
GitHub Issues Data Pulled From BigQuery Public Dataset -- Access this query at https://console.cloud.google.com/bigquery?sq=235037502967:a71a4b32d74442558a2739b581064e5f
This data is pulled with the following SQL query:
SELECT url, title, body
FROM(
SELECT url, title, body
, ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 80, 120) ORDER BY url) as count_body_beg
FROM(
SELECT url, title, body
, ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 40, 80) ORDER BY url) as count_body_beg
FROM(
SELECT url, title, body
, ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 0, 40) ORDER BY url) as count_body_beg
FROM(
SELECT DISTINCT
url
-- replace more than one white-space character in a row with a single space
, REGEXP_REPLACE(title, r"\s{2,}", ' ') as title
, REGEXP_REPLACE(body, r"\s{2,}", ' ') as body
, ROW_NUMBER() OVER (PARTITION BY SUBSTR(title, 0, 22) ORDER BY url) as count_title_beg
-- , RANK() OVER (PARTITION BY SUBSTR(body, 0, 1000) ORDER BY url) as count_body_beg
FROM(
SELECT
JSON_EXTRACT(payload, '$.issue.html_url') as url
-- extract the title and body removing parentheses, brackets, and quotes
, LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
, LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
FROM `githubarchive.day.2021*`
WHERE
-- ALL Of 2021
_TABLE_SUFFIX BETWEEN '0101' and '1231'
and type="IssuesEvent"
-- Only want the issue at a specific point otherwise will have duplicates
and JSON_EXTRACT(payload, '$.action') = "\"opened\""
UNION ALL
SELECT
JSON_EXTRACT(payload, '$.issue.html_url') as url
-- extract the title and body removing parentheses, brackets, and quotes
, LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
, LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
FROM `githubarchive.day.2020*`
WHERE
-- ALL Of 2020
_TABLE_SUFFIX BETWEEN '0101' and '1231'
and type="IssuesEvent"
-- Only want the issue at a specific point otherwise will have duplicates
and JSON_EXTRACT(payload, '$.action') = "\"opened\""
) as tbl
WHERE
-- the body must be at least 6 words long and the title at least 3 words long
-- this is an arbitrary way to filter out empty or sparse issues
ARRAY_LENGTH(SPLIT(body, ' ')) >= 6
and ARRAY_LENGTH(SPLIT(title, ' ')) >= 3
-- filter out issues that have really long titles or bodies
-- (these are outliers, and will slow tokenization down).
and LENGTH(title) <= 400
and LENGTH(body) <= 2000
) tbl2
WHERE count_title_beg = 1
)tbl3
WHERE count_body_beg = 1
)tbl4
WHERE count_body_beg = 1
)tbl5
WHERE count_body_beg = 1
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was generated from GHArchive's Google BigQuery table. It contains a list of every public repo (~380,000,000) committed to from January 2016 up to August 2024, as well as the number of unique contributors and totals of the amounts of various events on those repositories in that time period. This is useless on its own, but represents more than a few hours of effort and roughly $8 worth of cloud processing, so I figured I would save the next person to try this some effort.
Makani developed energy kites, using a wing tethered to a ground station to efficiently harness energy from the wind, generating electricity at utility scale. As the kite flies autonomously in loops, rotors on the wing spin as the wind moves through them, generating electricity that is sent down the tether to the grid. The company was closed in February 2020, but major technical learnings have been made available in the public domain. This data set is part of that public package. The main folder in this bucket is labeled 'merged logs' and contains all telemetry from the kite and base station collected during crosswind flights of the M600 kite between 2016 and 2019. The other buckets contain build files and databases that are used to build and run the Makani flight simulator, which can be accessed at github.com/google/makani. This public dataset is hosted in Google Cloud Storage and available free to use. Use this quick start guide to learn how to access public datasets on Google Cloud Storage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A common question for those new to, and familiar with, computer science and software engineering is what the best and/or most popular programming language is. It is very difficult to give a definitive answer, as there is a seemingly limitless number of metrics that can define the 'best' or 'most popular' programming language.
One such metric that can be used to define a 'popular' programming language is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages used for repositories, PRs, and issues on GitHub can be a good indicator of a language's popularity.
This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.
This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.
Only public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is based only on public repositories, which may not be fully representative of all repositories on GitHub.
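This is not the exact aggregation behind the dataset, but a hedged sketch of how per-language repository counts can be pulled from the public github_repos dataset it was derived from.
from google.cloud import bigquery

client = bigquery.Client()

# Count distinct repositories per language in the public github_repos dataset
query = """
SELECT lang.name AS language, COUNT(DISTINCT repo_name) AS repo_count
FROM `bigquery-public-data.github_repos.languages`,
     UNNEST(language) AS lang
GROUP BY language
ORDER BY repo_count DESC
LIMIT 20
"""
print(client.query(query).to_dataframe())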
https://choosealicense.com/licenses/other/
java-decompiler
This dataset contains Java source files and corresponding decompiled bytecode, suitable for training or evaluating decompilation and code understanding models. The Java files were extracted from public GitHub repositories indexed in Google BigQuery’s GitHub dataset. Files were selected with the following filters:
Only single-class files were retained. Only files importing java.* libraries (i.e., no third-party dependencies). Each file was compilable with minimal…
See the full description on the dataset page: https://huggingface.co/datasets/BradMcDanel/java-decompiler.
This is the full set of images submitted for the Eclipse Megamovie project, a citizen science project to capture images of the Sun’s corona during the August 21, 2017 total solar eclipse. These images were taken by volunteer photographers (as well as the general public) from across the country using consumer camera equipment. The Eclipse Megamovie project was a collaboration between UC Berkeley, Google, the Astronomical Society of the Pacific, and many more.* In addition to the dataset, the code used by the project to create the website and process individual movies can be found on GitHub. For a full description of the data fields, see below. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. *Additional partners: Center for Research on Lifelong STEM Learning, Oregon State University, Eclipse Across America, Foothill College, High Altitude Observatory of the National Center for Atmospheric Research, Ideum, Lick Observatory, Space Sciences Laboratory, University of California, Berkeley, University of Colorado at Boulder, Williams College and the IAU Working Group.
This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.
This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies. See the diagram below from the Observational Health Data Sciences and Informatics (OHDSI) webpage.
[Figure: Why-CDM.png]
Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHSDI OMOP site.
For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.
For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page: https://github.com/OHDSI/ETL-CMS.
For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM and that this dataset works with the 5.2 version.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.
SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.
This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.
The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:
These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.
The Synthetic Patient Data in OMOP Dataset is a synthetic database released by the Centers for Medicare and Medicaid Services (CMS) Medicare Claims Synthetic Public Use Files (SynPUF). It is synthetic data containing 2008-2010 Medicare insurance claims for development and demonstration purposes. It has been converted to the Observational Medical Outcomes Partnership (OMOP) common data model from its original form, CSV, by the open source community, as released on GitHub. Please refer to the CMS Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) User Manual for details regarding how DE-SynPUF was created. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
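A minimal sketch for querying the dataset with the BigQuery Python client, assuming it is exposed as bigquery-public-data.cms_synthetic_patient_data_omop (verify the dataset ID and table names in the BigQuery console before running).
from google.cloud import bigquery

client = bigquery.Client()

# Count synthetic persons by gender concept in the OMOP person table
query = """
SELECT gender_concept_id, COUNT(*) AS n_persons
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.person`
GROUP BY gender_concept_id
"""
print(client.query(query).to_dataframe())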
[!NOTE] Dataset origin: https://www.kaggle.com/datasets/breandan/french-reddit-discussion
LELÚ is a French dialog corpus that contains a rich collection of human-human, spontaneous written conversations, extracted from Reddit’s public dataset available through Google BigQuery. Our corpus is composed of 556,621 conversations with 1,583,083 utterances in total. The code to generate this dataset can be found in our GitHub Repository. The tag attributes can be described as follows: link_id: ID…
See the full description on the dataset page: https://huggingface.co/datasets/FrancophonIA/LELU.
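A minimal sketch for loading the corpus from the Hugging Face Hub; the split name and the exact field names are assumptions based on the description above.
from datasets import load_dataset

# Load the LELÚ conversations (split name assumed)
lelu = load_dataset("FrancophonIA/LELU", split="train")
print(lelu[0])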
https://creativecommons.org/publicdomain/zero/1.0/
Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.
Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.
In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
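A hedged starter query using the BigQuery Python client: daily transaction counts from the transactions table mentioned above.
from google.cloud import bigquery

client = bigquery.Client()

# Daily Bitcoin transaction counts since the start of 2018
query = """
SELECT DATE(block_timestamp) AS day, COUNT(*) AS n_transactions
FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp >= TIMESTAMP('2018-01-01')
GROUP BY day
ORDER BY day
"""
df = client.query(query).to_dataframe()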
Allen Day (Google Cloud Developer Advocate) and Colin Bookman (Google Cloud Customer Engineer) retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk into two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".
Photo by Andre Francois on Unsplash.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]
Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.
It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non Patent Literature (NPL) citations and in the patent text itself.
Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long-hidden information. That is the purpose of this project.
IN PRACTICE
A detailed presentation of the current state of the project is available in our March 2020 presentation.
So far, we have:
parsed and consolidated the 27 million NPL citations classified as bibliographical references.
extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all USPTO patents to date.
The latest version of the dataset is v0.15. It is made up of v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.
Give it a try! The dataset is publicly available on Google Cloud BigQuery.
FEATURES
Open
Comprehensive
Highest standards
https://creativecommons.org/publicdomain/zero/1.0/
Ethereum Classic is an open-source, public, blockchain-based distributed computing platform featuring smart contract (scripting) functionality. It provides a decentralized Turing-complete virtual machine, the Ethereum Virtual Machine (EVM), which can execute scripts using an international network of public nodes. Ethereum Classic and Ethereum have a value token called "ether", which can be transferred between participants, stored in a cryptocurrency wallet and is used to compensate participant nodes for computations performed in the Ethereum Platform.
Ethereum Classic came into existence when some members of the Ethereum community rejected the DAO hard fork on the grounds of "immutability", the principle that the blockchain cannot be changed, and decided to keep using the unforked version of Ethereum. To this day, Ethereum Classic runs the original Ethereum chain.
In this dataset, you will have access to Ethereum Classic (ETC) historical block data along with transactions and traces. You can access the data from BigQuery in your notebook via the bigquery-public-data.crypto_ethereum_classic dataset.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum_classic.[TABLENAME]. Fork this kernel to get started.
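A similar hedged sketch for the Ethereum Classic tables; the field names follow the Blockchain ETL schema, and value is assumed to be denominated in wei.
from google.cloud import bigquery

client = bigquery.Client()

# Monthly ETC transaction counts and total ether transferred (wei -> ETC)
query = """
SELECT DATE_TRUNC(DATE(block_timestamp), MONTH) AS month,
       COUNT(*) AS n_transactions,
       SUM(value) / 1e18 AS etc_transferred
FROM `bigquery-public-data.crypto_ethereum_classic.transactions`
GROUP BY month
ORDER BY month
"""
df = client.query(query).to_dataframe()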
This dataset wouldn't be possible without the help of Allen Day, Evgeny Medvedev and Yaz Khoury. This dataset uses Blockchain ETL. Special thanks to ETC community member @donsyang for the banner image.
One of the main questions we wanted to answer concerned the Gini coefficient of ETC. We also wanted to analyze the DAO smart contract before and after the DAO hack and the resulting hard fork, and to examine the network during the famous 51% attack to see what sort of patterns we could spot about the attacker.
This dataset contains a randomized sample of roughly one quarter of all stories and comments from Hacker News from its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".
Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.
Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.
This dataset was kindly made publicly available by Hacker News under the MIT license.
Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?
Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?
Is the amount of coverage by Hacker News predictive of a startup’s success?
You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data in BigQuery, too: https://cloud.google.com/bigquery/public-data/hacker-news
The BigQuery version of this dataset has roughly four times as many articles.
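As a hedged starting point for the first question (whether a small fraction of users dominates the site), the BigQuery copy can be queried like this; the full table and its by/type fields are assumptions to verify against the hacker_news dataset schema.
from google.cloud import bigquery

client = bigquery.Client()

# Top 25 story submitters by number of stories posted
query = """
SELECT `by` AS author, COUNT(*) AS n_stories
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY author
ORDER BY n_stories DESC
LIMIT 25
"""
print(client.query(query).to_dataframe())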