28 datasets found
  1. Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/bigquery/open-images
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: annotations_bbox, dict, images, labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.
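    For instance, here is a minimal sketch using the BigQuery Python client library, assuming the dict table maps machine label codes (label_name) to human-readable names (label_display_name); it answers one of the Inspiration questions below:

    # Find labels whose display names contain "bus" (schema assumed as above).
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT label_name, label_display_name
        FROM `bigquery-public-data.open_images.dict`
        WHERE LOWER(label_display_name) LIKE '%bus%'
        LIMIT 20
    """
    for row in client.query(query).result():
        print(row.label_name, row.label_display_name)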

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    • Which labels are in the dataset?
    • Which labels have "bus" in their display names?
    • How many images of a trolleybus are in the dataset?
    • What are some landing pages of images with a trolleybus?
    • Which images with cherries are in the training set?

  2. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analysis of large BigQuery datasets.
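    As an illustration, a hedged sketch that ranks languages by total bytes of code, relying on the documented shape of the languages table (repo_name plus a repeated (name, bytes) record):

    # Top languages across all mirrored repositories by bytes of code.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT lang.name AS language, SUM(lang.bytes) AS total_bytes
        FROM `bigquery-public-data.github_repos.languages`,
             UNNEST(language) AS lang
        GROUP BY language
        ORDER BY total_bytes DESC
        LIMIT 10
    """
    print(client.query(query).to_dataframe())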

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  3. github-r-repos

    • huggingface.co
    Updated Jun 6, 2023
    Cite
    Daniel Falbel (2023). github-r-repos [Dataset]. https://huggingface.co/datasets/dfalbel/github-r-repos
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2023
    Authors
    Daniel Falbel
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub R repositories dataset

    R source files from GitHub. This dataset has been created using the public GitHub datasets from Google BigQuery. This is the actual query that has been used to export the data:

    EXPORT DATA OPTIONS (
        uri = 'gs://your-bucket/gh-r/*.parquet',
        format = 'PARQUET') as (
      select f.id, f.repo_name, f.path, c.content, c.size
      from (
        SELECT distinct id, repo_name, path
        FROM bigquery-public-data.github_repos.files
        where ends_with(path… See the full description on the dataset page: https://huggingface.co/datasets/dfalbel/github-r-repos.
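    Since the data is published on the Hugging Face Hub, here is a minimal sketch of loading it with the datasets library (the "train" split name is an assumption):

    from datasets import load_dataset

    # Columns follow the export query above: id, repo_name, path, content, size.
    ds = load_dataset("dfalbel/github-r-repos", split="train")
    print(ds)
    print(ds[0]["repo_name"], ds[0]["path"])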

  4. BigQuery GIS Utility Datasets (U.S.)

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Google BigQuery (2019). BigQuery GIS Utility Datasets (U.S.) [Dataset]. https://www.kaggle.com/bigquery/utility-us
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.utility_us.[TABLENAME].

    • Project: "bigquery-public-data"
    • Dataset: "utility_us"

    Fork this kernel to get started and learn how to safely manage analysis of large BigQuery datasets.

    If you're using Python, you can start with this code:

    import pandas as pd  # bq_helper returns query results as pandas DataFrames
    from bq_helper import BigQueryHelper

    # Helper object bound to the bigquery-public-data project, utility_us dataset
    bq_assistant = BigQueryHelper("bigquery-public-data", "utility_us")
    bq_assistant.list_tables()  # see which tables the dataset contains
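    From there, you can estimate a query's cost and run it with a cap on the bytes scanned; the table name zipcode_area below is illustrative, so check list_tables() first:

    query = """
        SELECT *
        FROM `bigquery-public-data.utility_us.zipcode_area`
        LIMIT 10
    """
    # Estimated GB the query would scan, then a scan-capped execution.
    print(bq_assistant.estimate_query_size(query))
    df = bq_assistant.query_to_pandas_safe(query, max_gb_scanned=1)
    print(df.head())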
    
  5. github_meta

    • huggingface.co
    Updated Aug 9, 2024
    Cite
    DeepGit (2024). github_meta [Dataset]. https://huggingface.co/datasets/deepgit/github_meta
    Explore at:
    Dataset updated
    Aug 9, 2024
    Dataset authored and provided by
    DeepGit
    License

    https://choosealicense.com/licenses/osl-3.0/

    Description

    Process to Generate DuckDB Dataset

      1. Load Repository Metadata

    Read repo_metadata.json from GitHub Public Repository Metadata. Normalize the JSON into three lists:

    • Repositories → general metadata (stars, forks, license, etc.)
    • Languages → repo-language mappings with size
    • Topics → repo-topic mappings

    Convert the lists into Pandas DataFrames: df_repos, df_languages, df_topics.
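    A hedged sketch of this normalization step, assuming repo_metadata.json holds a list of repo objects with nested languages and topics fields (the field names here are illustrative, not the actual DeepGit schema):

    import json
    import pandas as pd

    with open("repo_metadata.json") as f:
        repos = json.load(f)

    # One row per repository: stars, forks, license, and other general metadata.
    df_repos = pd.json_normalize(repos)

    # One row per (repo, language) pair, carrying the language's size in bytes.
    df_languages = pd.DataFrame(
        [{"repo": r["name"], "language": lang["name"], "size": lang["size"]}
         for r in repos for lang in r.get("languages", [])]
    )

    # One row per (repo, topic) pair.
    df_topics = pd.DataFrame(
        [{"repo": r["name"], "topic": t}
         for r in repos for t in r.get("topics", [])]
    )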

      2. Enhance with BigQuery Data

    Create a temporary BigQuery table (repo_list)… See the full description on the dataset page: https://huggingface.co/datasets/deepgit/github_meta.

  6. Dataset Reuse Indicators Datasets

    • data.niaid.nih.gov
    Updated Sep 6, 2020
    Cite
    Koesten, Laura; Vougiouklis, Pavlos; Groth, Paul; Simperl, Elena (2020). Dataset Reuse Indicators Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4015954
    Explore at:
    Dataset updated
    Sep 6, 2020
    Dataset provided by
    University of Amsterdam
    Huawei Technologies
    King's College London
    Authors
    Koesten, Laura; Vougiouklis, Pavlos; Groth, Paul; Simperl, Elena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains two files.

    1) A python pickle file (github_dataset.zip) that contains GitHub repositories with datasets. Specifically, we used Google's public dataset copy of GitHub and the BigQuery service to build a list of repositories that contain a CSV, XLSX, or XLS file. We then used the GitHub API to collect information about each repository in this list. The resulting dataset consists of 87,936 repositories that contain at least one CSV, XLSX, or XLS file, along with information about their features (e.g. number of open and closed issues and license) from GitHub. This corpus had more than two million data files. We then excluded files with fewer than ten rows, which left 65,537 repositories with a total of 1,467,240 data files.

    2) A python pickle file (processed_dataset.zip) containing the feature information necessary to train a machine learning model to predict reuse on these GitHub datasets.

    Source code can be found at: https://github.com/laurakoesten/Dataset-Reuse-Indicators

    For a full description of the content see:

    Koesten, Laura and Vougiouklis, Pavlos and Simperl, Elena and Groth, Paul, Dataset Reuse: Translating Principles to Practice. Available at SSRN: https://ssrn.com/abstract=3589836 or http://dx.doi.org/10.2139/ssrn.3589836

  7. GitHub-Issues

    • kaggle.com
    zip
    Updated Apr 28, 2022
    Cite
    Hamel Husain (2022). GitHub-Issues [Dataset]. https://www.kaggle.com/datasets/hamelhusain/githubissues
    Explore at:
    Available download formats: zip (2612014366 bytes)
    Dataset updated
    Apr 28, 2022
    Authors
    Hamel Husain
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    GitHub Issues Data Pulled From BigQuery Public Dataset -- Access this query at https://console.cloud.google.com/bigquery?sq=235037502967:a71a4b32d74442558a2739b581064e5f

    This data was pulled with the following SQL query:

    SELECT url, title, body
    FROM(
    SELECT url, title, body
      , ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 80, 120) ORDER BY url) as count_body_beg
    FROM(
    SELECT url, title, body
      , ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 40, 80) ORDER BY url) as count_body_beg
    FROM(
    SELECT url, title, body
     , ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 0, 40) ORDER BY url) as count_body_beg
    FROM(
    
      SELECT DISTINCT 
       url
       -- replace more than one white-space character in a row with a single space
      , REGEXP_REPLACE(title, r"\s{2,}", ' ') as title
      , REGEXP_REPLACE(body, r"\s{2,}", ' ') as body
      , ROW_NUMBER() OVER (PARTITION BY SUBSTR(title, 0, 22) ORDER BY url) as count_title_beg
      -- , RANK() OVER (PARTITION BY SUBSTR(body, 0, 1000) ORDER BY url) as count_body_beg
      FROM(
        SELECT
          JSON_EXTRACT(payload, '$.issue.html_url') as url
          -- extract the title and body removing parentheses, brackets, and quotes
         , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
         , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
        FROM `githubarchive.day.2021*`
        WHERE 
         -- ALL Of 2021 
           _TABLE_SUFFIX BETWEEN '0101' and '1231'
         and type="IssuesEvent" 
         -- Only want the issue at a specific point otherwise will have duplicates
         and JSON_EXTRACT(payload, '$.action') = "\"opened\"" 
         UNION ALL 
           SELECT
          JSON_EXTRACT(payload, '$.issue.html_url') as url
          -- extract the title and body removing parentheses, brackets, and quotes
         , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
         , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
        FROM `githubarchive.day.2020*`
        WHERE 
         -- ALL Of 2020 
           _TABLE_SUFFIX BETWEEN '0101' and '1231'
         and type="IssuesEvent" 
         -- Only want the issue at a specific point otherwise will have duplicates
         and JSON_EXTRACT(payload, '$.action') = "\"opened\""
    
      ) as tbl
    
      WHERE 
       -- the body must be at least 6 words long and the title at least 3 words long
       -- this is an arbitrary way to filter out empty or sparse issues
         ARRAY_LENGTH(SPLIT(body, ' ')) >= 6
       and ARRAY_LENGTH(SPLIT(title, ' ')) >= 3
       -- filter out issues that have really long titles or bodies
       --  (these are outliers, and will slow tokenization down).
       and LENGTH(title) <= 400
       and LENGTH(body) <= 2000
    ) tbl2
    WHERE count_title_beg = 1
    )tbl3
    WHERE count_body_beg = 1
    )tbl4
    WHERE count_body_beg = 1
    )tbl5
    WHERE count_body_beg = 1
    
  8. github-repo-enumeration

    • huggingface.co
    Updated Sep 20, 2024
    Cite
    Aurelium (2024). github-repo-enumeration [Dataset]. https://huggingface.co/datasets/aurelium/github-repo-enumeration
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2024
    Authors
    Aurelium
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was generated from GHArchive's Google BigQuery table. It contains a list of every public repo (~380,000,000) committed to from January 2016 up to August 2024, as well as the number of unique contributors and totals of the amounts of various events on those repositories in that time period. This is useless on its own, but it represents more than a few hours of effort and roughly $8 worth of cloud processing, so I figured I would save the next person who tries this some effort.

  9. Makani Flight Logs

    • console.cloud.google.com
    Updated Feb 15, 2020
    Cite
    BigQuery Public Datasets Program (2020). Makani Flight Logs [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/makani-logs?hl=en-GB
    Explore at:
    Dataset updated
    Feb 15, 2020
    Dataset provided by
    Googlehttp://google.com/
    BigQueryhttps://cloud.google.com/bigquery
    License
    Description

    Makani developed energy kites, using a wing tethered to a ground station to efficiently harness energy from the wind, generating electricity at utility scale. As the kite flies autonomously in loops, rotors on the wing spin as the wind moves through them, generating electricity that is sent down the tether to the grid. The company was closed in February 2020, but major technical learnings have been made available in the public domain. This dataset is part of that public package. The main folder in this bucket is labeled 'merged logs' and contains all telemetry from the kite and base station collected during crosswind flights of the M600 kite between 2016 and 2019. The other buckets contain build files and databases that are used to build and run the Makani flight simulator, which can be accessed at github.com/google/makani. This public dataset is hosted in Google Cloud Storage and is available free to use. Use this quick start guide to learn how to access public datasets on Google Cloud Storage.

  10. GitHub Programming Languages Data

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Cite
    Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
    Explore at:
    Available download formats: zip (41198 bytes)
    Dataset updated
    Jan 2, 2022
    Authors
    Isaac Wen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    A common question for those new to and familiar with computer science and software engineering is which programming language is the best and/or most popular. It is very difficult to give a definitive answer, as there are countless metrics that could define the 'best' or 'most popular' programming language.

    One such metric for a 'popular' programming language is the number of projects and files made using that language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages used for repositories, PRs, and issues on GitHub can be a good indicator of a language's popularity.

    Content

    This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

    Source

    This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

    Limitations

    Only public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is based only on public repositories, which may not be fully representative of all repositories on GitHub.

  11. java-decompiler

    • huggingface.co
    Updated Aug 16, 2017
    Cite
    Bradley McDanel (2017). java-decompiler [Dataset]. https://huggingface.co/datasets/BradMcDanel/java-decompiler
    Explore at:
    Dataset updated
    Aug 16, 2017
    Authors
    Bradley McDanel
    License

    https://choosealicense.com/licenses/other/

    Description

    java-decompiler

    This dataset contains Java source files and corresponding decompiled bytecode, suitable for training or evaluating decompilation and code understanding models. The Java files were extracted from public GitHub repositories indexed in Google BigQuery’s GitHub dataset. Files were selected with the following filters:

    • Only single-class files were retained.
    • Only files importing java.* libraries (i.e., no third-party dependencies).
    • Each file was compilable with minimal… See the full description on the dataset page: https://huggingface.co/datasets/BradMcDanel/java-decompiler.

  12. Eclipse Megamovie

    • console.cloud.google.com
    Updated Jul 15, 2023
    Cite
    Google Cloud Public Datasets Program (2023). Eclipse Megamovie [Dataset]. https://console.cloud.google.com/marketplace/product/google-cloud-public-datasets/eclipse-megamovie?hl=en_GB
    Explore at:
    Dataset updated
    Jul 15, 2023
    Dataset provided by
    Googlehttp://google.com/
    Description

    This is the full set of images submitted for the Eclipse Megamovie project, a citizen science project to capture images of the Sun’s corona during the August 21, 2017 total solar eclipse. These images were taken by volunteer photographers (as well as the general public) from across the country using consumer camera equipment. The Eclipse Megamovie project was a collaboration between UC Berkeley, Google, the Astronomical Society of the Pacific, and many more.* In addition to the dataset, the code used by the project to create the website and process individual movies can be found on GitHub. For a full description of the data fields, see below. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. *Additional partners: Center for Research on Lifelong STEM Learning, Oregon State University, Eclipse Across America, Foothill College, High Altitude Observatory of the National Center for Atmospheric Research, Ideum, Lick Observatory, Space Sciences Laboratory, University of California, Berkeley, University of Colorado at Boulder, Williams College and the IAU Working Group.

  13. CMS Synthetic Patient Data OMOP

    • redivis.com
    application/jsonl +7
    Updated Aug 19, 2020
    + more versions
    Cite
    Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
    Explore at:
    Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 1, 2008 - Dec 31, 2010
    Description

    Abstract

    This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

    Methodology

    This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, from the Observational Health Data Sciences and Informatics (OHDSI) webpage, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies.

    [Figure: Why-CDM.png, OHDSI diagram illustrating how distinctly-formatted source datasets are converted into the OMOP Common Data Model]

    Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHDSI OMOP site.
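    This is the practical payoff: a query written once against the standard tables runs unchanged on any OMOP instance. A minimal sketch using one standard CDM table (field names per the OHDSI specification; run it through whichever SQL interface hosts the data):

    # Most frequent conditions by number of distinct patients.
    query = """
        SELECT condition_concept_id, COUNT(DISTINCT person_id) AS n_patients
        FROM condition_occurrence
        GROUP BY condition_concept_id
        ORDER BY n_patients DESC
        LIMIT 20
    """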

    Usage

    • For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF

    • For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page: https://github.com/OHDSI/ETL-CMS

    • For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM and that this dataset works with the 5.2 version.

  14. SOTorrent 2018-12-09

    • kaggle.com
    zip
    Updated Dec 18, 2018
    Cite
    SOTorrent (2018). SOTorrent 2018-12-09 [Dataset]. https://www.kaggle.com/datasets/sotorrent/2018-12-09
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Dec 18, 2018
    Dataset authored and provided by
    SOTorrent
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Please note

    Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.

    Context

    Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.

    Content

    SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.

    This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.
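    As a hedged example, the BigQuery release can be queried like any other public dataset; the project/dataset path and the PostVersion table below follow the SOTorrent papers, but verify them against the actual release schema:

    # How many versions do the most-edited posts have?
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT PostId, COUNT(*) AS n_versions
        FROM `sotorrent-org.2018_12_09.PostVersion`
        GROUP BY PostId
        ORDER BY n_versions DESC
        LIMIT 10
    """
    print(client.query(query).to_dataframe())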

    Inspiration

    The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:

    • How are code snippets on Stack Overflow maintained?
    • How many clones of code snippets exist inside Stack Overflow?
    • How can we detect buggy versions of Stack Overflow code snippets and find them in GitHub projects?
    • How frequently are code snippets copied from external sources into Stack Overflow and then co-evolve there?
    • How do snippets copied from Stack Overflow to GitHub co-evolve?
    • Does the evolution of Stack Overflow code snippets follow patterns?
    • Do these patterns differ between programming languages?
    • Are the licenses of external sources compatible with Stack Overflow’s license (CC BY-SA 3.0)?
    • How many code blocks on Stack Overflow do not contain source code (and are only used for markup)?
    • Can we reliably predict bug-fixing edits to code on Stack Overflow?
    • Can we reliably predict popularity of Stack Overflow code snippets on GitHub?

    These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.

  15. Synthetic Patient Data in OMOP

    • console.cloud.google.com
    Updated Jul 26, 2023
    Cite
    U.S. Department of Health & Human Services (2023). Synthetic Patient Data in OMOP [Dataset]. https://console.cloud.google.com/marketplace/product/hhs/synpuf?hl=ja
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    Google (http://google.com/)
    Description

    The Synthetic Patient Data in OMOP Dataset is a synthetic database released by the Centers for Medicare and Medicaid Services (CMS) Medicare Claims Synthetic Public Use Files (SynPUF). It is synthetic data containing 2008-2010 Medicare insurance claims for development and demonstration purposes. It has been converted to the Observational Medical Outcomes Partnership (OMOP) common data model from its original form, CSV, by the open source community, as released on GitHub. Please refer to the CMS Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) User Manual for details regarding how DE-SynPUF was created. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

  16. LELU

    • huggingface.co
    Updated Apr 22, 2022
    + more versions
    Cite
    FrancophonIA (2022). LELU [Dataset]. https://huggingface.co/datasets/FrancophonIA/LELU
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset authored and provided by
    FrancophonIA
    Description

    Note: Dataset origin: https://www.kaggle.com/datasets/breandan/french-reddit-discussion

    LELÚ is a French dialog corpus that contains a rich collection of human-human, spontaneous written conversations, extracted from Reddit’s public dataset available through Google BigQuery. Our corpus is composed of 556,621 conversations with 1,583,083 utterances in total. The code to generate this dataset can be found in our GitHub Repository. The tag attributes can be described as follows: link_id: ID… See the full description on the dataset page: https://huggingface.co/datasets/FrancophonIA/LELU.

  17. Bitcoin Blockchain Historical Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/bigquery/bitcoin-blockchain
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the blockchain to store transactions in a distributed manner, mitigating flaws in the financial industry.

    Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

    Content

    In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
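    For example, the first Inspiration question below can be approached with a sketch like this, assuming the transactions table reports output_value in satoshi (1 BTC = 10^8 satoshi):

    # Total bitcoin sent per day over the last 30 days of data.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT DATE(block_timestamp) AS day,
               SUM(output_value) / 1e8 AS btc_sent
        FROM `bigquery-public-data.crypto_bitcoin.transactions`
        GROUP BY day
        ORDER BY day DESC
        LIMIT 30
    """
    print(client.query(query).to_dataframe())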

    Method & Acknowledgements

    Allen Day (Twitter | Medium), Google Cloud Developer Advocate, and Colin Bookman, Google Cloud Customer Engineer, retrieved data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk into two BigQuery tables, blocks_raw and transactions. These tables stay fresh: new data are appended as blocks are broadcast to the Bitcoin network. For additional information, visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

    Photo by Andre Francois on Unsplash.

    Inspiration

    • How many bitcoins are sent each day?
    • How many addresses receive bitcoin each day?
    • Compare transaction volume to historical prices by joining with other available data sources
  18. PatCit: A Comprehensive Dataset of Patent Citations

    • zenodo.org
    application/gzip, bin
    Updated Dec 23, 2020
    + more versions
    Cite
    Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise (2020). PatCit: A Comprehensive Dataset of Patent Citations [Dataset]. http://doi.org/10.5281/zenodo.3710994
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]

    Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

    It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non Patent Literature (NPL) citations and in the patent text itself.

    Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long-hidden information. That's the purpose of this project.

    IN PRACTICE

    A detailed presentation of the current state of the project is available in our March 2020 presentation.

    So far, we have:

    1. classified the 40 million NPL citations reported in the DOCDB database into 9 distinct research-oriented classes with a 90% accuracy rate.
    2. parsed and consolidated the 27 million NPL citations classified as bibliographical references.
    3. extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all USPTO patents to date.

    The latest version of the dataset is v0.15. It is made of v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.

    Give it a try! The dataset is publicly available on Google Cloud BigQuery.

    FEATURES

    Open

    • The code is licensed under MIT-2 and the dataset under CC4, two highly permissive licenses.
    • The project is meant to be dynamically improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.

    Comprehensive

    • We address worldwide patents, as long as the data is available.
    • We address all classes of citations, not only bibliographical references.
    • We address front-page and in-text citations.

    Highest standards

    • We use and implement state-of-the-art machine learning solutions.
    • We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparingly, for both environmental sustainability and long-term financial sustainability of the project.

  19. Ethereum Classic Blockchain

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Google BigQuery (2019). Ethereum Classic Blockchain [Dataset]. https://www.kaggle.com/datasets/bigquery/crypto-ethereum-classic
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Ethereum Classic is an open-source, public, blockchain-based distributed computing platform featuring smart contract (scripting) functionality. It provides a decentralized Turing-complete virtual machine, the Ethereum Virtual Machine (EVM), which can execute scripts using an international network of public nodes. Ethereum Classic and Ethereum have a value token called "ether", which can be transferred between participants, stored in a cryptocurrency wallet and is used to compensate participant nodes for computations performed in the Ethereum Platform.

    Ethereum Classic came into existence when some members of the Ethereum community rejected the DAO hard fork on the grounds of "immutability", the principle that the blockchain cannot be changed, and decided to keep using the unforked version of Ethereum. To this day, Ethereum Classic runs the original Ethereum chain.

    Content

    In this dataset, you will have access to Ethereum Classic (ETC) historical block data along with transactions and traces. You can access the data from BigQuery in your notebook via the bigquery-public-data.crypto_ethereum_classic dataset.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum_classic.[TABLENAME]. Fork this kernel to get started.
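    Because these chain tables are large, it is worth dry-running a query to see how many bytes it would scan before paying for it; a sketch, assuming the transactions table carries a block_timestamp column:

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT DATE(block_timestamp) AS day, COUNT(*) AS n_tx
        FROM `bigquery-public-data.crypto_ethereum_classic.transactions`
        GROUP BY day
        ORDER BY day DESC
        LIMIT 30
    """
    # Dry run: validates the query and reports bytes scanned without running it.
    dry = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
    print(f"Would scan {dry.total_bytes_processed / 1e9:.2f} GB")
    print(client.query(query).to_dataframe())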

    Acknowledgements

    This dataset wouldn't be possible without the help of Allen Day, Evgeny Medvedev and Yaz Khoury. This dataset uses Blockchain ETL. Special thanks to ETC community member @donsyang for the banner image.

    Inspiration

    One of the main questions we wanted to answer concerned the Gini coefficient of ETC data. We also wanted to analyze the DAO Smart Contract before and after the DAO Hack and the resulting hard fork. Finally, we wanted to analyze the network during the famous 51% attack and see what sort of patterns we can spot about the attacker.

  20. Hacker News Corpus

    • kaggle.com
    zip
    Updated Jun 29, 2017
    Cite
    Hacker News (2017). Hacker News Corpus [Dataset]. https://www.kaggle.com/hacker-news/hacker-news-corpus
    Explore at:
    Available download formats: zip (642956855 bytes)
    Dataset updated
    Jun 29, 2017
    Dataset authored and provided by
    Hacker News
    Description

    Context

    This dataset contains a randomized sample of roughly one quarter of all stories and comments from Hacker News from its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

    Content

    Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

    Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.

    Acknowledgements

    This dataset was kindly made publicly available by Hacker News under the MIT license.

    Inspiration

    • Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?

    • Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?

    • Is the amount of coverage by Hacker News predictive of a startup’s success?

    Use this dataset with BigQuery

    You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data in BigQuery, too: https://cloud.google.com/bigquery/public-data/hacker-news

    The BigQuery version of this dataset has roughly four times as many articles.
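    For instance, the first Inspiration question above can be probed directly against the live BigQuery copy; a hedged sketch, assuming the commonly documented full table with a `by` author column (note the backticks, since BY is a reserved word):

    # Which authors account for the most stories and comments?
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT `by` AS author, COUNT(*) AS n_items
        FROM `bigquery-public-data.hacker_news.full`
        WHERE `by` IS NOT NULL
        GROUP BY author
        ORDER BY n_items DESC
        LIMIT 20
    """
    print(client.query(query).to_dataframe())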
