100+ datasets found
  1. GitHub Activity Data

    • console.cloud.google.com
    Updated Jun 23, 2022
    + more versions
    Cite
GitHub (2022). GitHub Activity Data [Dataset]. https://console.cloud.google.com/marketplace/product/github/github-repos
    Explore at:
    Dataset updated
    Jun 23, 2022
    Dataset provided by
    GitHub (https://github.com/)
    Google (http://google.com/)
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories, including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier, which gives each user 1TB of free query processing per month that can be used to run queries on this public dataset.
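
    Because the dataset lives in BigQuery, it can be queried with standard SQL. Below is a minimal sketch using the google-cloud-bigquery client library; it assumes configured GCP credentials, and the query runs against the public sample_commits table (counting against the free tier).

```python
# Sketch: counting commits per repository in the public GitHub dataset.
# Requires `pip install google-cloud-bigquery` and GCP credentials.

def commit_count_query(limit=10):
    """Build a Standard SQL query over the public sample_commits table."""
    return (
        "SELECT repo_name, COUNT(*) AS commits "
        "FROM `bigquery-public-data.github_repos.sample_commits` "
        "GROUP BY repo_name ORDER BY commits DESC "
        f"LIMIT {limit}"
    )

if __name__ == "__main__":
    from google.cloud import bigquery
    client = bigquery.Client()  # uses application-default credentials
    for row in client.query(commit_count_query()).result():
        print(row.repo_name, row.commits)
```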

  2. github-code

    • huggingface.co
    Cite
    CodeParrot, github-code [Dataset]. https://huggingface.co/datasets/codeparrot/github-code
    Explore at:
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of text data. The dataset was created from the public GitHub dataset on Google BigQuery.
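
    Since the full dataset is about 1TB, it is usually streamed rather than downloaded. A sketch using the Hugging Face datasets library; the languages filter argument is taken from the dataset card and should be verified against the current card.

```python
# Stream github-code instead of downloading ~1TB; `take` collects the
# first n samples from a (lazy, possibly infinite) iterable.
def take(iterable, n):
    out = []
    for item in iterable:
        out.append(item)
        if len(out) == n:
            break
    return out

if __name__ == "__main__":
    from datasets import load_dataset  # pip install datasets
    ds = load_dataset("codeparrot/github-code", split="train",
                      streaming=True, languages=["Python"])
    for sample in take(ds, 3):
        print(sample["repo_name"], sample["path"])
```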

  3. Coronavirus COVID-19 Global Cases by the Center for Systems Science and...

    • github.com
    • systems.jhu.edu
    • +1more
    Cite
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
    Explore at:
    Dataset provided by
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
    Area covered
    Global
    Description

    2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
    https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    • Confirmed Cases by Country/Region/Sovereignty
    • Confirmed Cases by Province/State/Dependency
    • Deaths
    • Recovered

    Downloadable data:
    https://github.com/CSSEGISandData/COVID-19

    Additional Information about the Visual Dashboard:
    https://systems.jhu.edu/research/public-health/ncov

  4. github-meta-data

    • huggingface.co
    Updated May 31, 2025
    Cite
    zamal_ (2025). github-meta-data [Dataset]. https://huggingface.co/datasets/zamal/github-meta-data
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    zamal_
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    GitHub Meta Data

    This dataset contains GitHub repository descriptions paired with their tags.

    input: a natural language query or description of a GitHub project
    target: comma-separated tags describing it

    Used for training a T5 model for GitHub-style tag generation.
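
    The input/target pairing above can be sketched as a small preprocessing helper; the function and field names here are illustrative assumptions, not taken from the dataset card.

```python
# Illustrative: turn one record into an (input, target) pair for
# T5-style sequence-to-sequence fine-tuning. Field names are assumed.
def to_t5_pair(description, tags):
    return {
        "input": description.strip(),
        "target": ", ".join(tags),  # comma-separated tags, as described
    }

pair = to_t5_pair("A fast HTTP client for Python  ",
                  ["python", "http", "client"])
print(pair["target"])  # python, http, client
```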

  5. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    • explore.openaire.eu
    pdf, zip
    Updated Jul 11, 2024
    + more versions
    Cite
    Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We used the reproducibility code for Jupyter notebooks from the study by Pimentel et al. (2019) and adapted code from ReproduceMeGit. We provide code for collecting publication metadata from PubMed Central using the NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning each entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine the repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Using the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
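
    The PMC search step described above can be sketched with Biopython's Entrez module; the email address is a placeholder (NCBI requires a contact address), and retmax is an illustrative choice.

```python
# The PMC query used in the study; esearch returns matching article IDs.
PMC_QUERY = "(ipynb OR jupyter OR ipython) AND github"

if __name__ == "__main__":
    from Bio import Entrez  # pip install biopython
    Entrez.email = "you@example.org"  # placeholder: set your own address
    handle = Entrez.esearch(db="pmc", term=PMC_QUERY, retmax=100)
    record = Entrez.read(handle)
    handle.close()
    print(record["Count"], record["IdList"][:5])
```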

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find the notebooks used for the in-depth analysis of the study's data. The db.sqlite file generated by running the archaeology scripts is stored in the analyses folder for further analysis; the path can be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) examines data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) analyzes data associated with publications in PubMed Central, i.e. plots involving data about articles, journals, publication dates, or research fields. The figures produced by these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in xml format extracted from PubMed Central which contains the information about the articles and journal can be accessed in pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch Jupyterlab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.


  6. Data extracted from GitHub repositories (training and test data-sets)

    • data.mendeley.com
    Updated Aug 1, 2019
    + more versions
    Cite
    Youcef Bouziane (2019). Data extracted from GitHub repositories (training and test data-sets) [Dataset]. http://doi.org/10.17632/gt3f4jnbvn.3
    Explore at:
    Dataset updated
    Aug 1, 2019
    Authors
    Youcef Bouziane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the SQL tables of the training and test datasets used in our experimentation. These tables contain the preprocessed textual data (in the form of tokens) extracted from each training and test project. Besides the preprocessed textual data, this dataset also contains metadata about the projects, GitHub topics, and GitHub collections. The GitHub projects are identified by the tuple "Owner" and "Name". The descriptions of the table fields are attached to their respective data descriptions.

  7. Project GitHub

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Project GitHub [Dataset]. https://catalog.data.gov/dataset/project-github
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The release of CertWare was announced on 23 Mar 2012 on code.nasa.gov. The announcement points to the CertWare project on NASA's GitHub repository at nasa.github.com/CertWare. The project site contains install instructions (as an Eclipse feature), various tutorials and resources, and a link to the GitHub source repository. CertWare is released under the NASA Open Source Agreement (NOSA).

  8. GitHub Issues and Comments

    • figshare.com
    html
    Updated Jun 8, 2022
    Cite
    Alexandra Donisan (2022). GitHub Issues and Comments [Dataset]. http://doi.org/10.6084/m9.figshare.20024303.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    figshare
    Authors
    Alexandra Donisan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset containing GitHub issues (labeled using technical-debt keywords) together with their comments. Both the issues and the comments include their GitHub reactions. The dataset is a MongoDB-exported JSON file.
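
    mongoexport typically writes one JSON document per line, so such a file can be read without a MongoDB instance. A minimal sketch; the field names in the sample are illustrative assumptions, not the dataset's actual schema.

```python
import json

def read_mongoexport(lines):
    """Parse mongoexport-style JSON Lines into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]

# Illustrative record; real field names come from the dataset itself.
sample = ['{"title": "Fix TODO debt", "reactions": {"+1": 3}, "comments": []}']
docs = read_mongoexport(sample)
print(docs[0]["reactions"]["+1"])  # 3
```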

  9. github-code-clean

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2022
    Cite
    CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The GitHub Code Clean dataset is a filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.

  10. Coronavirus (Covid-19) Data in the United States

    • github.com
    • openicpsr.org
    • +2more
    csv
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
    Explore at:
    Available download formats: csv
    Dataset provided by
    New York Times
    License

    https://github.com/nytimes/covid-19-data/blob/master/LICENSE

    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
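
    Because the files contain cumulative counts, day-over-day differences yield newly reported cases. A minimal sketch on an inline sample following the repository's documented date,state,fips,cases,deaths column layout.

```python
import csv, io

# Tiny inline sample in the us-states.csv layout (cumulative counts).
SAMPLE = """date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,2,0
"""

def daily_new_cases(rows):
    """Convert cumulative case counts into per-day new cases."""
    prev, out = 0, []
    for row in rows:
        cases = int(row["cases"])
        out.append((row["date"], cases - prev))
        prev = cases
    return out

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(daily_new_cases(rows))
```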

  11. GitHub developer behavior and repository evolution dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Feb 7, 2020
    Cite
    Shengyu Zhao; Tianyi Zhou (2020). GitHub developer behavior and repository evolution dataset [Dataset]. http://doi.org/10.5281/zenodo.3648084
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Feb 7, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shengyu Zhao; Tianyi Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this work, based on the GitHub Archive project and repository-mining tools, we process all available data into a concise, structured format to generate the GitHub developer behavior and repository evolution dataset. Together with the configurable interactive analysis tool we provide, it offers a macroscopic view of the evolution of the open source ecosystem.

  12. Data from: A Dataset for GitHub Repository Deduplication

    • data.niaid.nih.gov
    Updated Feb 9, 2020
    Cite
    Spinellis, Diomidis (2020). A Dataset for GitHub Repository Deduplication [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3653919
    Explore at:
    Dataset updated
    Feb 9, 2020
    Dataset provided by
    Spinellis, Diomidis
    Mockus, Audris
    Kotti, Zoe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of which projects are considered related.

    The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.

    The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
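
    The deduplicate_names file can be loaded into a simple lookup table mapping each duplicated repository to its ultimate parent. A minimal sketch; the sample records are made up for illustration.

```python
def load_dedup_map(lines):
    """Parse tab-separated 'source<TAB>target' records into a dict."""
    mapping = {}
    for line in lines:
        source, target = line.rstrip("\n").split("\t")
        mapping[source] = target
    return mapping

# Hypothetical records in the login-name/project-name convention.
sample = ["alice/fork-of-lib\tupstream/lib", "bob/lib-clone\tupstream/lib"]
dedup = load_dedup_map(sample)
print(dedup["alice/fork-of-lib"])  # upstream/lib
```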

  13. Data from: Mining the Technical Roles of GitHub Users

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 24, 2022
    Cite
    Luciana L. (2022). Mining the Technical Roles of GitHub Users [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2559483
    Explore at:
    Dataset updated
    Feb 24, 2022
    Dataset provided by
    Marco Tulio
    Luciana L.
    João Eduardo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the scripts and data used in the study reported in the paper "Mining the Technical Roles of GitHub Users". The files are described in more detail below:

    processed_ground_truth.csv: A CSV file with the information of the developers considered in the study. For privacy reasons, we preprocessed the dataset to remove identifying information. Please contact the authors in case you need the original one.

    processed_ground_truth_fullstack.csv: Same CSV file but with fullstack developers.

    script.ipynb, utils.py: Source code of the script used in our study.

    Dockerfile, docker-compose.yml, requirements.txt: Files to replicate the code environment used in this study.

    BoW-tuning.csv: Classification results for different bag-of-words parameters.

  14. Data from: QuerTCI: A Tool Integrating GitHub Issue Querying with Comment...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 21, 2022
    Cite
    Ye Paing (2022). QuerTCI: A Tool Integrating GitHub Issue Querying with Comment Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6115403
    Explore at:
    Dataset updated
    Feb 21, 2022
    Dataset provided by
    Ye Paing
    Tatiana Castro Vélez
    Raffi Khatchadourian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover---among others---challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. Using our tool, our hope is that ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.

  15. Data from: "A guide to using GitHub for developing and versioning data...

    • knb.ecoinformatics.org
    • dataone.org
    • +1more
    Updated May 4, 2023
    Cite
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal (2023). Data from: "A guide to using GitHub for developing and versioning data standards and reporting formats" [Dataset]. http://doi.org/10.15485/1780565
    Explore at:
    Dataset updated
    May 4, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal
    Time period covered
    Sep 1, 2020 - Dec 3, 2020
    Description

    These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.

  16. Amazon review data 2018

    • nijianmo.github.io
    • cseweb.ucsd.edu
    • +1more
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://nijianmo.github.io/amazon/
    Explore at:
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews: the total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews: current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata: we have added transaction metadata for each review shown on the review page, as well as more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.

  17. A Representative User-centric GitHub Developers Dataset for Malicious...

    • figshare.com
    png
    Updated Dec 29, 2022
    + more versions
    Cite
    Yushan Liu (2022). A Representative User-centric GitHub Developers Dataset for Malicious Account Detection [Dataset]. http://doi.org/10.6084/m9.figshare.21789566.v1
    Explore at:
    Available download formats: png
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    figshare
    Authors
    Yushan Liu
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Using the GitHub APIs, we constructed an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, represents one GitHub user, and contains the descriptive information from the user's profile page, information on their commit activities, and their created/forked public repositories.

    We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain it for research purposes only.

    Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.

  18. Data from: Can we make it better? Assessing and improving quality of GitHub...

    • researchdata.smu.edu.sg
    zip
    Updated May 31, 2023
    Cite
    GEDE ARTHA AZRIADI PRANA (SMU) (2023). Data from: Can we make it better? Assessing and improving quality of GitHub repositories [Dataset]. http://doi.org/10.25440/smu.17073050.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    GEDE ARTHA AZRIADI PRANA (SMU)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset accompanying the PhD dissertation by G. A. A. Prana, "Can We Make It Better? Assessing and Improving Quality of GitHub Repositories", available at https://ink.library.smu.edu.sg/etd_coll/373/.

    The code hosting platform GitHub has gained immense popularity worldwide in recent years, with over 200 million repositories hosted as of June 2021. Due to its popularity, it has great potential to facilitate widespread improvements across many software projects. Naturally, GitHub has attracted much research attention, and the source code in the various repositories it hosts also provides an opportunity to apply techniques and tools developed by software engineering researchers over the years. However, much of the existing body of research applicable to GitHub focuses on the code quality of software projects and ways to improve it. Fewer works focus on potential ways to improve the quality of GitHub repositories through other aspects, although the quality of a software project on GitHub is also affected by factors outside the project's source code, such as documentation, the project's dependencies, and its pool of contributors.

    The three works that form this dissertation investigate aspects of GitHub repositories beyond code quality and identify specific improvements that can be applied to a wide range of GitHub repositories. In the first work, we aim to systematically understand the content of README files in GitHub software projects and develop a tool that can process them automatically. The work begins with a qualitative study involving 4,226 README file sections from 393 randomly sampled GitHub repositories, which reveals that many README files contain the "What" and "How" of the software project, but often do not contain its purpose and status. This is followed by the development and evaluation of a multi-label classifier that can predict eight different README content categories with an F1 of 0.746. From our subsequent evaluation of the classifier, which involved twenty software professionals, we find that adding labels generated by the classifier to README files eases information discovery.

    Our second work focuses on characteristics of vulnerabilities in open-source libraries used by 450 software projects on GitHub written in Java, Python, and Ruby. Using an industrial software composition analysis tool, we scanned every version of the projects after each commit made between November 1, 2017 and October 31, 2018. Our subsequent analyses of the discovered library names, versions, and associated vulnerabilities reveal, among other things, that "Denial of Service" and "Information Disclosure" vulnerability types are common. In addition, we find that most of the vulnerabilities persist throughout the observation period, and that attributes such as project size, project popularity, and the experience level of commit authors do not translate to better or worse handling of vulnerabilities in dependent libraries. Based on these findings, we list a number of implications for library users, library developers, and researchers, and provide several concrete recommendations. These include recommendations to simplify projects' dependency sets, as well as to encourage research into ways to automatically recommend libraries known to be secure.

    In our third work, we conduct a multi-region geographical analysis of gender inclusion on GitHub. We use a mixed-methods approach involving a quantitative analysis of the commit authors of 21,456 project repositories, followed by a survey strategically targeted at developers in various regions worldwide and a qualitative analysis of the survey responses. Among other findings, we discover differences in diversity levels between regions, with Asia and the Americas being highest. We also find no strong correlation between the gender and geographic diversity of a repository's commit authors. Further, from our survey respondents worldwide, we identify barriers and motivations to contribute to open-source software. The results of this work provide insights into the current state of gender diversity in open-source software and potential ways to improve participation of developers from under-represented regions and genders, and subsequently improve the open-source community in general. Such potential ways include the creation of codes of conduct, proximity-based mentorship schemes, and the highlighting of women and regional role models.

  19. (No) Influence of Continuous Integration on the Development Activity in...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    + more versions
    Cite
    Sebastian Baltes; Jascha Knack (2020). (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset [Dataset]. http://doi.org/10.5281/zenodo.1291582
    Explore at:
    csv
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sebastian Baltes; Jascha Knack
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    1. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
    2. were active for at least one year (365 days) before the first build with Travis CI (before_ci),
    3. used Travis CI at least for one year (during_ci),
    4. had commit or merge activity on the default branch in both of these phases, and
    5. used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent BigQuery dataset. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
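    The extraction step above can be sketched in a few lines of Python. The authors' actual parser is at https://github.com/sbaltes/git-log-parser; the `git log` format string and the `parse_log_line` helper below are our own illustration of the idea, not their code. The column names match the CSV schema documented further down.

    ```python
    # Log lines like the ones parsed here could be produced with:
    #   git log --pretty=format:'%H%x09%an%x09%ae%x09%ad%x09%cn%x09%ce%x09%cd%x09%s'
    # (%x09 is a tab, so each commit becomes one tab-separated line).

    COLUMNS = ["hash_value", "author_name", "author_email", "author_date",
               "commit_name", "commit_email", "commit_date"]

    def parse_log_line(line: str) -> dict:
        """Split one tab-separated git-log line into the dataset's commit
        columns, replacing the commit subject by its length in characters
        (log_message_length). Assumes the subject contains no tabs."""
        fields = line.rstrip("\n").split("\t")
        row = dict(zip(COLUMNS, fields[:7]))
        row["log_message_length"] = len(fields[7]) if len(fields) > 7 else 0
        return row
    ```

    Binary-file and file-extension filtering, as described above, would happen in a separate pass over `git log --numstat` output.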

    We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:

    1. have Java or Ruby as their project language,
    2. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
    3. have commit activity for at least two years (730 days),
    4. are engineered software projects (at least 10 watchers), and
    5. were not in the TravisTorrent dataset.

    In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.

    This dataset contains the following files:

    tr_projects_sample_filtered_2.csv
    A CSV file with information about the 113 selected projects.

    tr_sample_commits_default_branch_before_ci.csv
    tr_sample_commits_default_branch_during_ci.csv

    One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The branch to which the commit was made.
    hash_value: The SHA1 hash value of the commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
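    As a small illustration of how these commit tables might be consumed, the sketch below counts commits per project in each phase using only the `project` column; the file names come from the dataset description, while the aggregation itself is our own example.

    ```python
    import csv
    from collections import Counter

    def commits_per_project(rows):
        """Count commits per project from an iterable of CSV row dicts."""
        return Counter(row["project"] for row in rows)

    def load_commit_counts(path):
        with open(path, newline="", encoding="utf-8") as f:
            return commits_per_project(csv.DictReader(f))

    # Usage, assuming the dataset CSVs are in the working directory:
    # before = load_commit_counts("tr_sample_commits_default_branch_before_ci.csv")
    # during = load_commit_counts("tr_sample_commits_default_branch_during_ci.csv")
    # delta = {p: during[p] - before[p] for p in before.keys() | during.keys()}
    ```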

    tr_sample_merges_default_branch_before_ci.csv
    tr_sample_merges_default_branch_during_ci.csv

    One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The destination branch of the merge.
    hash_value: The SHA1 hash value of the merge commit.
    merged_commits: Unique hash value prefixes of the commits merged with this commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
    pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    source_branch : Source branch of the pull request (extracted from log message).
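    The last three columns are described as "extracted from log message". GitHub's default merge-commit subject has the form "Merge pull request #123 from user/branch", so the extraction was presumably done with a pattern along these lines (our reconstruction, not the authors' code):

    ```python
    import re

    MERGE_RE = re.compile(r"Merge pull request #(\d+) from ([^/\s]+)/(\S+)")

    def parse_merge_message(msg):
        """Extract pull_request_id, source_user, and source_branch from a
        GitHub merge-commit subject; returns None for non-PR merges."""
        m = MERGE_RE.search(msg)
        if not m:
            return None
        return {"pull_request_id": int(m.group(1)),
                "source_user": m.group(2),
                "source_branch": m.group(3)}
    ```

    Merges made locally (without a pull request) produce no match, which would explain empty values in those columns.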

    comparison_project_sample_800.csv
    A CSV file with information about the 800 projects in the comparison sample.

    commits_default_branch_before_mid.csv
    commits_default_branch_after_mid.csv

    One CSV file with information about all commits to the default branch before and after the median date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.

    merges_default_branch_before_mid.csv
    merges_default_branch_after_mid.csv

    One CSV file with information about all merges into the default branch before and after the median date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
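    The median-date split used for the comparison sample can be sketched as follows; this is our reconstruction of the described step, and the authors' exact procedure may differ in tie-breaking details.

    ```python
    from datetime import datetime
    from statistics import median

    def split_at_median(timestamps):
        """Split a list of commit datetimes into (before, after) at the
        median development date, with the median itself going to 'before'."""
        mid = datetime.fromtimestamp(median(t.timestamp() for t in timestamps))
        before = [t for t in timestamps if t <= mid]
        after = [t for t in timestamps if t > mid]
        return before, after
    ```

    Projects whose `before` or `after` half ends up empty would then be dropped, as described above.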

  20. Data from: A Panel Data Set of Cryptocurrency Development Activity on GitHub...

    • zenodo.org
    application/gzip, bin +2
    Updated Jan 24, 2020
    Cite
    Rijnard van Tonder; Asher Trockman; Claire Le Goues (2020). A Panel Data Set of Cryptocurrency Development Activity on GitHub [Dataset]. http://doi.org/10.5281/zenodo.2595588
    Explore at:
    txt, application/gzip, bin, csv
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Rijnard van Tonder; Asher Trockman; Claire Le Goues
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contents:

    • all-sorted-recovered-normalized-2018-01-21-to-2019-02-04.csv: CSV format of all data, sorted by date. This file contains some imputed values for missing data, and missing fields across all repositories are normalized to "null". This is the most convenient form to use.
    • all-sorted-2018-01-21-to-2019-02-04.csv: CSV format of all data, sorted by date. It is the raw data after processing the raw format.
    • raw-data-2018-01-21-to-2019-02-04.tar.gz: The raw format of data collected (S-expressions). Contains additional contributor data and CoinMarketCap data not currently in the CSV datasets.
    • recovered.patch: The modification on all-sorted-2018-01-21-to-2019-02-04.csv after recovering (imputing) data, showing what was recovered.
    • recovered-normalized.patch: The modification of all-sorted-2018-01-21-to-2019-02-04.csv after normalizing the recovered data set. Thus, patching all-sorted-2018-01-21-to-2019-02-04.csv with recovered.patch, then recovered-normalized.patch gives all-sorted-recovered-normalized-2018-01-21-to-2019-02-04.csv
    • missing-dates.txt: Days for which we missed GitHub data collection (partial or completely).
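    The patch chain described above (raw CSV + recovered.patch + recovered-normalized.patch = normalized CSV) can be scripted with the standard `patch` tool; the file names come from the dataset listing, and running this of course requires the downloaded files.

    ```python
    import shutil
    import subprocess

    def rebuild_normalized(src="all-sorted-2018-01-21-to-2019-02-04.csv",
                           out="rebuilt-normalized.csv"):
        """Apply recovered.patch, then recovered-normalized.patch, to a copy
        of the raw sorted CSV, reproducing the normalized CSV."""
        shutil.copyfile(src, out)  # work on a copy; patch edits in place
        for p in ["recovered.patch", "recovered-normalized.patch"]:
            subprocess.run(["patch", out, p], check=True)
        return out
    ```

    Comparing the result against all-sorted-recovered-normalized-2018-01-21-to-2019-02-04.csv (e.g. with `diff`) verifies the chain.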

    Related publications:

    @inproceedings{van-tonder-crypto-oss-2019,
      title     = {{A Panel Data Set of Cryptocurrency Development Activity on GitHub}},
      booktitle = {International Conference on Mining Software Repositories},
      author    = {{van~Tonder}, Rijnard and Trockman, Asher and {Le~Goues}, Claire},
      series    = {MSR '19},
      year      = 2019
    }

    @inproceedings{trockman-striking-gold-2019,
      title     = {{Striking Gold in Software Repositories? An Econometric Study of Cryptocurrencies on GitHub}},
      booktitle = {International Conference on Mining Software Repositories},
      author    = {Trockman, Asher and {van~Tonder}, Rijnard and Vasilescu, Bogdan},
      series    = {MSR '19},
      year      = 2019
    }

    Related code: https://github.com/rvantonder/CryptOSS
