100+ datasets found
  1. github-code

    • huggingface.co
    Cite
    CodeParrot, github-code [Dataset]. https://huggingface.co/datasets/codeparrot/github-code
    Explore at:
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.

  2. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analysis of large BigQuery datasets.
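    A query against these tables can be sketched as follows. This is a minimal illustration, not the dataset's own code: the helper name and the LIMIT clause are assumptions, while the table path bigquery-public-data.github_repos comes from the description above.

    ```python
    def github_repos_query(table, limit=10):
        """Build a Standard SQL query against a table of the public GitHub dataset.

        `table` is one of the dataset's table names, e.g. "languages" or "commits".
        """
        return (
            f"SELECT * FROM `bigquery-public-data.github_repos.{table}` "
            f"LIMIT {int(limit)}"
        )

    # Executing the query requires the BigQuery client library and credentials:
    #   from google.cloud import bigquery
    #   client = bigquery.Client()
    #   rows = client.query(github_repos_query("languages")).result()
    ```

    The LIMIT keeps exploratory queries cheap, which matters on a multi-terabyte public dataset billed by bytes scanned.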

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  3. github-code-clean

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2022
    Cite
    CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The GitHub Code Clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.

  4. GitHub-Python Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 15, 2021
    Cite
    Michihiro Yasunaga; Percy Liang (2021). GitHub-Python Dataset [Dataset]. https://paperswithcode.com/dataset/github-python
    Explore at:
    Dataset updated
    Jun 15, 2021
    Authors
    Michihiro Yasunaga; Percy Liang
    Description

    Repair AST parse (syntax) errors in Python code

  5. codeparrot-java-all

    • huggingface.co
    Updated Mar 17, 2022
    + more versions
    Cite
    Aditya Goswami (2022). codeparrot-java-all [Dataset]. https://huggingface.co/datasets/Aditya78b/codeparrot-java-all
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2022
    Authors
    Aditya Goswami
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Code Dataset

      Dataset Description
    

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

      How to use it
    

    The GitHub Code dataset is a very large dataset so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the… See the full description on the dataset page: https://huggingface.co/datasets/Aditya78b/codeparrot-java-all.
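    The streaming pattern recommended above can be sketched as follows; a minimal example assuming the Hugging Face datasets library is installed, with `take` as an illustrative helper (not part of the library):

    ```python
    from itertools import islice

    def take(stream, n):
        """Materialize the first n examples of a lazily streamed dataset."""
        return list(islice(stream, n))

    # With the `datasets` library, the dataset can be iterated without
    # downloading it in full (assumption: the default "train" split):
    #   from datasets import load_dataset
    #   ds = load_dataset("Aditya78b/codeparrot-java-all",
    #                     streaming=True, split="train")
    #   first_files = take(ds, 3)  # a few code-file records
    ```

    Streaming avoids materializing the full ~1TB corpus on disk; only the examples actually consumed are fetched.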

  6. Data from: Data Science Problems

    • github.com
    • opendatalab.com
    Updated Feb 8, 2022
    + more versions
    Cite
    (2022). Data Science Problems [Dataset]. https://github.com/microsoft/DataScienceProblems
    Explore at:
    Dataset updated
    Feb 8, 2022
    License

    https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt

    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state of the art results and other properties of the dataset.

  7. Data from: An Empirical Evaluation of GitHub Copilot’s Code Suggestions

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Nhan Nguyen; Sarah Nadi (2023). An Empirical Evaluation of GitHub Copilot’s Code Suggestions [Dataset]. http://doi.org/10.6084/m9.figshare.18515141.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Nhan Nguyen; Sarah Nadi
    License

    https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    Artifact accompanying MSR 2022 paper titled "An Empirical Evaluation of GitHub Copilot’s Code Suggestions" by Nhan Nguyen and Sarah Nadi

  8. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    • explore.openaire.eu
    pdf, zip
    Updated Jul 11, 2024
    + more versions
    Cite
    Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sheeba Samuel; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We reuse the Jupyter-notebook reproducibility code from the study by Pimentel et al. (2019) and adapt code from ReproduceMeGit. We also provide code for collecting publication metadata from PubMed Central using the NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
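    The link-extraction step can be sketched with a small regular expression. This is an illustrative helper, not the study's actual code, and it assumes repository links of the form github.com/owner/repo:

    ```python
    import re

    # Matches http(s) GitHub repository URLs; dots and hyphens are allowed in
    # owner and repository names.
    GITHUB_LINK = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")

    def extract_github_links(text):
        """Collect unique GitHub repository links from an article's full text."""
        # rstrip(".") drops sentence-final periods that the character class
        # would otherwise swallow.
        return sorted({link.rstrip(".") for link in GITHUB_LINK.findall(text)})
    ```

    Applied to a scanned article body, duplicate mentions collapse to one entry and trailing sentence punctuation is stripped.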

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find notebooks instrumental in the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology folder is stored in the analyses folder for further analysis. The path can, however, be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) is focused on examining data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) is for analyzing data associated with publications in PubMed Central, i.e., for plots involving data about articles, journals, publication dates, or research fields. The resultant figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.
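    Since everything ends up in db.sqlite, the tables can be inspected with Python's built-in sqlite3 module; a minimal sketch (the helper name is illustrative, not from the study's code):

    ```python
    import sqlite3

    def list_tables(db_path):
        """Return the names of all tables in an SQLite database such as db.sqlite."""
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            ).fetchall()
        return [name for (name,) in rows]

    # Against the study's db.sqlite this should list the 24 tables described
    # above (articles, journals, repositories, notebooks, cells, ...).
    ```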

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch Jupyterlab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.


  9. Code review regression analysis of open source GitHub projects

    • datadryad.org
    zip
    Updated Aug 31, 2017
    Cite
    Christopher Thompson; David Wagner (2017). Code review regression analysis of open source GitHub projects [Dataset]. http://doi.org/10.6078/D14X0T
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 31, 2017
    Dataset provided by
    Dryad
    Authors
    Christopher Thompson; David Wagner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2017
    Description

    This dataset contains the repository data used for our study "A Large-Scale Study of Modern Code Review and Security in Open Source Projects". This dataset was collected from GitHub, and includes 3,126 projects in 143 languages, with 489,038 issues and 382,771 pull requests. We also include the regression analysis notebooks for reproducing our results from this data.

  10. PIPr: A Dataset of Public Infrastructure as Code Programs

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 28, 2023
    Cite
    Spielmann, David (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8262770
    Explore at:
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Spielmann, David
    Sokolowski, Daniel
    Salvaneschi, Guido
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    • Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
    • AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
    • CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Names of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of each repository, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
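    To illustrate working with the metadata, the following sketch filters repositories.csv for repositories whose licenses permit redistribution. The column names follow the schema above; the function name and the literal True/False encoding of the boolean column are assumptions.

    ```python
    import csv
    import io

    def redistributable_repos(csv_text):
        """Yield URLs of repositories flagged as redistributable in repositories.csv."""
        for row in csv.DictReader(io.StringIO(csv_text)):
            # Assumption: booleans are serialized as "True"/"False".
            if row["redistributable"].strip().lower() == "true":
                yield row["url"]
    ```

    The same pattern applies to programs.csv and testing-files.csv, e.g. filtering programs by their PL-IaC solution.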

  11. Data from: Data and Code from: Environment, plant genetics, and their...

    • catalog.data.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and Code from: Environment, plant genetics, and their interaction shape important aspects of sunflower rhizosphere microbial communities [GitHub repository] [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environment-plant-genetics-and-their-interaction-shape-important-aspect
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    GitHub Repository
    • Description: Contains data, processing and analysis code, initial exploratory figures, final publication figures, and final publication tables.
    • Link: https://github.com/cliffbueno/Sunflower_AEM
    • Note: A release of this repository has been archived on Zenodo with a stable DOI: https://zenodo.org/doi/10.5281/zenodo.12193724

  12. GitHubData

    • huggingface.co
    Updated Oct 16, 2024
    Cite
    ZelonPrograms (2024). GitHubData [Dataset]. https://huggingface.co/datasets/ZelonPrograms/GitHubData
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2024
    Authors
    ZelonPrograms
    Description

    GitHub Repository Data for Hugging Face

    This repository contains a dataset of information collected from various GitHub repositories. The data includes key metrics about popular repositories, suitable for analysis and machine learning tasks.

      Data Description
    

    The dataset contains the following fields for each repository:

    • url: The URL of the GitHub repository.
    • repo_name: The name of the repository.
    • description: A brief description of the repository.
    • stars: The number of… See the full description on the dataset page: https://huggingface.co/datasets/ZelonPrograms/GitHubData.

  13. github-issues

    • huggingface.co
    Updated Jan 30, 2024
    + more versions
    Cite
    Ilya Baksayar (2024). github-issues [Dataset]. https://huggingface.co/datasets/baksalyar/github-issues
    Explore at:
    Dataset updated
    Jan 30, 2024
    Authors
    Ilya Baksayar
    Description

    Dataset Card for github_issues

      Dataset Summary
    

    [More Information Needed]

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information Needed]

      Data Fields
    

    [More Information Needed]

      Data Splits
    

    [More Information Needed]

      Dataset Creation
    
    
    
    
    
      Curation Rationale
    

    [More Information Needed]

      Source Data… See the full description on the dataset page: https://huggingface.co/datasets/baksalyar/github-issues.
    
  14. Data from: Can we make it better? Assessing and improving quality of GitHub...

    • researchdata.smu.edu.sg
    zip
    Updated May 31, 2023
    Cite
    GEDE ARTHA AZRIADI PRANA (SMU) (2023). Data from: Can we make it better? Assessing and improving quality of GitHub repositories [Dataset]. http://doi.org/10.25440/smu.17073050.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    GEDE ARTHA AZRIADI PRANA (SMU)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the related dataset for the PhD dissertation by G. A. A. Prana, "Can We Make It Better? Assessing and Improving Quality of GitHub Repositories", available at https://ink.library.smu.edu.sg/etd_coll/373/The code hosting platform GitHub has gained immense popularity worldwide in recent years, with over 200 million repositories hosted as of June 2021. Due to its popularity, it has great potential to facilitate widespread improvements across many software projects. Naturally, GitHub has attracted much research attention, and the source code in the various repositories it hosts also provide opportunity to apply techniques and tools developed by software engineering researchers over the years. However, much of existing body of research applicable to GitHub focuses on code quality of the software projects and ways to improve them. Fewer work focus on potential ways to improve quality of GitHub repositories through other aspects, although quality of a software project on GitHub is also affected by factors outside a project's source code, such as documentation, the project's dependencies, and pool of contributors.The three works that form this dissertation focus on investigating aspects of GitHub repositories beyond the code quality, and identify specific potential improvements that can be applied to improve wide range of GitHub repositories. In the first work, we aim to systematically understand the content of README files in GitHub software projects, and develop a tool that can process them automatically. The work begins with a qualitative study involving 4,226 README file sections from 393 randomly-sampled GitHub repositories, which reveals that many README files contain the What'' andHow'' of the software project, but often do not contain the purpose and status of the project. This is followed by a development and evaluation of a multi-label classifier that can predict eight different README content categories with F1 of 0.746. 
From our subsequent evaluation of the classifier, which involve twenty software professionals, we find that adding labels generated by the classifier to README files ease information discovery.Our second work focuses on characteristics of vulnerabilities in open-source libraries used by 450 software projects on GitHub that are written in Java, Python, and Ruby. Using an industrial software composition analysis tool, we scanned every version of the projects after each commit made between November 1, 2017 and October 31, 2018. Our subsequent analyses on the discovered library names, versions, and associated vulnerabilities reveal, among others, that Denial of Service'' andInformation Disclosure'' vulnerability types are common. In addition, we also find that most of the vulnerabilities persist throughout the observation period, and that attributes such as project size, project popularity, and experience level of commit authors do not translate to better or worse handling of vulnerabilities in dependent libraries. Based on the findings in the second work, we list a number of implications for library users, library developers, as well as researchers, and provide several concrete recommendations. This includes recommendations to simplify projects' dependency sets, as well as to encourage research into ways to automatically recommend libraries known to be secure to developers.In our third work, we conduct a multi-region geographical analysis of gender inclusion on GitHub. We use a mixed-methods approach involving a quantitative analysis of commit authors of 21,456 project repositories, followed by a survey that is strategically targeted to developers in various regions worldwide and a qualitative analysis of the survey responses. Among other findings, we discover differences in diversity levels between regions, with Asia and Americas being highest. We also find no strong correlation between gender and geographic diversity of a repository's commit authors. 
    Further, from our survey respondents worldwide, we also identify barriers and motivations to contribute to open-source software. The results of this work provide insights into the current state of gender diversity in open-source software and potential ways to improve the participation of developers from under-represented regions and genders, and subsequently improve the open-source software community in general. Such potential ways include the creation of codes of conduct, proximity-based mentorship schemes, and the highlighting of women and regional role models.
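
    The F1 of 0.746 reported above is a multi-label metric. As a toy illustration of how such a score is computed (micro-averaged, with made-up labels unrelated to the dissertation's data), the arithmetic can be sketched as:

    ```python
    # Toy micro-averaged F1 for multi-label classification.
    # The label sets below are illustrative, not the dissertation's data.
    def micro_f1(true_sets, pred_sets):
        tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correct labels
        fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))  # spurious labels
        fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))  # missed labels
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Two README sections, each tagged with a set of content categories.
    true_labels = [{"What", "How"}, {"Why", "Status"}]
    predictions = [{"What", "How"}, {"Why"}]
    print(round(micro_f1(true_labels, predictions), 3))  # → 0.857
    ```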

  15. d

    Project GitHub

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Project GitHub [Dataset]. https://catalog.data.gov/dataset/project-github
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The release of CertWare was announced on 23 Mar 2012 on code.nasa.gov. The announcement points to the CertWare project on NASA's GitHub repository at nasa.github.com/CertWare. The project site contains installation instructions as an Eclipse feature, various tutorials and resources, and a link to the GitHub source repository. CertWare is released under the NASA Open Source Agreement (NOSA).

  16. Data from: Mining the Technical Roles of GitHub Users

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +2
    Updated Feb 24, 2022
    Cite
    João Eduardo; Luciana L.; Marco Tulio; João Eduardo; Luciana L.; Marco Tulio (2022). Mining the Technical Roles of GitHub Users [Dataset]. http://doi.org/10.5281/zenodo.3986172
    Explore at:
    csv, bin, text/x-python, txtAvailable download formats
    Dataset updated
    Feb 24, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    João Eduardo; Luciana L.; Marco Tulio; João Eduardo; Luciana L.; Marco Tulio
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the scripts and data used in the study reported in the paper Mining the Technical Roles of GitHub Users. The files are described in more detail below:

    • processed_ground_truth.csv: A CSV file with the information of the developers considered in the study. Due to privacy issues, we already preprocessed the dataset to remove identification clues. Please contact the authors in case you need the original one.
    • processed_ground_truth_fullstack.csv: Same CSV file but with fullstack developers.
    • script.ipynb, utils.py: Source code of the script used in our study.
    • Dockerfile, docker-compose.yml, requirements.txt: Files to replicate the code environment used in this study.
    • BoW-tuning.csv: List of classification results for different bag-of-words parameters.
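
    A file like BoW-tuning.csv can be scanned to pick the best-scoring parameter configuration. A minimal sketch, assuming hypothetical column names (max_features, ngram_range, f1) that may differ from those in the actual file:

    ```python
    import csv
    import io

    # Stand-in for BoW-tuning.csv; the column names and values here are
    # assumptions for illustration, not taken from the actual dataset.
    sample = io.StringIO(
        "max_features,ngram_range,f1\n"
        "1000,1,0.71\n"
        "5000,1,0.78\n"
        "5000,2,0.75\n"
    )
    rows = list(csv.DictReader(sample))
    best = max(rows, key=lambda r: float(r["f1"]))  # configuration with highest F1
    print(best["max_features"], best["f1"])  # → 5000 0.78
    ```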
  17. g

    Coronavirus COVID-19 Global Cases by the Center for Systems Science and...

    • github.com
    • systems.jhu.edu
    • +1more
    Cite
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
    Explore at:
    Dataset provided by
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
    Area covered
    Global
    Description

    2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
    https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    • Confirmed Cases by Country/Region/Sovereignty
    • Confirmed Cases by Province/State/Dependency
    • Deaths
    • Recovered

    Downloadable data:
    https://github.com/CSSEGISandData/COVID-19

    Additional Information about the Visual Dashboard:
    https://systems.jhu.edu/research/public-health/ncov

  18. GitSED: GitHub Socially Enhanced Dataset

    • zenodo.org
    xz
    Updated Jul 2, 2021
    Cite
    Gabriel P. Oliveira; Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro; Mirella M. Moro; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão (2021). GitSED: GitHub Socially Enhanced Dataset [Dataset]. http://doi.org/10.5281/zenodo.5021329
    Explore at:
    xzAvailable download formats
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gabriel P. Oliveira; Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro; Mirella M. Moro; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Engineering has evolved as a field to study not only the many ways software is created but also how it evolves, becomes successful, is effective and efficient in its objectives, satisfies its quality attributes, and much more. Nonetheless, there are still many open issues during its conception, development, and maintenance phases. Especially, understanding how developers collaborate may help in all such phases, but it is also challenging. Luckily, we may now explore a novel angle to deal with such a challenge: studying the social aspects of software development over social networks.

    With GitHub becoming the main representative of collaborative software development online tools, there are approaches to assess the follow-network, stargazer-network, and contributors-network. Moreover, having such networks built from real software projects offers support for relevant applications, such as detection of key developers, recommendation of collaboration among developers, detection of developer communities, and analyses of collaboration patterns in agile development.

    GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers’ interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data up to June 2019. It comprises:

    • 8,556,778 repositories
    • 32,411,674 developers
    • 6 programming languages (Assembly, JavaScript, Pascal, Python, Ruby, Visual Basic)
    • 13 collaboration metrics

    There are two previous versions of GitSED, which were originally built for the following conference papers:

    v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.

    v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.

  19. Data from: A Dataset of Bot and Human Activities in GitHub

    • zenodo.org
    json, txt
    Updated Jan 5, 2024
    + more versions
    Cite
    Natarajan Chidambaram; Natarajan Chidambaram; Alexandre Decan; Alexandre Decan; Tom Mens; Tom Mens (2024). A Dataset of Bot and Human Activities in GitHub [Dataset]. http://doi.org/10.5281/zenodo.8219470
    Explore at:
    json, txtAvailable download formats
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Natarajan Chidambaram; Natarajan Chidambaram; Alexandre Decan; Alexandre Decan; Tom Mens; Tom Mens
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Dataset of Bot and Human Activities in GitHub

    This repository provides an updated version of a dataset of GitHub contributor activities, accompanying a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work is done as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.

    The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from GitHub's Events API and cover 24 distinct activity types. This dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could also lead to better bot identification tools and empirical studies of how bots participate in collaborative software development.

    Files description

    The following files are provided as part of the archive:

    • bot_activities.json - A JSON file containing 754,165 activities made by 350 bot contributors;
    • human_activities.json - A JSON file containing 261,258 activities made by 620 human contributors (anonymized);
    • JsonSchema.json - A JSON schema that validates the above datasets;
    • bots.txt - A TEXT file containing the login names of all 350 bots

    Example

    Below is an example of a Closing pull request activity:

    {
     "date": "2022-11-25T18:49:09+00:00",
     "activity": "Closing pull request",
     "contributor": "typescript-bot",
     "repository": "DefinitelyTyped/DefinitelyTyped",
     "comment": {
       "length": 249,
       "GH_node": "IC_kwDOAFz6BM5PJG7l"
     },
     "pull_request": {
       "id": 62328,
       "title": "[qunit] Add `test.each()`",
       "created_at": "2022-09-19T17:34:28+00:00",
       "status": "closed",
       "closed_at": "2022-11-25T18:49:08+00:00",
       "merged": false,
       "GH_node": "PR_kwDOAFz6BM4_N5ib"
     },
     "conversation": {
       "comments": 19
     },
     "payload": {
       "pr_commits": 1,
       "pr_changed_files": 5
     }
    }
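
    Fields such as created_at and closed_at support simple temporal analyses. For instance, how long the pull request in the example above stayed open can be computed with the standard library (the record is trimmed here to the relevant fields):

    ```python
    from datetime import datetime

    # Subset of the "Closing pull request" activity shown above.
    activity = {
        "pull_request": {
            "created_at": "2022-09-19T17:34:28+00:00",
            "closed_at": "2022-11-25T18:49:08+00:00",
        }
    }
    pr = activity["pull_request"]
    # fromisoformat handles the ISO-8601 timestamps with UTC offsets used in the dataset.
    opened = datetime.fromisoformat(pr["created_at"])
    closed = datetime.fromisoformat(pr["closed_at"])
    print((closed - opened).days)  # → 67
    ```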

    List of activity types

    In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting repository, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.

    List of fields

    Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.

    For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the different fields that are provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the datasets.

    Properties

    • date
      • Date on which the activity is performed
      • Type: string
      • e.g., "2022-11-25T09:55:19+00:00"
      • String format must be a "date-time"

    • activity
      • The activity performed by the contributor
      • Type: string
      • e.g., "Commenting pull request"
    • contributor
      • The login name of the contributor who performed this activity
      • Type: string
      • e.g., "analysis-bot", "anonymised" in the case of a human contributor
    • repository
      • The repository in which the activity is performed
      • Type: string
      • e.g., "apache/spark", "anonymised" in the case of a human contributor
    • issue
      • Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
      • Type: object
      • Properties
        • id
          • Issue number
          • Type: integer
          • e.g., 35471
        • title
          • Issue title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this issue is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the issue
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this issue is closed. "null" will be provided if the issue is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • resolved
          • The issue is resolved or not_planned/still open
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this issue
          • Type: string
          • e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
    • pull_request
      • Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
      • Type: object
      • Properties
        • id
          • Pull request number
          • Type: integer
          • e.g., 35471
        • title
          • Pull request title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this pull request is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the pull request
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this pull request is closed. "null" will be provided if the pull request is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • merged
          • The PR is merged or rejected/still open
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this pull request
          • Type: string
          • e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
    • review
      • Pull request review information - provided for Reviewing code
      • Type: object
      • Properties
        • status
          • Status of the review
          • Type: string
          • "changes_requested" or "approved" or "dismissed"
        • GH_node
          • The GitHub node of this review
          • Type: string
          • e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
    • conversation
      • Comments information in issue or pull request - Provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
      • Type: object
      • Properties
        • comments
          • Number of comments present in the corresponding issue or pull request
          • Type: integer
          • e.g.,
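
    Since each record carries a contributor, a date, and an activity type, the per-contributor activity sequences that the dataset is designed to support can be reconstructed by sorting and grouping. A minimal sketch with made-up records in the dataset's shape:

    ```python
    from collections import defaultdict

    # Made-up activity records following the dataset's field layout.
    records = [
        {"contributor": "typescript-bot", "date": "2022-11-25T18:49:09+00:00",
         "activity": "Closing pull request"},
        {"contributor": "typescript-bot", "date": "2022-11-25T18:40:00+00:00",
         "activity": "Commenting pull request"},
        {"contributor": "alice", "date": "2022-11-26T08:00:00+00:00",
         "activity": "Opening issue"},
    ]

    # ISO-8601 timestamps with a fixed UTC offset sort chronologically as strings.
    sequences = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["date"]):
        sequences[rec["contributor"]].append(rec["activity"])

    print(sequences["typescript-bot"])
    # → ['Commenting pull request', 'Closing pull request']
    ```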

  20. g

    Detailed Epidemiological Data from the COVID-19 Outbreak

    • github.com
    • catalog.midasnetwork.us
    Cite
    Open COVID-19 Data Working Group, Detailed Epidemiological Data from the COVID-19 Outbreak [Dataset]. https://github.com/beoutbreakprepared/nCoV2019
    Explore at:
    Dataset provided by
    Open COVID-19 Data Working Group
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data and code repository for the Open COVID-19 Data Working Group: a global and multi-organizational initiative that aims to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
