100+ datasets found
  1. github-code

    • huggingface.co
    Cite
    CodeParrot, github-code [Dataset]. https://huggingface.co/datasets/codeparrot/github-code
    Explore at:
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.

  2. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analysis of large BigQuery datasets.
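    A query against these tables can be sketched as follows. This is a minimal illustration, not the dataset's own code: the helper name and the LIMIT clause are assumptions, while the table path bigquery-public-data.github_repos comes from the description above.

    ```python
    def github_repos_query(table, limit=10):
        """Build a Standard SQL query against a table of the public GitHub dataset.

        `table` is one of the dataset's table names, e.g. "languages" or "commits".
        """
        return (
            f"SELECT * FROM `bigquery-public-data.github_repos.{table}` "
            f"LIMIT {int(limit)}"
        )

    # Executing the query requires the BigQuery client library and credentials:
    #   from google.cloud import bigquery
    #   client = bigquery.Client()
    #   rows = client.query(github_repos_query("languages")).result()
    ```

    The LIMIT keeps exploratory queries cheap, which matters on a multi-terabyte public dataset billed by bytes scanned.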

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  3. github-code-clean

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2022
    Cite
    CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The GitHub Code Clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.

  4. GitHub-Python Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 15, 2021
    Cite
    Michihiro Yasunaga; Percy Liang (2021). GitHub-Python Dataset [Dataset]. https://paperswithcode.com/dataset/github-python
    Explore at:
    Dataset updated
    Jun 15, 2021
    Authors
    Michihiro Yasunaga; Percy Liang
    Description

    Repair AST parse (syntax) errors in Python code

  5. codeparrot-java-all

    • huggingface.co
    Updated Mar 17, 2022
    + more versions
    Cite
    Aditya Goswami (2022). codeparrot-java-all [Dataset]. https://huggingface.co/datasets/Aditya78b/codeparrot-java-all
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2022
    Authors
    Aditya Goswami
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Code Dataset

      Dataset Description
    

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

      How to use it
    

    The GitHub Code dataset is a very large dataset so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the… See the full description on the dataset page: https://huggingface.co/datasets/Aditya78b/codeparrot-java-all.
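    The streaming pattern recommended above can be sketched as follows; a minimal example assuming the Hugging Face datasets library is installed, with `take` as an illustrative helper (not part of the library):

    ```python
    from itertools import islice

    def take(stream, n):
        """Materialize the first n examples of a lazily streamed dataset."""
        return list(islice(stream, n))

    # With the `datasets` library, the dataset can be iterated without
    # downloading it in full (assumption: the default "train" split):
    #   from datasets import load_dataset
    #   ds = load_dataset("Aditya78b/codeparrot-java-all",
    #                     streaming=True, split="train")
    #   first_files = take(ds, 3)  # a few code-file records
    ```

    Streaming avoids materializing the full ~1TB corpus on disk; only the examples actually consumed are fetched.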

  6. Data from: Data Science Problems

    • github.com
    • opendatalab.com
    Updated Feb 8, 2022
    + more versions
    Cite
    (2022). Data Science Problems [Dataset]. https://github.com/microsoft/DataScienceProblems
    Explore at:
    Dataset updated
    Feb 8, 2022
    License

    https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt

    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state of the art results and other properties of the dataset.

  7. Data from: An Empirical Evaluation of GitHub Copilot’s Code Suggestions

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Nhan Nguyen; Sarah Nadi (2023). An Empirical Evaluation of GitHub Copilot’s Code Suggestions [Dataset]. http://doi.org/10.6084/m9.figshare.18515141.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Nhan Nguyen; Sarah Nadi
    License

    https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    Artifact accompanying MSR 2022 paper titled "An Empirical Evaluation of GitHub Copilot’s Code Suggestions" by Nhan Nguyen and Sarah Nadi

  8. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    • explore.openaire.eu
    pdf, zip
    Updated Jul 11, 2024
    + more versions
    Cite
    Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sheeba Samuel; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We reuse the Jupyter-notebook reproducibility code from the study by Pimentel et al. (2019) and adapt code from ReproduceMeGit. We also provide code for collecting publication metadata from PubMed Central using the NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
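    The link-extraction step can be sketched with a small regular expression. This is an illustrative helper, not the study's actual code, and it assumes repository links of the form github.com/owner/repo:

    ```python
    import re

    # Matches http(s) GitHub repository URLs; dots and hyphens are allowed in
    # owner and repository names.
    GITHUB_LINK = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")

    def extract_github_links(text):
        """Collect unique GitHub repository links from an article's full text."""
        # rstrip(".") drops sentence-final periods that the character class
        # would otherwise swallow.
        return sorted({link.rstrip(".") for link in GITHUB_LINK.findall(text)})
    ```

    Applied to a scanned article body, duplicate mentions collapse to one entry and trailing sentence punctuation is stripped.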

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find notebooks instrumental in the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology folder is stored in the analyses folder for further analysis. The path can, however, be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) is focused on examining data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) is for analyzing data associated with publications in PubMed Central, i.e., for plots involving data about articles, journals, publication dates, or research fields. The resultant figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.
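    Since everything ends up in db.sqlite, the tables can be inspected with Python's built-in sqlite3 module; a minimal sketch (the helper name is illustrative, not from the study's code):

    ```python
    import sqlite3

    def list_tables(db_path):
        """Return the names of all tables in an SQLite database such as db.sqlite."""
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            ).fetchall()
        return [name for (name,) in rows]

    # Against the study's db.sqlite this should list the 24 tables described
    # above (articles, journals, repositories, notebooks, cells, ...).
    ```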

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch Jupyterlab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.


  9. Code review regression analysis of open source GitHub projects

    • datadryad.org
    zip
    Updated Aug 31, 2017
    Cite
    Christopher Thompson; David Wagner (2017). Code review regression analysis of open source GitHub projects [Dataset]. http://doi.org/10.6078/D14X0T
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 31, 2017
    Dataset provided by
    Dryad
    Authors
    Christopher Thompson; David Wagner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2017
    Description

    This dataset contains the repository data used for our study "A Large-Scale Study of Modern Code Review and Security in Open Source Projects". This dataset was collected from GitHub, and includes 3,126 projects in 143 languages, with 489,038 issues and 382,771 pull requests. We also include the regression analysis notebooks for reproducing our results from this data.

  10. PIPr: A Dataset of Public Infrastructure as Code Programs

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 28, 2023
    Cite
    Spielmann, David (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8262770
    Explore at:
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Spielmann, David
    Sokolowski, Daniel
    Salvaneschi, Guido
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    • Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
    • AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
    • CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Names of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of each repository, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
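    To illustrate working with the metadata, the following sketch filters repositories.csv for repositories whose licenses permit redistribution. The column names follow the schema above; the function name and the literal True/False encoding of the boolean column are assumptions.

    ```python
    import csv
    import io

    def redistributable_repos(csv_text):
        """Yield URLs of repositories flagged as redistributable in repositories.csv."""
        for row in csv.DictReader(io.StringIO(csv_text)):
            # Assumption: booleans are serialized as "True"/"False".
            if row["redistributable"].strip().lower() == "true":
                yield row["url"]
    ```

    The same pattern applies to programs.csv and testing-files.csv, e.g. filtering programs by their PL-IaC solution.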

  11. Data from: Data and Code from: Environment, plant genetics, and their...

    • catalog.data.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and Code from: Environment, plant genetics, and their interaction shape important aspects of sunflower rhizosphere microbial communities [GitHub repository] [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environment-plant-genetics-and-their-interaction-shape-important-aspect
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    GitHub Repository
    • Description: Contains data, processing and analysis code, initial exploratory figures, final publication figures, and final publication tables.
    • Link: https://github.com/cliffbueno/Sunflower_AEM
    • Note: A release of this repository has been archived on Zenodo with a stable DOI: https://zenodo.org/doi/10.5281/zenodo.12193724

  12. GitHubData

    • huggingface.co
    Updated Oct 16, 2024
    Cite
    ZelonPrograms (2024). GitHubData [Dataset]. https://huggingface.co/datasets/ZelonPrograms/GitHubData
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2024
    Authors
    ZelonPrograms
    Description

    GitHub Repository Data for Hugging Face

    This repository contains a dataset of information collected from various GitHub repositories. The data includes key metrics about popular repositories, suitable for analysis and machine learning tasks.

      Data Description
    

    The dataset contains the following fields for each repository:

    • url: The URL of the GitHub repository.
    • repo_name: The name of the repository.
    • description: A brief description of the repository.
    • stars: The number of… See the full description on the dataset page: https://huggingface.co/datasets/ZelonPrograms/GitHubData.

  13. github-issues

    • huggingface.co
    Updated Jan 30, 2024
    + more versions
    Cite
    Ilya Baksayar (2024). github-issues [Dataset]. https://huggingface.co/datasets/baksalyar/github-issues
    Explore at:
    Dataset updated
    Jan 30, 2024
    Authors
    Ilya Baksayar
    Description

    Dataset Card for github_issues

      Dataset Summary
    

    [More Information Needed]

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information Needed]

      Data Fields
    

    [More Information Needed]

      Data Splits
    

    [More Information Needed]

      Dataset Creation
    
    
    
    
    
      Curation Rationale
    

    [More Information Needed]

      Source Data… See the full description on the dataset page: https://huggingface.co/datasets/baksalyar/github-issues.
    
  14. Data from: Can we make it better? Assessing and improving quality of GitHub...

    • researchdata.smu.edu.sg
    zip
    Updated May 31, 2023
    Cite
    GEDE ARTHA AZRIADI PRANA (SMU) (2023). Data from: Can we make it better? Assessing and improving quality of GitHub repositories [Dataset]. http://doi.org/10.25440/smu.17073050.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    GEDE ARTHA AZRIADI PRANA (SMU)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the related dataset for the PhD dissertation by G. A. A. Prana, "Can We Make It Better? Assessing and Improving Quality of GitHub Repositories", available at https://ink.library.smu.edu.sg/etd_coll/373/The code hosting platform GitHub has gained immense popularity worldwide in recent years, with over 200 million repositories hosted as of June 2021. Due to its popularity, it has great potential to facilitate widespread improvements across many software projects. Naturally, GitHub has attracted much research attention, and the source code in the various repositories it hosts also provide opportunity to apply techniques and tools developed by software engineering researchers over the years. However, much of existing body of research applicable to GitHub focuses on code quality of the software projects and ways to improve them. Fewer work focus on potential ways to improve quality of GitHub repositories through other aspects, although quality of a software project on GitHub is also affected by factors outside a project's source code, such as documentation, the project's dependencies, and pool of contributors.The three works that form this dissertation focus on investigating aspects of GitHub repositories beyond the code quality, and identify specific potential improvements that can be applied to improve wide range of GitHub repositories. In the first work, we aim to systematically understand the content of README files in GitHub software projects, and develop a tool that can process them automatically. The work begins with a qualitative study involving 4,226 README file sections from 393 randomly-sampled GitHub repositories, which reveals that many README files contain the What'' andHow'' of the software project, but often do not contain the purpose and status of the project. This is followed by a development and evaluation of a multi-label classifier that can predict eight different README content categories with F1 of 0.746. 
From our subsequent evaluation of the classifier, which involve twenty software professionals, we find that adding labels generated by the classifier to README files ease information discovery.Our second work focuses on characteristics of vulnerabilities in open-source libraries used by 450 software projects on GitHub that are written in Java, Python, and Ruby. Using an industrial software composition analysis tool, we scanned every version of the projects after each commit made between November 1, 2017 and October 31, 2018. Our subsequent analyses on the discovered library names, versions, and associated vulnerabilities reveal, among others, that Denial of Service'' andInformation Disclosure'' vulnerability types are common. In addition, we also find that most of the vulnerabilities persist throughout the observation period, and that attributes such as project size, project popularity, and experience level of commit authors do not translate to better or worse handling of vulnerabilities in dependent libraries. Based on the findings in the second work, we list a number of implications for library users, library developers, as well as researchers, and provide several concrete recommendations. This includes recommendations to simplify projects' dependency sets, as well as to encourage research into ways to automatically recommend libraries known to be secure to developers.In our third work, we conduct a multi-region geographical analysis of gender inclusion on GitHub. We use a mixed-methods approach involving a quantitative analysis of commit authors of 21,456 project repositories, followed by a survey that is strategically targeted to developers in various regions worldwide and a qualitative analysis of the survey responses. Among other findings, we discover differences in diversity levels between regions, with Asia and Americas being highest. We also find no strong correlation between gender and geographic diversity of a repository's commit authors. 
    Further, from our survey respondents worldwide, we also identify barriers and motivations to contribute to open-source software. The results of this work provide insights into the current state of gender diversity in open-source software and potential ways to improve the participation of developers from under-represented regions and genders, and subsequently improve the open-source software community in general. Such potential ways include the creation of codes of conduct, proximity-based mentorship schemes, and the highlighting of women and regional role models.
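
    The F1 of 0.746 reported above is a multi-label metric. As a toy illustration of how such a score is computed (micro-averaged, with made-up labels unrelated to the dissertation's data), the arithmetic can be sketched as:

    ```python
    # Toy micro-averaged F1 for multi-label classification.
    # The label sets below are illustrative, not the dissertation's data.
    def micro_f1(true_sets, pred_sets):
        tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correct labels
        fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))  # spurious labels
        fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))  # missed labels
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Two README sections, each tagged with a set of content categories.
    true_labels = [{"What", "How"}, {"Why", "Status"}]
    predictions = [{"What", "How"}, {"Why"}]
    print(round(micro_f1(true_labels, predictions), 3))  # → 0.857
    ```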

  15. d

    Project GitHub

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Project GitHub [Dataset]. https://catalog.data.gov/dataset/project-github
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The release of CertWare was announced on 23 Mar 2012 on code.nasa.gov. The announcement points to the CertWare project on NASA's GitHub repository at nasa.github.com/CertWare. The project site contains installation instructions as an Eclipse feature, various tutorials and resources, and a link to the GitHub source repository. CertWare is released under the NASA Open Source Agreement (NOSA).

  16. Data from: Mining the Technical Roles of GitHub Users

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +2
    Updated Feb 24, 2022
    Cite
    João Eduardo; Luciana L.; Marco Tulio; João Eduardo; Luciana L.; Marco Tulio (2022). Mining the Technical Roles of GitHub Users [Dataset]. http://doi.org/10.5281/zenodo.3986172
    Explore at:
    csv, bin, text/x-python, txtAvailable download formats
    Dataset updated
    Feb 24, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    João Eduardo; Luciana L.; Marco Tulio; João Eduardo; Luciana L.; Marco Tulio
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the scripts and data used in the study reported in the paper Mining the Technical Roles of GitHub Users. The files are described in more detail below:

    • processed_ground_truth.csv: A CSV file with the information of the developers considered in the study. Due to privacy issues, we already preprocessed the dataset to remove identification clues. Please contact the authors in case you need the original one.
    • processed_ground_truth_fullstack.csv: Same CSV file but with fullstack developers.
    • script.ipynb, utils.py: Source code of the script used in our study.
    • Dockerfile, docker-compose.yml, requirements.txt: Files to replicate the code environment used in this study.
    • BoW-tuning.csv: List of classification results for different bag-of-words parameters.
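
    A file like BoW-tuning.csv can be scanned to pick the best-scoring parameter configuration. A minimal sketch, assuming hypothetical column names (max_features, ngram_range, f1) that may differ from those in the actual file:

    ```python
    import csv
    import io

    # Stand-in for BoW-tuning.csv; the column names and values here are
    # assumptions for illustration, not taken from the actual dataset.
    sample = io.StringIO(
        "max_features,ngram_range,f1\n"
        "1000,1,0.71\n"
        "5000,1,0.78\n"
        "5000,2,0.75\n"
    )
    rows = list(csv.DictReader(sample))
    best = max(rows, key=lambda r: float(r["f1"]))  # configuration with highest F1
    print(best["max_features"], best["f1"])  # → 5000 0.78
    ```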
  17. g

    Coronavirus COVID-19 Global Cases by the Center for Systems Science and...

    • github.com
    • systems.jhu.edu
    • +1more
    Cite
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Dataset]. https://github.com/CSSEGISandData/COVID-19
    Explore at:
    Dataset provided by
    Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
    Area covered
    Global
    Description

    2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
    https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    • Confirmed Cases by Country/Region/Sovereignty
    • Confirmed Cases by Province/State/Dependency
    • Deaths
    • Recovered

    Downloadable data:
    https://github.com/CSSEGISandData/COVID-19

    Additional Information about the Visual Dashboard:
    https://systems.jhu.edu/research/public-health/ncov

  18. GitSED: GitHub Socially Enhanced Dataset

    • zenodo.org
    xz
    Updated Jul 2, 2021
    Cite
    Gabriel P. Oliveira; Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro; Mirella M. Moro; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão (2021). GitSED: GitHub Socially Enhanced Dataset [Dataset]. http://doi.org/10.5281/zenodo.5021329
    Explore at:
    xzAvailable download formats
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gabriel P. Oliveira; Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro; Mirella M. Moro; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Engineering has evolved as a field to study not only the many ways software is created but also how it evolves, becomes successful, is effective and efficient in its objectives, satisfies its quality attributes, and much more. Nonetheless, there are still many open issues during its conception, development, and maintenance phases. Especially, understanding how developers collaborate may help in all such phases, but it is also challenging. Luckily, we may now explore a novel angle to deal with such a challenge: studying the social aspects of software development over social networks.

    With GitHub becoming the main representative of collaborative software development online tools, there are approaches to assess the follow-network, stargazer-network, and contributors-network. Moreover, having such networks built from real software projects offers support for relevant applications, such as detection of key developers, recommendation of collaboration among developers, detection of developer communities, and analyses of collaboration patterns in agile development.

    GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers’ interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data up to June 2019. It comprises:

    • 8,556,778 repositories
    • 32,411,674 developers
    • 6 programming languages (Assembly, JavaScript, Pascal, Python, Ruby, Visual Basic)
    • 13 collaboration metrics

    There are two previous versions of GitSED, which were originally built for the following conference papers:

    v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.

    v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.

  19. Data from: A Dataset of Bot and Human Activities in GitHub

    • zenodo.org
    json, txt
    Updated Jan 5, 2024
    + more versions
    Cite
    Natarajan Chidambaram; Natarajan Chidambaram; Alexandre Decan; Alexandre Decan; Tom Mens; Tom Mens (2024). A Dataset of Bot and Human Activities in GitHub [Dataset]. http://doi.org/10.5281/zenodo.8219470
    Explore at:
    json, txtAvailable download formats
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Natarajan Chidambaram; Natarajan Chidambaram; Alexandre Decan; Alexandre Decan; Tom Mens; Tom Mens
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Dataset of Bot and Human Activities in GitHub

    This repository provides an updated version of a dataset of GitHub contributor activities, accompanying a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work is done as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.

    The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from GitHub's Events API and cover 24 distinct activity types. This dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could also lead to better bot identification tools and empirical studies of how bots participate in collaborative software development.

    Files description

    The following files are provided as part of the archive:

    • bot_activities.json - A JSON file containing 754,165 activities made by 350 bot contributors;
    • human_activities.json - A JSON file containing 261,258 activities made by 620 human contributors (anonymized);
    • JsonSchema.json - A JSON schema that validates the above datasets;
    • bots.txt - A TEXT file containing the login names of all 350 bots

    Example

    Below is an example of a Closing pull request activity:

    {
     "date": "2022-11-25T18:49:09+00:00",
     "activity": "Closing pull request",
     "contributor": "typescript-bot",
     "repository": "DefinitelyTyped/DefinitelyTyped",
     "comment": {
       "length": 249,
       "GH_node": "IC_kwDOAFz6BM5PJG7l"
     },
     "pull_request": {
       "id": 62328,
       "title": "[qunit] Add `test.each()`",
       "created_at": "2022-09-19T17:34:28+00:00",
       "status": "closed",
       "closed_at": "2022-11-25T18:49:08+00:00",
       "merged": false,
       "GH_node": "PR_kwDOAFz6BM4_N5ib"
     },
     "conversation": {
       "comments": 19
     },
     "payload": {
       "pr_commits": 1,
       "pr_changed_files": 5
     }
    }
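
    Fields such as created_at and closed_at support simple temporal analyses. For instance, how long the pull request in the example above stayed open can be computed with the standard library (the record is trimmed here to the relevant fields):

    ```python
    from datetime import datetime

    # Subset of the "Closing pull request" activity shown above.
    activity = {
        "pull_request": {
            "created_at": "2022-09-19T17:34:28+00:00",
            "closed_at": "2022-11-25T18:49:08+00:00",
        }
    }
    pr = activity["pull_request"]
    # fromisoformat handles the ISO-8601 timestamps with UTC offsets used in the dataset.
    opened = datetime.fromisoformat(pr["created_at"])
    closed = datetime.fromisoformat(pr["closed_at"])
    print((closed - opened).days)  # → 67
    ```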

    List of activity types

    In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting repository, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.

    List of fields

    Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.

    For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the different fields that are provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the datasets.

    Properties

    • date
      • Date on which the activity is performed
      • Type: string
      • e.g., "2022-11-25T09:55:19+00:00"
      • String format must be a "date-time"

    • activity
      • The activity performed by the contributor
      • Type: string
      • e.g., "Commenting pull request"
    • contributor
      • The login name of the contributor who performed this activity
      • Type: string
      • e.g., "analysis-bot", "anonymised" in the case of a human contributor
    • repository
      • The repository in which the activity is performed
      • Type: string
      • e.g., "apache/spark", "anonymised" in the case of a human contributor
    • issue
      • Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
      • Type: object
      • Properties
        • id
          • Issue number
          • Type: integer
          • e.g., 35471
        • title
          • Issue title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this issue is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the issue
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this issue is closed. "null" will be provided if the issue is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • resolved
          • The issue is resolved or not_planned/still open
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this issue
          • Type: string
          • e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
    • pull_request
      • Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
      • Type: object
      • Properties
        • id
          • Pull request number
          • Type: integer
          • e.g., 35471
        • title
          • Pull request title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this pull request is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the pull request
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this pull request is closed. "null" will be provided if the pull request is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • merged
          • The PR is merged or rejected/still open
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this pull request
          • Type: string
          • e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
    • review
      • Pull request review information - provided for Reviewing code
      • Type: object
      • Properties
        • status
          • Status of the review
          • Type: string
          • "changes_requested" or "approved" or "dismissed"
        • GH_node
          • The GitHub node of this review
          • Type: string
          • e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
    • conversation
      • Comments information in issue or pull request - Provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
      • Type: object
      • Properties
        • comments
          • Number of comments present in the corresponding issue or pull request
          • Type: integer
          • e.g.,
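
    Since each record carries a contributor, a date, and an activity type, the per-contributor activity sequences that the dataset is designed to support can be reconstructed by sorting and grouping. A minimal sketch with made-up records in the dataset's shape:

    ```python
    from collections import defaultdict

    # Made-up activity records following the dataset's field layout.
    records = [
        {"contributor": "typescript-bot", "date": "2022-11-25T18:49:09+00:00",
         "activity": "Closing pull request"},
        {"contributor": "typescript-bot", "date": "2022-11-25T18:40:00+00:00",
         "activity": "Commenting pull request"},
        {"contributor": "alice", "date": "2022-11-26T08:00:00+00:00",
         "activity": "Opening issue"},
    ]

    # ISO-8601 timestamps with a fixed UTC offset sort chronologically as strings.
    sequences = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["date"]):
        sequences[rec["contributor"]].append(rec["activity"])

    print(sequences["typescript-bot"])
    # → ['Commenting pull request', 'Closing pull request']
    ```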

  20. g

    Detailed Epidemiological Data from the COVID-19 Outbreak

    • github.com
    • catalog.midasnetwork.us
    Cite
    Open COVID-19 Data Working Group, Detailed Epidemiological Data from the COVID-19 Outbreak [Dataset]. https://github.com/beoutbreakprepared/nCoV2019
    Explore at:
    Dataset provided by
    Open COVID-19 Data Working Group
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data and code repository for the Open COVID-19 Data Working Group: a global and multi-organizational initiative that aims to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
