GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analyzing large BigQuery datasets.
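A minimal sketch of such a query with the BigQuery Python client (table and column names follow the public github_repos schema and should be verified; sample_commits used here is a smaller sample of the full commits table):

from google.cloud import bigquery

client = bigquery.Client()

# Count commits per repository in the smaller sample_commits table.
query = """
SELECT repo_name, COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.sample_commits`
GROUP BY repo_name
ORDER BY commit_count DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.repo_name, row.commit_count)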
This dataset was made available per GitHub's terms of service, and is also available via Google Cloud Platform's Marketplace as GitHub Activity Data, part of GCP Public Datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The GitHub Code Clean dataset is a filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was obtained from the GitHub API and contains only public repository-level metadata. It may be useful for anyone interested in studying the GitHub ecosystem. It contains approximately 3.1 million entries.
The GitHub API Terms of Service apply.
You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.
Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.
{
  "owner": "pelmers",
  "name": "text-rewriter",
  "stars": 13,
  "forks": 5,
  "watchers": 4,
  "isFork": false,
  "isArchived": false,
  "languages": [
    { "name": "JavaScript", "size": 21769 },
    { "name": "HTML", "size": 2096 },
    { "name": "CSS", "size": 2081 }
  ],
  "languageCount": 3,
  "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
  "topicCount": 1,
  "diskUsageKb": 75,
  "pullRequests": 4,
  "issues": 12,
  "description": "Webextension to rewrite phrases in pages",
  "primaryLanguage": "JavaScript",
  "createdAt": "2015-03-14T22:35:11Z",
  "pushedAt": "2022-02-11T14:26:00Z",
  "defaultBranchCommitCount": 54,
  "license": null,
  "assignableUserCount": 1,
  "codeOfConduct": null,
  "forkingAllowed": true,
  "nameWithOwner": "pelmers/text-rewriter",
  "parent": null
}
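As a quick, hedged illustration of working with this format (the file name repo_metadata.json is an assumption; use whatever name the download provides), the entries can be loaded and aggregated with the standard library:

import json
from collections import Counter

# Load the JSON array of repository entries (file name is assumed).
with open("repo_metadata.json", encoding="utf-8") as f:
    repos = json.load(f)

# Count repositories by primary language, ignoring entries without one.
by_language = Counter(r.get("primaryLanguage") for r in repos if r.get("primaryLanguage"))
for language, count in by_language.most_common(10):
    print(language, count)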
The collection script and exploration notebook are also available on GitHub: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚀 GitHub Code 2025: The Clean Code Manifesto
A meticulously curated dataset of 1.5M+ repositories representing both quality and innovation in 2025's code ecosystem
🌟 The Philosophy
Quality Over Quantity, Purpose Over Volume. In an era of data abundance, we present a dataset built on radical curation. Every file, every repository, every byte has been carefully selected to represent the signal in the noise of open-source development.
🎯 What This Dataset Is… See the full description on the dataset page: https://huggingface.co/datasets/nick007x/github-code-2025.
Other license: https://choosealicense.com/licenses/other/
GitHub Code Dataset
Dataset Description
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.
How to use it
The GitHub Code dataset is a very large dataset, so for most use cases it is recommended to use the streaming API of the datasets library. You can load and iterate through the dataset with the following… See the full description on the dataset page: https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code.
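A minimal streaming sketch along those lines (the dataset id and field names follow the upstream codeparrot/github-code card and are assumptions to verify against the page above):

from datasets import load_dataset

# Stream records instead of downloading ~1TB up front.
ds = load_dataset("codeparrot/github-code", split="train", streaming=True)

for i, sample in enumerate(ds):
    # Fields such as "language", "path" and "code" follow the upstream dataset card.
    print(sample["language"], sample["path"], len(sample["code"]))
    if i >= 2:
        break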
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A common question for those both new to and familiar with computer science and software engineering is: what is the best and/or most popular programming language? It is very difficult to give a definitive answer, as there is a seemingly endless number of metrics that could define the 'best' or 'most popular' programming language.
One such metric is the number of projects and files written in a given language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages used for repositories, PRs, and issues on GitHub can be a good indicator of a language's popularity.
This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.
This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.
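For reference, an aggregation of that kind can be sketched against the public github_repos.languages table (a rough example that assumes the table's repeated language field carries name and bytes; the githubarchive side is omitted):

from google.cloud import bigquery

client = bigquery.Client()

# Count distinct repositories per language in the public github_repos dataset.
query = """
SELECT lang.name AS language, COUNT(DISTINCT repo_name) AS repo_count
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) AS lang
GROUP BY language
ORDER BY repo_count DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.language, row.repo_count)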
Only data for public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is only based on public repositories, which may not be fully representative of all repositories on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the SQL tables of the training and test datasets used in our experimentation. These tables contain the preprocessed textual data (in the form of tokens) extracted from each training and test project. Besides the preprocessed textual data, this dataset also contains metadata about the projects, GitHub topics, and GitHub collections. The GitHub projects are identified by the tuple "Owner" and "Name". The descriptions of the table fields are attached to their respective data descriptions.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📊 Metadata for 40 million GitHub repositories
A cleaned, analysis-ready dataset with per-repository statistics aggregated from GH Archive events: stars, forks, pull requests, open issues, visibility, language signals, and more. Column names mirror the GH Archive / GitHub API semantics where possible. GitHub repo: https://github.com/ibragim-bad/github-repos-metadata-40M
Source: GH Archive (public GitHub event stream).
🚀 Quickstart
from datasets import load_dataset
ds… See the full description on the dataset page: https://huggingface.co/datasets/ibragim-bad/github-repos-metadata-40M.
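A minimal sketch of the truncated quickstart (the split name and streaming flag are assumptions; check the dataset page above for the exact call):

from datasets import load_dataset

# Stream the repository metadata rather than materializing all ~40M rows.
ds = load_dataset("ibragim-bad/github-repos-metadata-40M", split="train", streaming=True)

for i, repo in enumerate(ds):
    print(repo)  # one repository's metadata record
    if i >= 2:
        break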
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset captures the metadata of 14,000+ repositories across GitHub. You’ll find everything from stars and forks to health scores, README previews, and language breakdowns.
It's ideal for:
- Identifying repo trends over time
- Comparing popular vs. low-engagement projects
- Exploring what makes a repo "healthy"
Perfect for learning data cleaning, analysis, and visualization using real-world project metadata.
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata: - We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of which projects are considered related.
The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.
The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
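A rough sketch of how the mapping file can be consumed (it assumes exactly two tab-separated columns, source then target, with no header row; verify against the file before relying on it):

# Build a lookup from a duplicated source project to its definitive target.
dedup = {}
with open("deduplicate_names", encoding="utf-8") as f:
    for line in f:
        source, target = line.rstrip("\n").split("\t")
        dedup[source] = target

# Example lookup; repository names use the login-name/project-name convention.
print(dedup.get("someuser/someproject", "not a known duplicate"))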
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Github is a dataset for object detection tasks - it contains Projects annotations for 848 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
In recent years, continuous integration and deployment (CI/CD) has become increasingly popular in both the open-source community and industry. Evaluating CI/CD performance is a critical aspect of software development, as it not only helps minimize execution costs but also ensures faster feedback for developers. Despite its importance, there is limited fine-grained knowledge about the performance of CI/CD processes—knowledge that is essential for identifying bottlenecks and optimization opportunities.
Moreover, the availability of large-scale, publicly accessible datasets of CI/CD logs remains scarce. The few datasets that do exist are often outdated and lack comprehensive coverage. To address this gap, we introduce a new dataset comprising 116k CI/CD workflows executed using GitHub Actions (GHA) across 25k public code projects spanning 20 different programming languages.
This dataset includes 513k workflow runs encompassing 2.3 million individual steps. For each workflow run, we provide detailed metadata along with complete run logs. To the best of our knowledge, this is the largest dataset of CI/CD runs that includes full log data. The inclusion of these logs enables more in-depth analysis of CI/CD pipelines, offering insights that cannot be gleaned solely from code repositories.
We postulate that this dataset will facilitate future CI/CD pipeline behavior research through log-based analysis. Potential applications include performance evaluation (e.g., measuring task execution times) and root cause analysis (e.g., identifying reasons for pipeline failures).
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
Properties
Possible Tasks
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Format
index.csv.gz - a comma-separated (CSV) file with 3 columns:
The flag is either "s" (readme found) or "r" (readme does not exist on the root directory level). Readme file name may be any from the list:
"README.md", "readme.md", "Readme.md", "README.MD", "README.txt", "readme.txt", "Readme.txt", "README.TXT", "README", "readme", "Readme", "README.rst", "readme.rst", "Readme.rst", "README.RST"
100 part-r-00xxx files are in "new" Hadoop API format with the following settings:
inputFormatClass is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
keyClass is org.apache.hadoop.io.Text - repository name
valueClass is org.apache.hadoop.io.BytesWritable - gzipped readme file
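A rough reading sketch, assuming PySpark is available and that the BytesWritable values deserialize to raw bytes (the class names below mirror the settings listed above; verify them against the actual files):

import gzip

from pyspark import SparkContext

sc = SparkContext(appName="read-readmes")

# Each record is (repository name, gzipped readme bytes).
rdd = sc.sequenceFile(
    "part-r-00000",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.BytesWritable",
)

def decompress(pair):
    repo, raw = pair
    text = gzip.decompress(bytes(raw)).decode("utf-8", errors="replace")
    return repo, text

for repo, readme in rdd.map(decompress).take(3):
    print(repo, readme[:80])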
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing GitHub issues (labeled with technical debt keywords) together with their comments. Both issues and comments include their GitHub reactions. The dataset is a MongoDB export in JSON format.
The CGP on GitHub is a repository of cataloging/metadata resources extracted from the bibliographic records of the Catalog of U.S. Government Publications (CGP). The CGP is the U.S. Government Publishing Office's (GPO) finding tool for publications of the executive, judicial, and legislative branches, and other entities of the U.S. Federal Government. The CGP records comprise the National Collection of U.S. Government Public Information and contain descriptive and subject information to enable the discovery of these resources. Many CGP records provide PURL (persistent uniform resource locator) links to the online versions of publications. For more information, please visit the CGP help pages.
Terms of use: https://webtechsurvey.com/terms
A complete list of live websites using the Github.com Repository technology, compiled through global website indexing conducted by WebTechSurvey.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data and code repository for the Open COVID-19 Data Working Group: a global and multi-organizational initiative that aims to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
GitHub is an online host of Git source code repositories.