Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a pull from GHTorrent, converted to the Feather format. It was used in https://github.com/UBC-MDS/RStudio-GitHub-Analysis.
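As a minimal sketch of working with the Feather format (the actual file name inside the dataset is not listed here; `ghtorrent_pull.feather` below is a placeholder), the data can be loaded with pandas:
```python
# Minimal sketch: load a Feather file from this dataset with pandas (requires pyarrow).
# "ghtorrent_pull.feather" is a placeholder name; substitute the actual file.
import pandas as pd

df = pd.read_feather("ghtorrent_pull.feather")
print(df.shape)
print(df.head())
```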
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A hypergraph dataset mined from the GHTorrent project is presented. The dataset contains two files:
1. project_members.txt: Contains GitHub projects with at least 2 contributors and the corresponding contributors (as a hyperedge). The format of the data is:
2. num_followers.txt: Contains all GitHub users and their number of followers.
The artifact also contains the SQL queries used to obtain the data from GHTorrent (schema).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset for the paper: G. Gousios, M.-A. Storey, and A. Bacchelli, “Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective,” in Proceedings of the 38th International Conference on Software Engineering, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.
We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:
To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.
We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:
In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.
This dataset contains the following files:
tr_projects_sample_filtered_2.csv
A CSV file with information about the 113 selected projects.
tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
Two CSV files with information about all commits to the default branch before and after the first CI build, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (a brief loading example follows the column list):
project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
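A minimal loading sketch, assuming pandas is available and the files contain a header row matching the columns documented above (adjust the path to wherever the CSV files were extracted):
```python
# Minimal sketch: load one of the commit CSV files and summarize per-project activity.
# Assumes a header row with the columns documented above; adjust the path as needed.
import pandas as pd

commits = pd.read_csv("tr_sample_commits_default_branch_before_ci.csv")

summary = (
    commits.groupby("project")
    .agg(commit_count=("hash_value", "count"),
         lines_added=("lines_added", "sum"),
         lines_deleted=("lines_deleted", "sum"))
    .sort_values("commit_count", ascending=False)
)
print(summary.head())
```
The same pattern applies to the `during_ci` file.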
tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
Two CSV files with information about all merges into the default branch before and after the first CI build, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (a brief example follows the column list):
project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).
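A brief, hedged example of using the `pull_request_id` column (again assuming pandas and a header row matching the columns above) to estimate how many merges reference a GitHub pull request:
```python
# Sketch: estimate how many merges into the default branch came from pull requests.
# Assumes a header row matching the documented columns; pull_request_id is empty
# when no pull request could be extracted from the merge log message.
import pandas as pd

merges = pd.read_csv("tr_sample_merges_default_branch_during_ci.csv")

pr_merges = merges["pull_request_id"].notna() & (merges["pull_request_id"].astype(str) != "")
print(f"{pr_merges.sum()} of {len(merges)} merges reference a pull request "
      f"({pr_merges.mean():.1%}).")
```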
comparison_project_sample_800.csv
A CSV file with information about the 800 projects in the comparison sample.
commits_default_branch_before_mid.csv
commits_default_branch_after_mid.csv
Two CSV files with information about all commits to the default branch before and after the median date of the commit history, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commit tables described above.
merges_default_branch_before_mid.csv
merges_default_branch_after_mid.csv
Two CSV files with information about all merges into the default branch before and after the median date of the commit history, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for the top programming languages on GitHub, which gives 90 projects and their forks. For each project, we retrieved all data, including issues, pull requests, organizations, followers, stars, and labels (milestones and events not included). The dataset was constructed from scratch to ensure it contains the latest information.
More information at http://openscience.us/repo/msr/msr14.html.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This package contains the raw open data for the study:
Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Roberto Tonelli. 2020. How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. Under Review.
The dataset is based on GHTorrent dataset:
Georgios Gousios. 2013. The GHTorent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR ’13). IEEE Press, 233–236
And released with the same license (CC BY-SA 4.0).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information about 327,436 potential BPMN artifacts identified in all public GitHub repositories referenced in the GHTorrent dump from March 2021.
The data file is in line-delimited JSON format, with each row containing an array with the following six elements:
To get a list of retrievable URLs, use e.g. the following Python one-liner:
python3 -c 'import json; import sys; print(*[f"https://raw.githubusercontent.com/{u}/{r}/{b}/{f}" for _, u, r, b, f, _ in map(json.loads, sys.stdin)], sep="\n")' < bpmn-artifacts.jsonl > urls.txt
(using the hashes to filter out duplicates first is recommended though)
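A sketch of such a deduplication step (our assumption: the final element of each six-element array is the content hash, i.e. one of the two positions the one-liner above ignores; adjust the index if the actual element order differs):
```python
# Sketch: emit one raw.githubusercontent.com URL per unique content hash.
# Assumption: the last element of each JSON array is the file's content hash;
# adjust the unpacking below if the actual element order differs.
import json
import sys

seen = set()
for line in sys.stdin:
    _, user, repo, branch, path, digest = json.loads(line)
    if digest in seen:
        continue
    seen.add(digest)
    print(f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}")
```
It can be run as, e.g., `python3 dedup_urls.py < bpmn-artifacts.jsonl > urls.txt`, where `dedup_urls.py` is a hypothetical name for the script above.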
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the replication package for our paper on dependency smells.
Here is a short description of what is contained in this package:
Code
This folder contains the code used for extracting, parsing, and analyzing the smells in the dataset, along with the statistical analyses. The "parser.py" file parses project information (such as package.json files) and loads it into the databases. The "analyzer.py" file is responsible for the majority of the empirical analyses.
Datasets
This folder contains the intermediate datasets created and used in our analyses. The "smelldataset.db" file contains all smelly and clean dependencies for the latest snapshot. The "smell_counts.csv" file contains smell statistics for the projects in our dataset. The "changehistory.db" file contains the historical smell statistics over a period of 5 years. The code also requires the GHTorrent dataset, available at: https://ghtorrent.org/downloads.html.
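The database schemas are not documented here; a small exploratory sketch (assuming the .db files are SQLite databases, which the extension suggests but which is not stated above) to list their tables:
```python
# Exploratory sketch: list the tables inside one of the .db files.
# Assumes the file is an SQLite database; the schema itself is not documented here.
import sqlite3

with sqlite3.connect("smelldataset.db") as conn:
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()

for (name,) in tables:
    print(name)
```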
Survey Questionnaires and Responses
These two folders contain the full set of questions that we asked the developers in our surveys along with the responses for survey 2.
Tool
This is the published tool, which is also available at: https://github.com/abbasjavan/DependencySniffer
Visualization Scripts
This folder contains the scripts used to create the figures for the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 70,427 cross-linked Twitter-GHTorrent user pairs identified as likely belonging to the same users. The dataset accompanies our research paper (PDF preprint here):
@inproceedings{fang2020tweet,
author = {Fang, Hongbo and Klug, Daniel and Lamba, Hemank and Herbsleb, James and Vasilescu, Bogdan},
title = {Need for Tweet: How Open Source Developers Talk About Their GitHub Work on Twitter},
booktitle = {International Conference on Mining Software Repositories (MSR)},
year = {2020},
pages = {to appear},
publisher = {ACM},
}
The data cannot be used for any purpose other than conducting research.
Due to privacy concerns, we only release the user IDs in Twitter and GHTorrent, respectively. We expect that users of this dataset will be able to collect other data using the Twitter API and GHTorrent, as needed. Please see below for an example.
To query the Twitter API for a given user_id, you can:
Apply for a Twitter developer account.
Create an APP with your Twitter developer account, and create an "API key" and "API secret key".
Obtain an access token. Given the previous API keys, run:
curl -u "
The response looks like this: {"token_type":"bearer","access_token":"<...>"}
Copy the "access_token".
Given the previous access token, run:
curl --request GET --url "https://api.twitter.com/1.1/users/show.json?user_id=<user_id>" --header "Authorization: Bearer <access_token>"
The GHTorrent user ids map to the users table in the MySQL version of GHTorrent. To use GHTorrent, please follow instructions on the GHTorrent website.
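For the GHTorrent side, a hedged sketch of resolving a user ID against a local MySQL restore of GHTorrent (host, credentials, and database name are placeholders; the `users` table is assumed to expose at least `id` and `login`):
```python
# Sketch: look up a GHTorrent user id in a local MySQL restore of GHTorrent.
# Host, credentials, database name, and the example id are placeholders.
import pymysql

connection = pymysql.connect(host="localhost", user="ghtorrentuser",
                             password="ghtorrentpassword", database="ghtorrent")
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, login FROM users WHERE id = %s", (9001,))
        print(cursor.fetchone())
finally:
    connection.close()
```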
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is collected from the GitHub API and the GHTorrent dataset. A brief description of each folder is provided below:
1. "Dataset_and_Code" folder
Contains the final dataset and algorithms
2. "Test_parameters" folders
Includes datasets under different parameters and the corresponding reproduction code, which corresponds to the first experiment of RQ1
3. "Compare_baseline"folder
Includes the dataset used by our method, the dataset used by the baseline method, and the reproduction code, corresponding to the second experiment of RQ1
4. "PLS" folder
Includes the dataset used by PLS and the corresponding reproduction code, which corresponds to experiment of RQ2
5. " Indicator_Calculation " folder
It contains the calculation methods for various metrics in the paper, as well as the corresponding key files.
6. " Appendix " folder
It includes supplementary materials such as the methodology for metric calculations to address the reviewers' questions.
Note
We have provided a corresponding README file in each folder to help others reproduce our results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cross-Platform Software Developer Expertise Learning by Norbert Eke
This data set is part of my Master's thesis project on developer expertise learning by mining Stack Overflow (SOTorrent) and GitHub (GHTorrent) data. Check out my portfolio website at norberte.github.io.
GNU General Public License v3.0 (GPL-3.0): https://www.gnu.org/licenses/gpl-3.0.html
Replication package for the paper O. Kuhejda and B. Rossi, "Pull Requests Acceptance: A Study Across Programming Languages", accepted at the 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA'23).
Content
Projects.zip: JSON files containing the data mined from GitHub and GHTorrent for the analysis.
Scripts.zip: source code used for the data mining process, running of linters, and classification analysis (Python/R). For instructions and prerequisites, please refer to the READMEs: scripts/README.org and scripts/git-contrast/README.org. Please note that the script pr_classification.py is a modified version of the file created by Lenarduzzi et al. under the CC BY 4.0 license. The original file is available at https://figshare.com/s/d47b6f238b5c92430dd7?file=14949029
*_projects.png: tables with descriptive statistics of the projects analyzed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
## Data collection and processing
The dataset is mainly collected from existing datasets. We used data from:
- the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
- the archive of Hacker News on Google BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
- the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
- the [GHTorrent](http://ghtorrent.org) project
- the [GH Archive](https://www.gharchive.org)
The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
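As a small illustration (Python is our choice here; whether the original matching was case-insensitive is our assumption, not stated above), the same pattern can be applied to free text:

```python
# Illustration: match variations of "technical debt" / "tech debt" with the regex stated above.
# Case-insensitive matching is an assumption on our side.
import re

TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

samples = [
    "We need to pay down our tech-debt before the next release.",
    "Technical debt explained",
    "TD is piling up",  # deliberately not matched: short forms are excluded
]
for text in samples:
    print(bool(TD_PATTERN.search(text)), "-", text)
```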
## Data Format
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
- `id`: the id used in the original source. We use the URL path to identify Medium posts.
- `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
- `created_utc`: the time the item was posted in seconds since epoch in UTC.
- `author`: the author of the item. We use the username or userid from the source.
- `source`: where the item was posted. Valid sources are:
- HackerNews Comment
- HackerNews Job
- HackerNews Submission
- Reddit Comment
- Reddit Submission
- StackExchange Answer
- StackExchange Comment
- StackExchange Question
- Medium Post
- `meta`: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., `score` and `num_comments` for keys that have the same meaning/information across multiple sources.
This is a sample item from Reddit:
```JSON
{
"id": "ab8auf",
"body": "Technical Debt Explained (x-post r/Eve)",
"created_utc": 1546271789,
"author": "totally_100_human",
"source": "Reddit Submission",
"meta": {
"title": "Technical Debt Explained (x-post r/Eve)",
"score": 1,
"num_comments": 0,
"url": "http://jestertrek.com/eve/technical-debt-2.png",
"subreddit": "RCBRedditBot"
}
}
```
## Sample Analyses
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.
### How many items are there for each source?
```
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
```
### How many submissions that mentioned technical debt were posted each month?
```
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
```
### What are the titles of items that link (`meta.url`) to PDF documents?
```
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
```
### Please, I want CSV!
```
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
```
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.
## Limitations and Future updates
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
The main dataset is provided as a 17,264 record tab-separated file named enterprise_projects.txt
with the following 29 fields.
The file cohost_project_details.txt
provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present GHTraffic, a dataset of significant size comprising HTTP transactions extracted from GitHub data (i.e., from the 4 August 2015 GHTorrent issues snapshot) and augmented with synthetic transaction data. This dataset facilitates reproducible research on many aspects of service-oriented computing.
The GHTraffic dataset comprises three different editions: Small (S), Medium (M), and Large (L). The S dataset includes HTTP transaction records created from the google/guava repository. Guava is a popular Java library containing utilities and data structures. The M dataset includes records from the npm/npm project, the de facto standard package manager for JavaScript. The L dataset contains data created by selecting eight repositories containing large and very active projects: twbs/bootstrap, symfony/symfony, docker/docker, Homebrew/homebrew, rust-lang/rust, kubernetes/kubernetes, rails/rails, and angular/angular.js.
We also provide access to the scripts used to generate GHTraffic. Using these scripts, users can modify the configuration properties in the config.properties file in order to create a customised version of GHTraffic datasets for their own use. The readme.md file included in the distribution provides further information on how to build the code and run the scripts.
The GHTraffic scripts can be accessed by downloading the pre-configured VirtualBox image or by cloning the repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Software Engineering has evolved as a field to study not only the many ways software is created but also how it evolves, becomes successful, is effective and efficient in its objectives, satisfies its quality attributes, and much more. Nonetheless, there are still many open issues during its conception, development, and maintenance phases. Especially, understanding how developers collaborate may help in all such phases, but it is also challenging. Luckily, we may now explore a novel angle to deal with such a challenge: studying the social aspects of software development over social networks.
With GitHub becoming the main representative of collaborative software development online tools, there are approaches to assess the follow-network, stargazer-network, and contributors-network. Moreover, having such networks built from real software projects offers support for relevant applications, such as detection of key developers, recommendation of collaboration among developers, detection of developer communities, and analyses of collaboration patterns in agile development.
GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers' interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data up to June 2019. It comprises:
There are two previous versions of GitSED, which were originally built for the following conference papers:
v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.
v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the publicly accessible dataset for the MSR 2020 data showcase paper "On the Shoulders of Giants: A New Dataset for Pull-based Development Research", in which the pull request IDs on both GitHub and GHTorrent have been removed. We removed them because the dataset contains person-related factors (country, affiliation, personality, etc.); without the pull request IDs, users cannot recover this personal information.
Please use the latest version and only use it for research.
If you want to use this information for further research, please see the dataset msr2020_new_pullreq_restricted.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package can be used to replicate the results in the paper. It contains (1) a dataset of 290,255 repositories and (2) Python scripts for training and interpreting models.
We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8 GB of memory, and 100 GB of free storage. We conducted development and executed all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320 GB of memory, and 36 TB of RAID 5 storage.
We use GHTorrent to restore historical states of 290,255 repositories with more than 57 commits, 4 PRs, 1 issue, 1 fork, and 2 stars. The raw data of the repositories is stored in `Replication Package/data/prodata.pkl`, and the feature contributions produced by the LIME model are stored in `Replication Package/data/limeres_m2_k1.pkl`. We sort items by the order in `Replication Package/data/randind.npy`, which can be used to reproduce the same results as in the paper.
`Replication Package/data/X_test_m2_k1.pkl` and `Replication Package/data/y_test_m2_k1.pkl` store the test dataset for the LIME model. You can run `Replication Package/fitdata.py` to get the results in Tables III and IV, run `Replication Package/draw_compare_variable.py` to get Figure 2, and run `Replication Package/allvari_statistics.py` to get Table II. In `Replication Package/Variable_comparison_with_different_parameter.pdf`, we show the LIME results under different parameters. In `Replication Package/sample_pros.csv`, we also provide the list of randomly selected repositories from Section III.B.
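A hedged sketch of loading the raw data and replaying the fixed ordering (we assume `prodata.pkl` unpickles to a pandas DataFrame and `randind.npy` holds integer positions; the actual object types are not documented above):
```python
# Sketch: load the raw repository data and reorder it as in the paper.
# Assumptions: prodata.pkl unpickles to a pandas DataFrame, and randind.npy
# contains integer positions into that frame; adjust if the types differ.
import pickle

import numpy as np

with open("Replication Package/data/prodata.pkl", "rb") as f:
    repositories = pickle.load(f)

order = np.load("Replication Package/data/randind.npy")
repositories_ordered = repositories.iloc[order]
print(repositories_ordered.head())
```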
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains text files of GitHub URLs pointing to hosted git repositories.
These URLs come from mining software repository (MSR) datasets. URLs are built by taking the repository owner's name (OWNER) and the repository's name (REPO) and appending them to https://github.com/. There is one URL per line. URLs have not been tested for their current availability. An example URL format is provided below:
https://github.com/OWNER/REPO
Current URLs are from the following datasets: