Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a pull from GHTorrent, converted to the Feather format. It was used in https://github.com/UBC-MDS/RStudio-GitHub-Analysis.
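As a minimal sketch of working with the Feather format (the actual file name inside the dataset is not listed here; `ghtorrent_pull.feather` below is a placeholder), the data can be loaded with pandas:
```python
# Minimal sketch: load a Feather file from this dataset with pandas (requires pyarrow).
# "ghtorrent_pull.feather" is a placeholder name; substitute the actual file.
import pandas as pd

df = pd.read_feather("ghtorrent_pull.feather")
print(df.shape)
print(df.head())
```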
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A hypergraph dataset mined from the GHTorrent project is presented. The dataset contains two files:
1. project_members.txt: Contains GitHub projects with at least 2 contributors and the corresponding contributors (as a hyperedge). The format of the data is:
2. num_followers.txt: Contains all GitHub users and their number of followers.
The artifact also contains the SQL queries used to obtain the data from GHTorrent (schema).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset for the paper: G. Gousios, M.-A. Storey, and A. Bacchelli, “Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective,” in Proceedings of the 38th International Conference on Software Engineering, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.
We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:
To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.
We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:
In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.
This dataset contains the following files:
tr_projects_sample_filtered_2.csv
A CSV file with information about the 113 selected projects.
tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
Two CSV files with information about all commits to the default branch before and after the first CI build, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (a brief loading example follows the column list):
project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
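A minimal loading sketch, assuming pandas is available and the files contain a header row matching the columns documented above (adjust the path to wherever the CSV files were extracted):
```python
# Minimal sketch: load one of the commit CSV files and summarize per-project activity.
# Assumes a header row with the columns documented above; adjust the path as needed.
import pandas as pd

commits = pd.read_csv("tr_sample_commits_default_branch_before_ci.csv")

summary = (
    commits.groupby("project")
    .agg(commit_count=("hash_value", "count"),
         lines_added=("lines_added", "sum"),
         lines_deleted=("lines_deleted", "sum"))
    .sort_values("commit_count", ascending=False)
)
print(summary.head())
```
The same pattern applies to the `during_ci` file.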
tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
Two CSV files with information about all merges into the default branch before and after the first CI build, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (a brief example follows the column list):
project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).
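A brief, hedged example of using the `pull_request_id` column (again assuming pandas and a header row matching the columns above) to estimate how many merges reference a GitHub pull request:
```python
# Sketch: estimate how many merges into the default branch came from pull requests.
# Assumes a header row matching the documented columns; pull_request_id is empty
# when no pull request could be extracted from the merge log message.
import pandas as pd

merges = pd.read_csv("tr_sample_merges_default_branch_during_ci.csv")

pr_merges = merges["pull_request_id"].notna() & (merges["pull_request_id"].astype(str) != "")
print(f"{pr_merges.sum()} of {len(merges)} merges reference a pull request "
      f"({pr_merges.mean():.1%}).")
```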
comparison_project_sample_800.csv
A CSV file with information about the 800 projects in the comparison sample.
commits_default_branch_before_mid.csv
commits_default_branch_after_mid.csv
Two CSV files with information about all commits to the default branch before and after the median date of the commit history, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commit tables described above.
merges_default_branch_before_mid.csv
merges_default_branch_after_mid.csv
Two CSV files with information about all merges into the default branch before and after the median date of the commit history, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for the top programming languages on GitHub, which gives 90 projects and their forks. For each project, we retrieved all data, including issues, pull requests, organizations, followers, stars, and labels (milestones and events not included). The dataset was constructed from scratch to ensure it contains the latest information.
More information at http://openscience.us/repo/msr/msr14.html.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This package contains the raw open data for the study:
Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Roberto Tonelli. 2020. How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. Under Review.
The dataset is based on GHTorrent dataset:
Georgios Gousios. 2013. The GHTorent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR ’13). IEEE Press, 233–236
And released with the same license (CC BY-SA 4.0).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information about 327,436 potential BPMN artifacts identified in all public GitHub repositories referenced in the GHTorrent dump from March 2021.
The data file is in line-delimited JSON format, with each row containing an array with the following six elements:
To get a list of retrievable URLs, use e.g. the following Python one-liner:
python3 -c 'import json; import sys; print(*[f"https://raw.githubusercontent.com/{u}/{r}/{b}/{f}" for _, u, r, b, f, _ in map(json.loads, sys.stdin)], sep="\n")' < bpmn-artifacts.jsonl > urls.txt
(using the hashes to filter out duplicates first is recommended though)
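A sketch of such a deduplication step (our assumption: the final element of each six-element array is the content hash, i.e. one of the two positions the one-liner above ignores; adjust the index if the actual element order differs):
```python
# Sketch: emit one raw.githubusercontent.com URL per unique content hash.
# Assumption: the last element of each JSON array is the file's content hash;
# adjust the unpacking below if the actual element order differs.
import json
import sys

seen = set()
for line in sys.stdin:
    _, user, repo, branch, path, digest = json.loads(line)
    if digest in seen:
        continue
    seen.add(digest)
    print(f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}")
```
It can be run as, e.g., `python3 dedup_urls.py < bpmn-artifacts.jsonl > urls.txt`, where `dedup_urls.py` is a hypothetical name for the script above.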
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the replication package for our paper on dependency smells.
Here is a short description of what is contained in this package:
Code
This folder contains the code used for extracting, parsing, and analyzing the smells in the dataset, along with the statistical analyses. The "parser.py" file parses project information (such as package.json files) and loads it into the databases. The "analyzer.py" file is responsible for the majority of the empirical analyses.
Datasets
This folder contains the intermediate datasets created and used in our analyses. The "smelldataset.db" file contains all smelly and clean dependencies for the latest snapshot. The "smell_counts.csv" file contains smell statistics for the projects in our dataset. The "changehistory.db" file contains the historical smell statistics over a period of 5 years. The code also requires the GHTorrent dataset, available at: https://ghtorrent.org/downloads.html.
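The database schemas are not documented here; a small exploratory sketch (assuming the .db files are SQLite databases, which the extension suggests but which is not stated above) to list their tables:
```python
# Exploratory sketch: list the tables inside one of the .db files.
# Assumes the file is an SQLite database; the schema itself is not documented here.
import sqlite3

with sqlite3.connect("smelldataset.db") as conn:
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()

for (name,) in tables:
    print(name)
```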
Survey Questionnaires and Responses
These two folders contain the full set of questions that we asked the developers in our surveys along with the responses for survey 2.
Tool
This is the published tool, which is also available at: https://github.com/abbasjavan/DependencySniffer
Visualization Scripts
This folder contains the scripts used to create the figures for the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 70,427 cross-linked Twitter-GHTorrent user pairs identified as likely belonging to the same users. The dataset accompanies our research paper (PDF preprint here):
@inproceedings{fang2020tweet,
author = {Fang, Hongbo and Klug, Daniel and Lamba, Hemank and Herbsleb, James and Vasilescu, Bogdan},
title = {Need for Tweet: How Open Source Developers Talk About Their GitHub Work on Twitter},
booktitle = {International Conference on Mining Software Repositories (MSR)},
year = {2020},
pages = {to appear},
publisher = {ACM},
}
The data cannot be used for any purpose other than conducting research.
Due to privacy concerns, we only release the user IDs in Twitter and GHTorrent, respectively. We expect that users of this dataset will be able to collect other data using the Twitter API and GHTorrent, as needed. Please see below for an example.
To query the Twitter API for a given user_id, you can:
Apply for a Twitter developer account.
Create an APP with your Twitter developer account, and create an "API key" and "API secret key".
Obtain an access token. Given the previous API keys, run:
curl -u "
The response looks like this: {"token_type":"bearer","access_token":"<...>"}
Copy the "access_token".
Given the previous access token, run:
curl --request GET --url "https://api.twitter.com/1.1/users/show.json?user_id=<user_id>" --header "Authorization: Bearer <access_token>"
The GHTorrent user ids map to the users table in the MySQL version of GHTorrent. To use GHTorrent, please follow instructions on the GHTorrent website.
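For the GHTorrent side, a hedged sketch of resolving a user ID against a local MySQL restore of GHTorrent (host, credentials, and database name are placeholders; the `users` table is assumed to expose at least `id` and `login`):
```python
# Sketch: look up a GHTorrent user id in a local MySQL restore of GHTorrent.
# Host, credentials, database name, and the example id are placeholders.
import pymysql

connection = pymysql.connect(host="localhost", user="ghtorrentuser",
                             password="ghtorrentpassword", database="ghtorrent")
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, login FROM users WHERE id = %s", (9001,))
        print(cursor.fetchone())
finally:
    connection.close()
```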
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is collected from the GitHub API and the GHTorrent dataset. A brief description of each folder is provided below:
1. "Dataset_and_Code" folder
Contains the final dataset and algorithms
2. "Test_parameters" folders
Includes datasets under different parameters and the corresponding reproduction code, which corresponds to the first experiment of RQ1
3. "Compare_baseline"folder
Includes the dataset used by our method, the dataset used by the baseline method, and the reproduction code, corresponding to the second experiment of RQ1
4. "PLS" folder
Includes the dataset used by PLS and the corresponding reproduction code, which corresponds to experiment of RQ2
5. " Indicator_Calculation " folder
It contains the calculation methods for various metrics in the paper, as well as the corresponding key files.
6. " Appendix " folder
It includes supplementary materials such as the methodology for metric calculations to address the reviewers' questions.
Note
We have provided a corresponding README file in each folder to help others reproduce our results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cross-Platform Software Developer Expertise Learning by Norbert Eke
This data set is part of my Master's thesis project on developer expertise learning by mining Stack Overflow (SOTorrent) and GitHub (GHTorrent) data. Check out my portfolio website at norberte.github.io.
GNU General Public License v3.0 (GPL-3.0): https://www.gnu.org/licenses/gpl-3.0.html
Replication package for the paper O. Kuhejda and B. Rossi, "Pull Requests Acceptance: A Study Across Programming Languages", accepted at the 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA'23).
Content
Projects.zip: JSON files containing the data mined from GitHub and GHTorrent for the analysis.
Scripts.zip: source code used for the data mining process, running of linters, and classification analysis (Python/R). For instructions and prerequisites, please refer to the READMEs: scripts/README.org and scripts/git-contrast/README.org. Please note that the script pr_classification.py is a modified version of the file created by Lenarduzzi et al. under the CC BY 4.0 license. The original file is available at https://figshare.com/s/d47b6f238b5c92430dd7?file=14949029
*_projects.png: tables with descriptive statistics of the projects analyzed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
## Data collection and processing
The dataset is mainly collected from existing datasets. We used data from:
- the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
- the archive of Hacker News on Google BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
- the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
- the [GHTorrent](http://ghtorrent.org) project
- the [GH Archive](https://www.gharchive.org)
The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
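As a small illustration (Python is our choice here; whether the original matching was case-insensitive is our assumption, not stated above), the same pattern can be applied to free text:

```python
# Illustration: match variations of "technical debt" / "tech debt" with the regex stated above.
# Case-insensitive matching is an assumption on our side.
import re

TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

samples = [
    "We need to pay down our tech-debt before the next release.",
    "Technical debt explained",
    "TD is piling up",  # deliberately not matched: short forms are excluded
]
for text in samples:
    print(bool(TD_PATTERN.search(text)), "-", text)
```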
## Data Format
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
- `id`: the id used in the original source. We use the URL path to identify Medium posts.
- `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
- `created_utc`: the time the item was posted in seconds since epoch in UTC.
- `author`: the author of the item. We use the username or userid from the source.
- `source`: where the item was posted. Valid sources are:
- HackerNews Comment
- HackerNews Job
- HackerNews Submission
- Reddit Comment
- Reddit Submission
- StackExchange Answer
- StackExchange Comment
- StackExchange Question
- Medium Post
- `meta`: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., `score` and `num_comments` for keys that have the same meaning/information across multiple sources.
This is a sample item from Reddit:
```JSON
{
"id": "ab8auf",
"body": "Technical Debt Explained (x-post r/Eve)",
"created_utc": 1546271789,
"author": "totally_100_human",
"source": "Reddit Submission",
"meta": {
"title": "Technical Debt Explained (x-post r/Eve)",
"score": 1,
"num_comments": 0,
"url": "http://jestertrek.com/eve/technical-debt-2.png",
"subreddit": "RCBRedditBot"
}
}
```
## Sample Analyses
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.
### How many items are there for each source?
```
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
```
### How many submissions that mentioned technical debt were posted each month?
```
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
```
### What are the titles of items that link (`meta.url`) to PDF documents?
```
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
```
### Please, I want CSV!
```
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
```
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.
## Limitations and Future updates
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
The main dataset is provided as a 17,264 record tab-separated file named enterprise_projects.txt
with the following 29 fields.
The file cohost_project_details.txt
provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present GHTraffic, a dataset of significant size comprising HTTP transactions extracted from GitHub data (i.e., from the 4 August 2015 GHTorrent issues snapshot) and augmented with synthetic transaction data. This dataset facilitates reproducible research on many aspects of service-oriented computing.
The GHTraffic dataset comprises three different editions: Small (S), Medium (M), and Large (L). The S dataset includes HTTP transaction records created from the google/guava repository. Guava is a popular Java library containing utilities and data structures. The M dataset includes records from the npm/npm project, the de facto standard package manager for JavaScript. The L dataset contains data created by selecting eight repositories containing large and very active projects: twbs/bootstrap, symfony/symfony, docker/docker, Homebrew/homebrew, rust-lang/rust, kubernetes/kubernetes, rails/rails, and angular/angular.js.
We also provide access to the scripts used to generate GHTraffic. Using these scripts, users can modify the configuration properties in the config.properties file in order to create a customised version of GHTraffic datasets for their own use. The readme.md file included in the distribution provides further information on how to build the code and run the scripts.
The GHTraffic scripts can be accessed by downloading the pre-configured VirtualBox image or by cloning the repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Software Engineering has evolved as a field to study not only the many ways software is created but also how it evolves, becomes successful, is effective and efficient in its objectives, satisfies its quality attributes, and much more. Nonetheless, there are still many open issues during its conception, development, and maintenance phases. Especially, understanding how developers collaborate may help in all such phases, but it is also challenging. Luckily, we may now explore a novel angle to deal with such a challenge: studying the social aspects of software development over social networks.
With GitHub becoming the main representative of collaborative software development online tools, there are approaches to assess the follow-network, stargazer-network, and contributors-network. Moreover, having such networks built from real software projects offers support for relevant applications, such as detection of key developers, recommendation of collaboration among developers, detection of developer communities, and analyses of collaboration patterns in agile development.
GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers' interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data up to June 2019. It comprises:
There are two previous versions of GitSED, which were originally built for the following conference papers:
v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.
v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the publicly accessible dataset for the MSR 2020 data showcase paper "On the Shoulders of Giants: A New Dataset for Pull-based Development Research", in which the pull request IDs on both GitHub and GHTorrent have been removed. We removed them because the dataset contains person-related factors (country, affiliation, personality, etc.); without the pull request IDs, users cannot recover this personal information.
Please use the latest version and only use it for research.
If you want to use this information for further research, please see the dataset msr2020_new_pullreq_restricted.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package can be used to replicate the results in the paper. It contains (1) a dataset of 290,255 repositories and (2) Python scripts for training and interpreting models.
We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8 GB of memory, and 100 GB of free storage. We conducted development and executed all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320 GB of memory, and 36 TB of RAID 5 storage.
We use GHTorrent to restore historical states of 290,255 repositories with more than 57 commits, 4 PRs, 1 issue, 1 fork, and 2 stars. The raw data of the repositories is stored in `Replication Package/data/prodata.pkl`, and the feature contributions produced by the LIME model are stored in `Replication Package/data/limeres_m2_k1.pkl`. We sort items by the order in `Replication Package/data/randind.npy`, which can be used to reproduce the same results as in the paper.
`Replication Package/data/X_test_m2_k1.pkl` and `Replication Package/data/y_test_m2_k1.pkl` store the test dataset for the LIME model. You can run `Replication Package/fitdata.py` to get the results in Tables III and IV, run `Replication Package/draw_compare_variable.py` to get Figure 2, and run `Replication Package/allvari_statistics.py` to get Table II. In `Replication Package/Variable_comparison_with_different_parameter.pdf`, we show the LIME results under different parameters. In `Replication Package/sample_pros.csv`, we also provide the list of randomly selected repositories from Section III.B.
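A hedged sketch of loading the raw data and replaying the fixed ordering (we assume `prodata.pkl` unpickles to a pandas DataFrame and `randind.npy` holds integer positions; the actual object types are not documented above):
```python
# Sketch: load the raw repository data and reorder it as in the paper.
# Assumptions: prodata.pkl unpickles to a pandas DataFrame, and randind.npy
# contains integer positions into that frame; adjust if the types differ.
import pickle

import numpy as np

with open("Replication Package/data/prodata.pkl", "rb") as f:
    repositories = pickle.load(f)

order = np.load("Replication Package/data/randind.npy")
repositories_ordered = repositories.iloc[order]
print(repositories_ordered.head())
```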
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains text files of GitHub URLs pointing to hosted git repositories.
These URLs come from mining software repository (MSR) datasets. URLs are built by taking the repository owner's name (OWNER) and the repository's name (REPO) and appending them to https://github.com/. There is one URL per line. URLs have not been tested for their current availability. An example URL format is provided below:
https://github.com/OWNER/REPO
Current URLs are from the following datasets: