19 datasets found
  1. GHTorrent Project Commits Dataset

    • figshare.com
    bin
    Updated Jun 25, 2019
    Cite
    Rayce Rossum (2019). GHTorrent Project Commits Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.8321285.v1
    Available download formats: bin
    Dataset updated
    Jun 25, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rayce Rossum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a pull from GHTorrent, converted to the Feather format. It was used in https://github.com/UBC-MDS/RStudio-GitHub-Analysis.

  2. ghtorrent-projects Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jul 17, 2021
    Cite
    Marios Papachristou (2021). ghtorrent-projects Dataset [Dataset]. http://doi.org/10.5281/zenodo.5111043
    Available download formats: txt, bin
    Dataset updated
    Jul 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marios Papachristou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A hypergraph dataset mined from the GHTorrent project is presented. The dataset contains two files:

    1. project_members.txt: Contains GitHub projects with at least 2 contributors and the corresponding contributors (as a hyperedge). The format of the data is:

    2. num_followers.txt: Contains all GitHub users and their number of followers.

    The artifact also contains the SQL queries used to obtain the data from GHTorrent (schema).
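    A minimal sketch of reading the hyperedges, assuming each line of project_members.txt holds whitespace-separated integer IDs with the project first and its contributors after; the exact format is specified in the artifact itself, so treat this layout as an assumption:

    ```python
    # Sketch: read hyperedges from project_members.txt.
    # ASSUMPTION: each line is "project_id member_id member_id ..."; the real
    # format may differ from this.
    def read_hyperedges(lines):
        """Map each project ID to the set of its contributor IDs."""
        hyperedges = {}
        for line in lines:
            fields = line.split()
            if len(fields) < 3:  # projects have at least 2 contributors
                continue
            project, members = int(fields[0]), {int(f) for f in fields[1:]}
            hyperedges[project] = members
        return hyperedges

    sample = ["1 10 20 30", "2 10 40"]
    print(read_hyperedges(sample))
    ```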

  3. Pull request contributors analysis dataset

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jan 21, 2020
    Cite
    Georgios Gousios; Margaret-Anne Storey; Alberto Bacchelli (2020). Pull request contributors analysis dataset [Dataset]. http://doi.org/10.5281/zenodo.46063
    Available download formats: zip
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Georgios Gousios; Margaret-Anne Storey; Alberto Bacchelli
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset for the paper: G. Gousios, M.-A. Storey, and A. Bacchelli, “Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective,” in Proceedings of the 38th International Conference on Software Engineering, 2016.

  4. (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    Cite
    Sebastian Baltes; Jascha Knack (2020). (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset [Dataset]. http://doi.org/10.5281/zenodo.1291582
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Baltes; Jascha Knack
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    1. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
    2. were active for at least one year (365 days) before the first build with Travis CI (before_ci),
    3. used Travis CI at least for one year (during_ci),
    4. had commit or merge activity on the default branch in both of these phases, and
    5. used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent BigQuery data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

    We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:

    1. have Java or Ruby as their project language
    2. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)
    3. have commit activity for at least two years (730 days)
    4. are engineered software projects (at least 10 watchers)
    5. were not in the TravisTorrent dataset

    In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.
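    The median-split step described above can be sketched as follows; the commit dates here are fabricated, and the authors' actual scripts (see the linked repository) remain authoritative:

    ```python
    # Sketch: split a project's commit history at the median commit date and
    # keep the project only if both halves are non-empty. Dates are invented.
    from datetime import date
    from statistics import median

    def split_at_median(commit_dates):
        ordinals = sorted(d.toordinal() for d in commit_dates)
        mid = date.fromordinal(int(median(ordinals)))
        before = [d for d in commit_dates if d < mid]
        after = [d for d in commit_dates if d >= mid]
        return mid, before, after

    dates = [date(2015, m, 1) for m in range(1, 13)]
    mid, before, after = split_at_median(dates)
    print(mid, len(before), len(after))
    ```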

    This dataset contains the following files:

    tr_projects_sample_filtered_2.csv
    A CSV file with information about the 113 selected projects.

    tr_sample_commits_default_branch_before_ci.csv
    tr_sample_commits_default_branch_during_ci.csv

    One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The branch to which the commit was made.
    hash_value: The SHA1 hash value of the commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
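    As a quick illustration of working with these commit tables, a hedged sketch assuming the CSV files carry a header row; the two records below are invented, not real dataset rows:

    ```python
    import csv, io

    # Minimal sketch of reading one of the commit CSV files and summing
    # lines_added. The sample below uses the column names listed above.
    sample = """project,branch,hash_value,author_name,author_email,author_date,commit_name,commit_email,commit_date,log_message_length,file_count,lines_added,lines_deleted,file_extensions
    owner_repo,master,abc123,Alice,a@example.com,1500000000,Alice,a@example.com,1500000000,42,2,10,3,java
    owner_repo,master,def456,Bob,b@example.com,1500000500,Bob,b@example.com,1500000500,17,1,5,0,rb
    """
    rows = list(csv.DictReader(io.StringIO(sample.replace("    ", ""))))
    total_added = sum(int(r["lines_added"]) for r in rows)
    print(len(rows), total_added)
    ```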

    tr_sample_merges_default_branch_before_ci.csv
    tr_sample_merges_default_branch_during_ci.csv

    One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The destination branch of the merge.
    hash_value: The SHA1 hash value of the merge commit.
    merged_commits: Unique hash value prefixes of the commits merged with this commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
    pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    source_branch : Source branch of the pull request (extracted from log message).

    comparison_project_sample_800.csv
    A CSV file with information about the 800 projects in the comparison sample.

    commits_default_branch_before_mid.csv
    commits_default_branch_after_mid.csv

    One CSV file with information about all commits to the default branch before and after the median date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.

    merges_default_branch_before_mid.csv
    merges_default_branch_after_mid.csv

    One CSV file with information about all merges into the default branch before and after the median date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.

  5. msr14

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Olga Baysal (2020). msr14 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_268528
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Olga Baysal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for each of the top programming languages on GitHub, which gives 90 projects and their forks. For each project, we retrieved all data, including issues, pull requests, organizations, followers, stars, and labels (milestones and events not included). The dataset was constructed from scratch to ensure it contains the latest information.

    More information at http://openscience.us/repo/msr/msr14.html.

  6. Dataset - How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2020
    Cite
    Marco Ortu (2020). Dataset - How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3825043
    Dataset updated
    May 13, 2020
    Dataset provided by
    Giuseppe Destefanis
    Marco Tonelli
    Daniel Graziotin
    Michele Marchesi
    Marco Ortu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This package contains the raw open data for the study

    Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Roberto Tonelli. 2020. How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. Under Review.

    The dataset is based on GHTorrent dataset:

    Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR '13). IEEE Press, 233–236.

    And released with the same license (CC BY-SA 4.0).

  7. Github BPMN Artifacts Dataset 2021

    • zenodo.org
    • explore.openaire.eu
    bin
    Updated Jan 26, 2022
    Cite
    Jasmin Türker; Michael Völske; Thomas Heinze (2022). Github BPMN Artifacts Dataset 2021 [Dataset]. http://doi.org/10.5281/zenodo.5903352
    Available download formats: bin
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jasmin Türker; Michael Völske; Thomas Heinze
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information about 327,436 potential BPMN artifacts identified in all public GitHub repositories referenced in the GHTorrent dump from March 2021.

    The data file is in line-delimited JSON format, with each row containing an array with the following six elements:

    1. GHTorrent project ID
    2. GitHub user name
    3. GitHub repository name
    4. GitHub branch name
    5. Path to file inside repository
    6. SHA1 hash of the file's contents

    To get a list of retrievable URLs, use e.g. the following Python one-liner:

    python3 -c 'import json; import sys; print(*[f"https://raw.githubusercontent.com/{u}/{r}/{b}/{f}" for _, u, r, b, f, _ in map(json.loads, sys.stdin)], sep="\n")' < bpmn-artifacts.jsonl > urls.txt

    (Using the SHA1 hashes to filter out duplicates first is recommended, though.)
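    A hedged sketch of that hash-based deduplication, using the six-element array layout described above; the sample rows are invented:

    ```python
    import json

    # Keep only the first artifact per SHA1 hash (element 6 of each array),
    # then build raw.githubusercontent.com URLs as in the one-liner above.
    def unique_urls(jsonl_lines):
        seen = set()
        urls = []
        for line in jsonl_lines:
            _pid, user, repo, branch, path, sha1 = json.loads(line)
            if sha1 in seen:
                continue
            seen.add(sha1)
            urls.append(f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}")
        return urls

    sample = [
        '[1, "alice", "proc", "master", "order.bpmn", "aaa"]',
        '[2, "bob", "flows", "main", "copy.bpmn", "aaa"]',
    ]
    print(unique_urls(sample))
    ```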

  8. Data from: Dependency Smells in JavaScript Projects

    • zenodo.org
    zip
    Updated May 4, 2021
    Cite
    Abbas Javan Jafari; Diego Elias Costa; Rabe Abdalkareem; Emad Shihab; Nikolaos Tsantalis (2021). Dependency Smells in JavaScript Projects [Dataset]. http://doi.org/10.5281/zenodo.4735566
    Available download formats: zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abbas Javan Jafari; Diego Elias Costa; Rabe Abdalkareem; Emad Shihab; Nikolaos Tsantalis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for our paper on dependency smells.

    Here is a short description of what is contained in this package:

    Code

    This folder contains the code used for extracting, parsing, and analyzing the smells in the dataset, along with the statistical analyses. The "parser.py" file parses the project information (such as package.json) and loads it into the databases. The "analyzer.py" file is responsible for the majority of the empirical analyses.

    Datasets

    This folder contains the intermediate datasets created and used in our analyses. The "smelldataset.db" file contains all smelly and clean dependencies for the latest snapshot. The "smell_counts.csv" file contains smell statistics for the projects in our dataset. The "changehistory.db" file contains the historical smell statistics for a period of 5 years. The code also requires the GHTorrent dataset, available at: https://ghtorrent.org/downloads.html.

    Survey Questionnaires and Responses

    These two folders contain the full set of questions that we asked the developers in our surveys along with the responses for survey 2.

    Tool

    This is the published tool which is also available at: https://github.com/abbasjavan/DependencySniffer

    Visualization Scripts

    This folder contains the scripts used to create the figures for the paper.

  9. CMUSTRUDEL/need-for-tweet-data: Initial release

    • zenodo.org
    zip
    Updated Mar 16, 2020
    Cite
    Hongbo Fang; Bogdan Vasilescu (2020). CMUSTRUDEL/need-for-tweet-data: Initial release [Dataset]. http://doi.org/10.5281/zenodo.3711630
    Available download formats: zip
    Dataset updated
    Mar 16, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hongbo Fang; Bogdan Vasilescu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 70,427 cross-linked Twitter-GHTorrent user pairs identified as likely belonging to the same users. The dataset accompanies our research paper:

    @inproceedings{fang2020tweet,
     author = {Fang, Hongbo and Klug, Daniel and Lamba, Hemank and Herbsleb, James and Vasilescu, Bogdan},
     title = {Need for Tweet: How Open Source Developers Talk About Their GitHub Work on Twitter},
     booktitle = {International Conference on Mining Software Repositories (MSR)},
     year = {2020},
     pages = {to appear},
     publisher = {ACM},
    }

    The data cannot be used for any purpose other than conducting research.

    Due to privacy concerns, we only release the user IDs in Twitter and GHTorrent, respectively. We expect that users of this dataset will be able to collect other data using the Twitter API and GHTorrent, as needed. Please see below for an example.

    To query the Twitter API for a given user_id, you can:

    • Apply for a Twitter developer account.

    • Create an APP with your Twitter developer account, and create an "API key" and "API secret key".

    • Obtain an access token. Given the previous API keys, run:

      curl -u "<API key>:<API secret key>" --data "grant_type=client_credentials" "https://api.twitter.com/oauth2/token"

      The response looks like this: {"token_type":"bearer","access_token":"<...>"}

      Copy the "access_token".

    • Given the previous access token, run:

      curl --request GET --url "https://api.twitter.com/1.1/users/show.json?user_id=<user_id>" --header "Authorization: Bearer <access_token>"

    The GHTorrent user ids map to the users table in the MySQL version of GHTorrent. To use GHTorrent, please follow instructions on the GHTorrent website.
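    A sketch of the GHTorrent side of the lookup, assuming the users table exposes (id, login) columns as in the published GHTorrent schema; an in-memory SQLite table stands in for the MySQL database here:

    ```python
    import sqlite3

    # Stand-in for the GHTorrent MySQL "users" table.
    # ASSUMPTION: the table has (id, login) columns, per the GHTorrent schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT)")
    conn.execute("INSERT INTO users VALUES (42, 'octocat')")

    def lookup_login(conn, ghtorrent_user_id):
        """Return the GitHub login for a GHTorrent user id, or None."""
        row = conn.execute(
            "SELECT login FROM users WHERE id = ?", (ghtorrent_user_id,)
        ).fetchone()
        return row[0] if row else None

    print(lookup_login(conn, 42))
    ```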

  10. Replication package of the Paper "On the Relationships between the Initial Ecology Indicators of OSS Projects and Their Long-Term Popularity: An Exploratory Study on GitHub"

    • zenodo.org
    zip
    Updated Dec 11, 2024
    Cite
    Anonymous (2024). Replication package of the Paper "On the Relationships between the Initial Ecology Indicators of OSS Projects and Their Long-Term Popularity: An Exploratory Study on GitHub" [Dataset]. http://doi.org/10.5281/zenodo.14393491
    Available download formats: zip
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected from the GitHub API and the GHTorrent dataset. A brief description of each folder is provided below:

    1. "Dataset_and_Code" folder

    Contains the final dataset and algorithms

    2. "Test_parameters" folders

    Includes datasets under different parameters and the corresponding reproduction code, which corresponds to the first experiment of RQ1

    3. "Compare_baseline" folder

    Includes the dataset used by our method, the dataset used by the baseline method, and the reproduction code, corresponding to the second experiment of RQ1

    4. "PLS" folder

    Includes the dataset used by PLS and the corresponding reproduction code, which corresponds to the experiment of RQ2

    5. "Indicator_Calculation" folder

    Contains the calculation methods for the various metrics in the paper, as well as the corresponding key files.

    6. "Appendix" folder

    Includes supplementary materials, such as the methodology for metric calculations, to address the reviewers' questions.

    Note

    We have provided a corresponding README file in each folder to help others reproduce our results.

  11. Software Developer Expertise GitHub and Stack Overflow data sets

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, html, txt
    Updated Apr 24, 2025
    Cite
    Norbert Eke; Olga Baysal (2025). Software Developer Expertise GitHub and Stack Overflow data sets [Dataset]. http://doi.org/10.5281/zenodo.3696079
    Available download formats: csv, html, bin, txt
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Norbert Eke; Olga Baysal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cross-Platform Software Developer Expertise Learning by Norbert Eke

    This data set is part of my Master's thesis project on developer expertise learning by mining Stack Overflow (SOTorrent) and GitHub (GHTorrent) data. Check out my portfolio website at norberte.github.io.

  12. Pull Requests Acceptance Across Progamming Languages

    • figshare.com
    zip
    Updated Jul 10, 2023
    Cite
    Ondrej Kuhejda; Bruno Rossi (2023). Pull Requests Acceptance Across Progamming Languages [Dataset]. http://doi.org/10.6084/m9.figshare.20299275.v1
    Available download formats: zip
    Dataset updated
    Jul 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ondrej Kuhejda; Bruno Rossi
    License

    GNU General Public License 3.0: https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Replication package for the paper O. Kuhejda and B. Rossi, "Pull Requests Acceptance: A Study Across Programming Languages", accepted at the 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA'23).

    Contents:

    • Projects.zip: JSON files containing the data mined from GitHub and GHTorrent for the analysis.
    • Scripts.zip: source code used for the data mining process, the running of linters, and the classification analysis (Python/R). For instructions and prerequisites, please refer to the READMEs: scripts/README.org and scripts/git-contrast/README.org. Please note that the script pr_classification.py is a modified version of the file created by Lenarduzzi et al. under the CC BY 4.0 license. The original file is available at https://figshare.com/s/d47b6f238b5c92430dd7?file=14949029
    • *_projects.png: tables with descriptive statistics of the projects analyzed.

  13. Data from: TDMentions: A Dataset of Technical Debt Mentions in Online Posts

    • zenodo.org
    • data.niaid.nih.gov
    bin, bz2
    Updated Jan 24, 2020
    Cite
    Morgan Ericsson; Anna Wingkvist (2020). TDMentions: A Dataset of Technical Debt Mentions in Online Posts [Dataset]. http://doi.org/10.5281/zenodo.2593142
    Available download formats: bin, bz2
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Morgan Ericsson; Anna Wingkvist
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)

    TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.

    ## Data collection and processing

    The dataset is mainly collected from existing datasets. We used data from:

    - the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
    - the archive of Hacker News available in Google's BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
    - the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
    - the [GHTorrent](http://ghtorrent.org) project
    - the [GH Archive](https://www.gharchive.org)

    The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.

    We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
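    The matching rule can be exercised directly; case-insensitive matching is an assumption here, since the text does not state which flags were used:

    ```python
    import re

    # The technical-debt regular expression from the text.
    # ASSUMPTION: matching is case-insensitive.
    TD = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

    candidates = ["paying down technical debt",
                  "our tech-debt backlog",
                  "tech_debt sprint",
                  "TD is piling up"]
    hits = [s for s in candidates if TD.search(s)]
    print(hits)
    ```

    As the description notes, the shorter form "TD" is deliberately not matched, which is why the last candidate falls through.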

    ## Data Format

    The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.

    - `id`: the id used in the original source. We use the URL path to identify Medium posts.
    - `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
    - `created_utc`: the time the item was posted in seconds since epoch in UTC.
    - `author`: the author of the item. We use the username or userid from the source.
    - `source`: where the item was posted. Valid sources are:
    - HackerNews Comment
    - HackerNews Job
    - HackerNews Submission
    - Reddit Comment
    - Reddit Submission
    - StackExchange Answer
    - StackExchange Comment
    - StackExchange Question
    - Medium Post
    - `meta`: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., `score` and `num_comments` for keys that have the same meaning/information across multiple sources.

    This is a sample item from Reddit:

    ```JSON
    {
      "id": "ab8auf",
      "body": "Technical Debt Explained (x-post r/Eve)",
      "created_utc": 1546271789,
      "author": "totally_100_human",
      "source": "Reddit Submission",
      "meta": {
        "title": "Technical Debt Explained (x-post r/Eve)",
        "score": 1,
        "num_comments": 0,
        "url": "http://jestertrek.com/eve/technical-debt-2.png",
        "subreddit": "RCBRedditBot"
      }
    }
    ```

    ## Sample Analyses

    We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.

    ### How many items are there for each source?

    ```
    lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
    ```

    ### How many submissions that mentioned technical debt were posted each month?

    ```
    lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
    ```

    ### What are the titles of items that link (`meta.url`) to PDF documents?

    ```
    lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
    ```

    ### Please, I want CSV!

    ```
    lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
    ```

    Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
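    For completeness, the per-source count from the first jq example can also be done with Python's standard library alone; the records below are fabricated stand-ins for postscomments.json.bz2:

    ```python
    import bz2, json
    from collections import Counter

    # Build a tiny bz2-compressed JSON-lines payload standing in for
    # postscomments.json.bz2, then count items per source as jq/uniq -c does.
    records = [
        {"id": "a1", "source": "Reddit Submission"},
        {"id": "a2", "source": "Reddit Comment"},
        {"id": "a3", "source": "Reddit Submission"},
    ]
    payload = bz2.compress("\n".join(json.dumps(r) for r in records).encode())

    counts = Counter(json.loads(line)["source"]
                     for line in bz2.decompress(payload).decode().splitlines())
    print(counts)
    ```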

    Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.

    # Limitations and Future updates

    The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.

  14. Enterprise-Driven Open Source Software

    • zenodo.org
    • opendatalab.com
    • +1more
    application/gzip
    Updated Apr 22, 2020
    Cite
    Diomidis Spinellis; Zoe Kotti; Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas (2020). Enterprise-Driven Open Source Software [Dataset]. http://doi.org/10.5281/zenodo.3742962
    Available download formats: application/gzip
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diomidis Spinellis; Zoe Kotti; Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

    The main dataset is provided as a 17,264 record tab-separated file named enterprise_projects.txt with the following 29 fields.

    • url: the project's GitHub URL
    • project_id: the project's GHTorrent identifier
    • sdtc: true if selected using the same domain top committers heuristic (9,016 records)
    • mcpc: true if selected using the multiple committers from a probable company heuristic (8,314 records)
    • mcve: true if selected using the multiple committers from a valid enterprise heuristic (8,015 records)
    • star_number: number of GitHub watchers
    • commit_count: number of commits
    • files: number of files in current main branch
    • lines: corresponding number of lines in text files
    • pull_requests: number of pull requests
    • github_repo_creation: timestamp of the GitHub repository creation
    • earliest_commit: timestamp of the earliest commit
    • most_recent_commit: date of the most recent commit
    • committer_count: number of different committers
    • author_count: number of different authors
    • dominant_domain: the project's dominant email domain
    • dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain
    • dominant_domain_author_commits: corresponding number for commit authors
    • dominant_domain_committers: number of committers whose email matches the project's dominant domain
    • dominant_domain_authors: corresponding number for commit authors
    • cik: SEC's EDGAR "central index key"
    • fg500: true if this is a Fortune Global 500 company (2,233 records)
    • sec10k: true if the company files SEC 10-K forms (4,180 records)
    • sec20f: true if the company files SEC 20-F forms (429 records)
    • project_name: GitHub project name
    • owner_login: GitHub project's owner login
    • company_name: company name as derived from the SEC and Fortune 500 data
    • owner_company: GitHub project's owner company name
    • license: SPDX license identifier

    The file cohort_project_details.txt provides the full set of 311,223 cohort projects that are not part of the enterprise dataset but have comparable quality attributes.

    • url: the project's GitHub URL
    • project_id: the project's GHTorrent identifier
    • stars: number of GitHub watchers
    • commit_count: number of commits
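    Both files are plain tab-separated text, so they can be read with standard tooling. As a minimal sketch using Python's csv module, with a two-row synthetic sample standing in for the real 29-column file (the values are made up for illustration):

    ```python
    import csv
    import io

    # Synthetic sample with a few of the 29 documented columns of
    # enterprise_projects.txt (values are made up for illustration).
    sample = io.StringIO(
        "url\tproject_id\tsdtc\tcommit_count\tdominant_domain\n"
        "https://github.com/a/x\t1\ttrue\t500\texample.com\n"
        "https://github.com/b/y\t2\tfalse\t120\texample.org\n"
    )
    reader = csv.DictReader(sample, delimiter="\t")

    # Select projects flagged by the same-domain-top-committers heuristic.
    sdtc_urls = [row["url"] for row in reader if row["sdtc"] == "true"]
    print(sdtc_urls)
    ```

    The same pattern applies to cohort_project_details.txt, with its smaller set of columns.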
  15. Data from: GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing

    • zenodo.org
    zip
    Updated Aug 29, 2020
    Thilini Bhagya; Jens Dietrich; Hans Guesgen; Steve Versteeg (2020). GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing [Dataset]. http://doi.org/10.5281/zenodo.1034573
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 29, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thilini Bhagya; Jens Dietrich; Hans Guesgen; Steve Versteeg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present GHTraffic, a dataset of significant size comprising HTTP transactions extracted from GitHub data (i.e., the 04 August 2015 GHTorrent issues snapshot) and augmented with synthetic transaction data. This dataset facilitates reproducible research on many aspects of service-oriented computing.

    The GHTraffic dataset comprises three editions: Small (S), Medium (M), and Large (L). The S edition includes HTTP transaction records created from the google/guava repository; Guava is a popular Java library of utilities and data structures. The M edition includes records from the npm/npm project, the de facto standard package manager for JavaScript. The L edition was created by selecting eight large and very active projects: twbs/bootstrap, symfony/symfony, docker/docker, Homebrew/homebrew, rust-lang/rust, kubernetes/kubernetes, rails/rails, and angular/angular.js.

    We also provide access to the scripts used to generate GHTraffic. Using these scripts, users can modify the configuration properties in the config.properties file to create customised versions of the GHTraffic dataset for their own use. The readme.md file included in the distribution provides further information on how to build the code and run the scripts.
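    The config.properties file follows the usual Java key=value properties format. As a minimal sketch, a Python parser for such a file (the property names shown are hypothetical, not the actual GHTraffic configuration keys):

    ```python
    # Minimal parser for Java-style key=value properties files.
    # The property names in the sample below are hypothetical.
    def load_properties(text):
        props = {}
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines and comments ('#' or '!' in properties files).
            if not line or line.startswith(("#", "!")):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
        return props

    sample = "# example configuration\nedition=S\noutput.dir=/tmp/ghtraffic\n"
    print(load_properties(sample))
    ```

    This covers only the simple key=value subset; the full Java properties format also supports escapes and line continuations.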

    The GHTraffic scripts can be accessed by downloading the pre-configured VirtualBox image or by cloning the repository.

  16. GitSED: GitHub Socially Enhanced Dataset

    • zenodo.org
    xz
    Updated Jul 2, 2021
    Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro (2021). GitSED: GitHub Socially Enhanced Dataset [Dataset]. http://doi.org/10.5281/zenodo.5021329
    Explore at:
    Available download formats: xz
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Engineering has evolved as a field to study not only the many ways software is created but also how it evolves, becomes successful, is effective and efficient in its objectives, satisfies its quality attributes, and much more. Nonetheless, there are still many open issues during its conception, development, and maintenance phases. Especially, understanding how developers collaborate may help in all such phases, but it is also challenging. Luckily, we may now explore a novel angle to deal with such a challenge: studying the social aspects of software development over social networks.

    With GitHub becoming the main representative of collaborative software development online tools, there are approaches to assess the follow-network, stargazer-network, and contributors-network. Moreover, having such networks built from real software projects offers support for relevant applications, such as detection of key developers, recommendation of collaboration among developers, detection of developer communities, and analyses of collaboration patterns in agile development.
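    As an illustration of the contributors-network idea mentioned above, a minimal sketch that links two developers whenever they contributed to the same repository (the data is made up; this is not GitSED's actual construction pipeline):

    ```python
    from collections import defaultdict
    from itertools import combinations

    # Made-up (developer, repository) contribution pairs.
    contributions = [("alice", "r1"), ("bob", "r1"), ("carol", "r2"), ("alice", "r2")]

    # Group developers by the repository they contributed to.
    repo_devs = defaultdict(set)
    for dev, repo in contributions:
        repo_devs[repo].add(dev)

    # Link any two developers who share a repository (an undirected edge).
    edges = set()
    for devs in repo_devs.values():
        edges.update(combinations(sorted(devs), 2))
    print(sorted(edges))
    ```

    Collaboration metrics such as tie strength can then be computed over these edges, e.g. by counting how many repositories each pair shares.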

    GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers’ interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data up to June 2019. It comprises:

    • 8,556,778 repositories
    • 32,411,674 developers
    • 6 programming languages (Assembly, JavaScript, Pascal, Python, Ruby, Visual Basic)
    • 13 collaboration metrics

    There are two previous versions of GitSED, which were originally built for the following conference papers:

    v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.

    v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.

  17. msr2020_new_pullreq_public

    • zenodo.org
    bin, csv
    Updated Jun 30, 2020
    Xunhui Zhang; Ayushi Rastogi; Yue Yu (2020). msr2020_new_pullreq_public [Dataset]. http://doi.org/10.5281/zenodo.3922907
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Jun 30, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xunhui Zhang; Ayushi Rastogi; Yue Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the publicly accessible dataset for the MSR 2020 data showcase paper "On the Shoulders of Giants: A New Dataset for Pull-based Development Research", with the pull request IDs from both GitHub and GHTorrent removed. The dataset contains person-related factors (country, affiliation, personality, etc.), and removing the pull request IDs prevents users from linking this information to individuals.

    Please use the latest version, and use it for research purposes only.

    If you need the pull request identifiers for further research, please see the dataset msr2020_new_pullreq_restricted.

  18. Replication Package for Paper "How Early Participation Determines Long-Term Sustained Activity in GitHub Projects"

    • zenodo.org
    Updated Sep 8, 2022
    anonymous (2022). Replication Package for Paper "How Early Participation Determines Long-Term Sustained Activity in GitHub Projects" [Dataset]. http://doi.org/10.5281/zenodo.7059020
    Explore at:
    Dataset updated
    Sep 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package can be used for replicating results in the paper. It contains 1) a dataset of 290,255 repositories; and 2) Python scripts for training and interpreting models.

    We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8 GB of memory, and 100 GB of free storage. We developed and ran all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320 GB of memory, and 36 TB of RAID 5 storage.

    We use GHTorrent to restore historical states of 290,255 repositories with more than 57 commits, 4 PRs, 1 issue, 1 fork and 2 stars. The raw data of repositories are stored in `Replication Package/data/prodata.pkl`, and the contribution of features resulting from LIME model is stored in `Replication Package/data/limeres_m2_k1.pkl`. We sort items by the order in `Replication Package/data/randind.npy`, which can be used to reproduce the same results as in the paper.
    `Replication Package/data/X_test_m2_k1.pkl` and `Replication Package/data/y_test_m2_k1.pkl` store the test dataset for the LIME model. You can run `Replication Package/fitdata.py` to get the results in Table III and IV, run `Replication Package/draw_compare_variable.py` to get Figure 2 and run `Replication Package/allvari_statistics.py` to get Table II. In `Replication Package/Variable_comparison_with_different_parameter.pdf`, we show the LIME results under different parameters. In `Replication Package/sample_pros.csv`, we also provide the list of randomly selected repositories in Section III.B.
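    The repository selection thresholds quoted above (more than 57 commits, 4 PRs, 1 issue, 1 fork, and 2 stars) can be sketched as a simple filter; the field names here are illustrative, not the replication package's actual schema:

    ```python
    # Illustrative selection filter mirroring the thresholds in the text:
    # more than 57 commits, 4 PRs, 1 issue, 1 fork, and 2 stars.
    def meets_criteria(repo):
        return (repo["commits"] > 57 and repo["pull_requests"] > 4
                and repo["issues"] > 1 and repo["forks"] > 1
                and repo["stars"] > 2)

    # A made-up repository record passing all thresholds.
    sample_repo = {"commits": 100, "pull_requests": 10,
                   "issues": 3, "forks": 2, "stars": 5}
    print(meets_criteria(sample_repo))
    ```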

  19. Extracted MSR GitHub Repository URLs

    • zenodo.org
    • data.niaid.nih.gov
    Updated Oct 21, 2022
    Nicholas Synovic (2022). Extracted MSR GitHub Repository URLs [Dataset]. http://doi.org/10.5281/zenodo.7226299
    Explore at:
    Dataset updated
    Oct 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicholas Synovic
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains text files of GitHub URLs pointing to hosted git repositories.

    These URLs come from mining software repository (MSR) datasets. URLs are built by taking the repository owner's name (OWNER) and the repository's name (REPO) and appending them to https://github.com/. There is one URL per line. The URLs have not been tested for their current availability. An example URL format is provided below:

    https://github.com/OWNER/REPO
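    The URL construction described above can be sketched in a few lines of Python (the owner/repo pairs are illustrative):

    ```python
    # Illustrative (owner, repo) pairs; the real lists come from the MSR datasets.
    pairs = [("torvalds", "linux"), ("gousiosg", "github-mirror")]

    # Append OWNER/REPO to the GitHub base URL, one URL per line.
    urls = ["https://github.com/{}/{}".format(owner, repo) for owner, repo in pairs]
    print("\n".join(urls))
    ```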

    Current URLs are from the following datasets:

    • Libraries.io January 12th, 2020 dataset
      • Jeremy Katz, "Libraries.io Open Source Repository and Dependency Metadata". Zenodo, Jan. 12, 2020. doi: 10.5281/zenodo.3626071.
    • RepoReapers/reaper dataset
    • GH Torrent partial dataset
      • G. Gousios, “The GHTorent dataset and tool suite,” in Proceedings of the 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA, May 2013, pp. 233–236.
