12 datasets found
  1. GHArchive

    • redivis.com
    application/jsonl +2
    Updated Feb 28, 2022
    Cite
    Stanford Graduate School of Business Library (2022). GHArchive [Dataset]. https://redivis.com/datasets/3frr-0829sqaf7
    Available download formats
    stata, sas, application/jsonl
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Graduate School of Business Library
    Description

    This dataset was created by StanfordGSBLibrary on Fri, 10 Dec 2021 18:49:35 GMT.

  2. GHArchive

    • redivis.com
    Updated Dec 14, 2021
    Cite
    (2021). GHArchive [Dataset]. https://redivis.com/workflows/y17e-6g6v5trwe
    Dataset updated
    Dec 14, 2021
    Description

    This dataset was created on Fri, 10 Dec 2021 18:49:35 GMT.

  3. Pull Request Review Comments Dataset

    • zenodo.org
    application/gzip, bin
    Updated Apr 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Akshay Sinha (2025). Pull Request Review Comments Dataset [Dataset]. http://doi.org/10.5281/zenodo.4773068
    Available download formats
    application/gzip, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Akshay Sinha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pull Request Review Comments (PRRC) Datasets

    Two datasets have been created from the gharchive website. The Pull Request Review Comment Event was selected from the set of available GitHub events. This dataset has been created for CARA: Chatbot for Automating Repairnator Actions as part of a master's thesis at KTH, Stockholm.

    First, a source dataset was downloaded from gharchive. It ranges from January 2015 to December 2019, consists of 37,358,242 PRRCs, and is over 12 gigabytes in size. Downloading all the data files and extracting the PRRCs from them took over 100 hours. From this source dataset, two subsets were derived:

    1. Pull Request Review Comments Dataset: This is the dataset of the comments from the first 100,000 threads in the source dataset from gharchive.
    2. Pull Request Review Threads Dataset: This is the dataset of comments that were concatenated together if they were from the same thread.

    Description

    The dataset is stored in the JSONLines format, as was the source dataset from gharchive.

    For PRRC events, the source dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`.

    • `comment_id` is the field which specifies the ID GitHub uses for that comment.
    • `commit_id` is the field which specifies the ID of the commit proposed in the pull request.
    • `url` is the field which specifies the url to the comment in a pull request thread.
    • `author` is the field which lists the username of the author of the pull request.
    • `created_at` is the field which specifies the time at which the pull request comment was created.
    • `body` is the field which describes the contents of the PRRC.

    The threads dataset contains the fields `url` and `body` which contain similar information as described above. However, the body field differs: it is a concatenation of all the PRRCs in a pull request thread. The comments dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`. They are the same fields from the initial dataset.

    Construction

    We used the fasttext model published by Facebook to detect the language of the PRRC. Only those PRRCs in English were preserved. We also removed any PRRC or thread whose size exceeded 128 Kilobytes.
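The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `filter_prrcs`, the `is_english` callback (a stand-in for the fastText language check), and the reading of "128 Kilobytes" as 128 × 1024 bytes of UTF-8 text are all assumptions.

```python
import json
from typing import Callable, Iterable, Iterator

# Assumption: "128 Kilobytes" is read as 128 * 1024 bytes of UTF-8 text.
MAX_BODY_BYTES = 128 * 1024


def filter_prrcs(
    lines: Iterable[str],
    is_english: Callable[[str], bool],
) -> Iterator[dict]:
    """Yield JSON Lines records that pass the size and language filters.

    `is_english` is a hypothetical callback standing in for the fastText
    language-identification step mentioned in the description.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        body = record.get("body", "")
        if len(body.encode("utf-8")) > MAX_BODY_BYTES:
            continue  # drop oversized comments or threads
        if not is_english(body):
            continue  # keep English text only
        yield record
```

Passing the language check in as a callback keeps the sketch independent of any particular model file or library version.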

  4. GHArchive copy

    • redivis.com
    Updated Dec 15, 2021
    Cite
    Seyeon Kim (2021). GHArchive copy [Dataset]. https://redivis.com/workflows/y17e-6g6v5trwe
    Dataset updated
    Dec 15, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Seyeon Kim
    Description

    This workflow was created by seyeonk on Wed, 15 Dec 2021 00:34:12 GMT.

  5. Data from: TDMentions: A Dataset of Technical Debt Mentions in Online Posts

    • zenodo.org
    bin, bz2
    Updated Jan 24, 2020
    Cite
    TDMentions: A Dataset of Technical Debt Mentions in Online Posts [Dataset]. https://zenodo.org/records/2593142
    Available download formats
    bin, bz2
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Morgan Ericsson; Anna Wingkvist
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)

    TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.

    ## Data collection and processing

    The dataset is mainly collected from existing datasets. We used data from:

    - the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
    - the archive of Hacker News available at Google's BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
    - the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
    - the [GHTorrent](http://ghtorrent.org) project, and
    - the [GH Archive](https://www.gharchive.org).

    The dataset currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01 onward.

    We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
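To see which spellings the quoted pattern accepts, here is a small sketch. The pattern is copied from the description; the helper name `mentions_technical_debt` and the case-insensitive flag are assumptions, since the description does not say how case was handled.

```python
import re

# Pattern quoted from the dataset description.
# Assumption: matching is case-insensitive (not stated in the description).
TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)


def mentions_technical_debt(text: str) -> bool:
    """Return True if the text contains a technical-debt mention."""
    return TD_PATTERN.search(text) is not None


if __name__ == "__main__":
    for sample in ["Technical Debt Explained", "tech-debt", "techdebt", "TD is growing"]:
        print(sample, "->", mentions_technical_debt(sample))
```

The separator class `[\s\-_]*?` allows any run of whitespace, hyphens, or underscores between the two words, including none, so `techdebt` matches while the bare abbreviation `TD` does not.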

    ## Data Format

    The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.

    - `id`: the id used in the original source. We use the URL path to identify Medium posts.
    - `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
    - `created_utc`: the time the item was posted in seconds since epoch in UTC.
    - `author`: the author of the item. We use the username or userid from the source.
    - `source`: where the item was posted. Valid sources are:
      - HackerNews Comment
      - HackerNews Job
      - HackerNews Submission
      - Reddit Comment
      - Reddit Submission
      - StackExchange Answer
      - StackExchange Comment
      - StackExchange Question
      - Medium Post
    - `meta`: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., `score` and `num_comments` for keys that have the same meaning/information across multiple sources.

    This is a sample item from Reddit:

    ```JSON
    {
      "id": "ab8auf",
      "body": "Technical Debt Explained (x-post r/Eve)",
      "created_utc": 1546271789,
      "author": "totally_100_human",
      "source": "Reddit Submission",
      "meta": {
        "title": "Technical Debt Explained (x-post r/Eve)",
        "score": 1,
        "num_comments": 0,
        "url": "http://jestertrek.com/eve/technical-debt-2.png",
        "subreddit": "RCBRedditBot"
      }
    }
    ```
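For readers who would rather load the file from Python than from the command line, a minimal reader might look like the sketch below. The function name `read_mentions` is hypothetical, and the sketch assumes the file is UTF-8 encoded with one JSON object per line, as the format description states.

```python
import bz2
import json
from typing import Iterator


def read_mentions(path: str) -> Iterator[dict]:
    """Yield one mention (a dict) per non-empty line of a bzip2 JSON Lines file.

    Assumption: the file is UTF-8 text with one JSON object per line.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Because the function is a generator, the file is decompressed and parsed lazily, so the whole dataset never has to fit in memory at once.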

    ## Sample Analyses

    We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.

    ### How many items are there for each source?

    ```
    lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
    ```
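The same per-source count can be sketched in Python. This is an illustrative equivalent rather than part of the original analyses; the function name `count_by_source` is an assumption, and it expects an iterable of JSON lines such as an open text file.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_source(lines: Iterable[str]) -> Counter:
    """Count items per `source` value in an iterable of JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():  # ignore blank lines
            counts[json.loads(line)["source"]] += 1
    return counts
```

Calling `count_by_source(...).most_common()` lists the sources from most to least frequent.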

    ### How many submissions that mentioned technical debt were posted each month?

    ```
    lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
    ```
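A Python sketch of the same monthly bucketing follows. Like the `jq` command above, it restricts the count to Reddit submissions by default; the function name `submissions_per_month` and its `source` parameter are assumptions introduced for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def submissions_per_month(
    lines: Iterable[str],
    source: str = "Reddit Submission",
) -> Counter:
    """Count items of one source per calendar month (UTC), keyed as YYYY-MM."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        item = json.loads(line)
        if item["source"] != source:
            continue
        # created_utc is seconds since the epoch, interpreted in UTC.
        posted = datetime.fromtimestamp(item["created_utc"], tz=timezone.utc)
        counts[posted.strftime("%Y-%m")] += 1
    return counts
```

Converting with an explicit UTC timezone avoids month boundaries shifting with the local timezone of the machine running the analysis.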

    ### What are the titles of items that link (`meta.url`) to PDF documents?

    ```
    lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
    ```

    ### Please, I want CSV!

    ```
    lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
    ```

    Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
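A comparable CSV export can be sketched with Python's standard `csv` module. The function name `mentions_to_csv` and the default key list are assumptions for illustration; note that `csv.writer` quotes fields only when needed, so the output is not byte-identical to what `jq`'s `@csv` produces.

```python
import csv
import io
import json
from typing import Iterable, Sequence


def mentions_to_csv(
    lines: Iterable[str],
    keys: Sequence[str] = ("id", "body", "author"),
) -> str:
    """Render the selected top-level keys of each JSON line as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        item = json.loads(line)
        # Missing keys become empty cells rather than raising an error.
        writer.writerow([item.get(key, "") for key in keys])
    return buffer.getvalue()
```

As with the `jq` version, the keys must be listed explicitly, which is why the nested `meta` object is easiest to handle per source.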

    Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.

    ## Limitations and Future Updates

    The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.

  6. ‘Commits effectués par des personnes avec des adresses en *.gouv.fr sur...

    • analyst-2.ai
    Updated Jan 17, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Commits effectués par des personnes avec des adresses en *.gouv.fr sur GitHub’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-europa-eu-commits-effectues-par-des-personnes-avec-des-adresses-en-gouv-fr-sur-github-d1b1/e5039c48/?iid=002-153&v=presentation
    Dataset updated
    Jan 17, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France
    Description

    Analysis of ‘Commits effectués par des personnes avec des adresses en *.gouv.fr sur GitHub’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/5cab15d28b4c415b457e8629 on 17 January 2022.

    --- Dataset description provided by original source is as follows ---

    A list of commits published on GitHub whose author provided an e-mail address in *.gouv.fr.

    The data schema, in Table Schema format, is available online.

    A blog post discusses this initiative.

    --- Original source retains full ownership of the source dataset ---

  7. repoDetails

    • redivis.com
    Updated Dec 10, 2021
    + more versions
    Cite
    Stanford Graduate School of Business Library (2021). repoDetails [Dataset]. https://redivis.com/datasets/3frr-0829sqaf7
    Dataset updated
    Dec 10, 2021
    Dataset authored and provided by
    Stanford Graduate School of Business Library
    Description

    The table repoDetails is part of the dataset GHArchive, available at https://stanford.redivis.com/datasets/3frr-0829sqaf7. It contains 3834562753 rows across 4 variables.

  8. organizations

    • redivis.com
    Updated Dec 10, 2021
    + more versions
    Cite
    Stanford Graduate School of Business Library (2021). organizations [Dataset]. https://redivis.com/datasets/3frr-0829sqaf7
    Dataset updated
    Dec 10, 2021
    Dataset authored and provided by
    Stanford Graduate School of Business Library
    Description

    The table organizations is part of the dataset GHArchive, available at https://stanford.redivis.com/datasets/3frr-0829sqaf7. It contains 1175630161 rows across 6 variables.

  9. actors

    • redivis.com
    Updated Dec 10, 2021
    + more versions
    Cite
    Stanford Graduate School of Business Library (2021). actors [Dataset]. https://redivis.com/datasets/3frr-0829sqaf7
    Dataset updated
    Dec 10, 2021
    Dataset authored and provided by
    Stanford Graduate School of Business Library
    Description

    The table actors is part of the dataset GHArchive, available at https://stanfordgsb.redivis.com/datasets/3frr-0829sqaf7. It contains 3839081663 rows across 6 variables.

  10. Commits effectués par des personnes avec des adresses en *.gouv.fr sur...

    • gimi9.com
    Cite
    Commits effectués par des personnes avec des adresses en *.gouv.fr sur GitHub [Dataset]. https://gimi9.com/dataset/fr_5cab15d28b4c415b457e8629/
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A list of commits published on GitHub whose author provided an e-mail address in *.gouv.fr. The data schema, in Table Schema format, is available online. A blog post discusses this initiative.

  11. Compromisos realizados por personas con direcciones en *.gouv.fr en GitHub

    • data.europa.eu
    csv/utf8
    Cite
    Antoine Augusti, Compromisos realizados por personas con direcciones en *.gouv.fr en GitHub [Dataset]. https://data.europa.eu/data/datasets/5cab15d28b4c415b457e8629?locale=es
    Available download formats
    csv/utf8
    Dataset authored and provided by
    Antoine Augusti
    License

    https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Description

    A list of commits published on GitHub whose author provided an e-mail address in ‘*.gouv.fr’.

    The data schema, in Table Schema format, is available online.

    A blog post discusses this initiative.

  12. Sitoumukset, joita tekevät henkilöt, joiden osoite on *.gouv.fr GitHubissa

    • data.europa.eu
    csv/utf8
    Updated Dec 3, 2024
    Cite
    Antoine Augusti (2024). Sitoumukset, joita tekevät henkilöt, joiden osoite on *.gouv.fr GitHubissa [Dataset]. https://data.europa.eu/data/datasets/5cab15d28b4c415b457e8629?locale=fi
    Available download formats
    csv/utf8
    Dataset updated
    Dec 3, 2024
    Dataset authored and provided by
    Antoine Augusti
    License

    https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Description

    A list of commits published on GitHub whose author provided an e-mail address in ”*.gouv.fr”.

    The data schema, in Table Schema format, is available online.

    A blog post discusses this initiative.

Cite
Stanford Graduate School of Business Library (2022). GHArchive [Dataset]. https://redivis.com/datasets/3frr-0829sqaf7
GHArchive

217 scholarly articles cite this dataset (View in Google Scholar)
