This dataset was created by StanfordGSBLibrary on Fri, 10 Dec 2021 18:49:35 GMT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pull Request Review Comments (PRRC) Datasets
Two datasets have been created from the GH Archive website (gharchive.org). The Pull Request Review Comment event was selected from the set of available GitHub event types. The datasets were created for CARA: Chatbot for Automating Repairnator Actions, as part of a master's thesis at KTH, Stockholm.
First, a source dataset was downloaded from GH Archive, covering January 2015 to December 2019. It consists of 37,358,242 PRRCs and is over 12 GB in size. Downloading all the data files and extracting the PRRCs took over 100 hours. From this source dataset, two subsets were derived: a threads dataset and a comments dataset.
Description
Both datasets are stored in the JSON Lines format, as was the source dataset from gharchive.
For PRRC events, the source dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`.
The threads dataset contains the fields `url` and `body`, which hold the same kind of information as described above, except that `body` is the concatenation of all the PRRCs in a pull request thread. The comments dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`, unchanged from the source dataset.
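Since the files are JSON Lines, they can be processed one object at a time. A minimal Python sketch for reading the comments subset (the file name `comments.jsonl` is an assumption; the dataset description does not name the files):
```python
# Minimal sketch: stream one subset a line at a time.
# The file name "comments.jsonl" is an assumption.
import json

with open("comments.jsonl", encoding="utf-8") as f:
    for line in f:
        comment = json.loads(line)
        print(comment["comment_id"], comment["author"], comment["created_at"])
```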
Construction
We used the fastText language-identification model published by Facebook to detect the language of each PRRC; only PRRCs in English were kept. We also removed any PRRC or thread larger than 128 KB.
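A rough sketch of this filtering step is below. The model file `lid.176.bin` is fastText's published language-identification model, and the input/output file names are assumptions; the exact model file and thresholds used by the authors are not stated.
```python
# Sketch of the construction filter: keep English PRRCs no larger than 128 KB.
# "lid.176.bin" and the file names are assumptions, not stated in the source.
import json

import fasttext

MAX_BYTES = 128 * 1024
model = fasttext.load_model("lid.176.bin")

def keep(body: str) -> bool:
    if len(body.encode("utf-8")) > MAX_BYTES:
        return False
    # fastText predicts one line at a time, so strip newlines first.
    labels, _ = model.predict(body.replace("\n", " "))
    return labels[0] == "__label__en"

with open("comments.jsonl", encoding="utf-8") as src, \
     open("comments.en.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if keep(json.loads(line)["body"]):
            dst.write(line)
```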
This workflow was created by seyeonk on Wed, 15 Dec 2021 00:34:12 GMT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
## Data collection and processing
The dataset is mainly collected from existing datasets. We used data from:
- the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
- the archive of Hacker News on Google's BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
- the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
- the [GHTorrent](http://ghtorrent.org) project
- the [GH Archive](https://www.gharchive.org)
The dataset currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently include data only from 2015-01-01 onward.
We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
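As an illustration, the same pattern in Python (the dataset description does not state matching flags; case-insensitive matching is an assumption):
```python
# Sketch: the mention-detection regex above, applied case-insensitively
# (the case-insensitivity flag is an assumption, not stated in the source).
import re

TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

for text in [
    "We need to pay down our Technical Debt.",   # matches
    "The tech-debt backlog keeps growing.",      # matches
    "TD is too short and too ambiguous.",        # no match by design
]:
    print(bool(TD_PATTERN.search(text)), repr(text))
```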
## Data Format
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys:
- `id`: the id used in the original source. We use the URL path to identify Medium posts.
- `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
- `created_utc`: the time the item was posted in seconds since epoch in UTC.
- `author`: the author of the item. We use the username or userid from the source.
- `source`: where the item was posted. Valid sources are:
  - HackerNews Comment
  - HackerNews Job
  - HackerNews Submission
  - Reddit Comment
  - Reddit Submission
  - StackExchange Answer
  - StackExchange Comment
  - StackExchange Question
  - Medium Post
- `meta`: additional information about the item, specific to the source. This includes, for example, the subreddit a Reddit submission or comment was posted to, its score, and so on. We try to use the same key names, e.g., `score` and `num_comments`, for keys that carry the same meaning across multiple sources.
This is a sample item from Reddit:
```JSON
{
  "id": "ab8auf",
  "body": "Technical Debt Explained (x-post r/Eve)",
  "created_utc": 1546271789,
  "author": "totally_100_human",
  "source": "Reddit Submission",
  "meta": {
    "title": "Technical Debt Explained (x-post r/Eve)",
    "score": 1,
    "num_comments": 0,
    "url": "http://jestertrek.com/eve/technical-debt-2.png",
    "subreddit": "RCBRedditBot"
  }
}
```
## Sample Analyses
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.
### How many items are there for each source?
```
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
```
### How many submissions that mentioned technical debt were posted each month?
```
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c
```
### What are the titles of items that link (`meta.url`) to PDF documents?
```
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
```
### Please, I want CSV!
```
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
```
Note that you need to specify which keys to include in the CSV, so it is easiest to either ignore the meta information or process each source separately.
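If you want meta fields in the CSV, a short Python sketch may be easier (same file as in the examples above; the chosen columns are only an illustration):
```python
# Sketch: flatten Reddit submissions to CSV, including two meta keys.
import bz2
import csv
import json
import sys

writer = csv.writer(sys.stdout)
writer.writerow(["id", "author", "created_utc", "score", "num_comments", "body"])

with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        if item["source"] != "Reddit Submission":
            continue
        meta = item.get("meta", {})
        writer.writerow([item["id"], item["author"], item["created_utc"],
                         meta.get("score"), meta.get("num_comments"), item["body"]])
```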
Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.
## Limitations and future updates
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Commits effectués par des personnes avec des adresses en *.gouv.fr sur GitHub’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/5cab15d28b4c415b457e8629 on 17 January 2022.
--- Dataset description provided by original source is as follows ---
A list of commits published on GitHub whose authors provided an email address ending in *.gouv.fr.
The data schema is available online in Table Schema format.
A blog post describes this initiative.
--- Original source retains full ownership of the source dataset ---
The table repoDetails is part of the dataset GHArchive, available at https://stanford.redivis.com/datasets/3frr-0829sqaf7. It contains 3,834,562,753 rows across 4 variables.
The table organizations is part of the dataset GHArchive, available at https://stanford.redivis.com/datasets/3frr-0829sqaf7. It contains 1,175,630,161 rows across 6 variables.
The table actors is part of the dataset GHArchive, available at https://stanfordgsb.redivis.com/datasets/3frr-0829sqaf7. It contains 3,839,081,663 rows across 6 variables.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A list of commits published on GitHub whose authors provided an email address ending in *.gouv.fr.
The data schema is available online in Table Schema format. A blog post describes this initiative.
Licence Ouverte / Open Licence: https://www.etalab.gouv.fr/licence-ouverte-open-licence