This dataset was created by StanfordGSBLibrary on Fri, 10 Dec 2021 18:49:35 GMT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pull Request Review Comments (PRRC) Datasets
Two datasets have been created from the GH Archive website (gharchive.org). The Pull Request Review Comment event was selected from the set of available GitHub event types. The datasets were created for CARA: Chatbot for Automating Repairnator Actions, as part of a master's thesis at KTH, Stockholm.
First, a source dataset was downloaded from GH Archive, covering January 2015 to December 2019. It consists of 37,358,242 PRRCs and is over 12 GB in size. Downloading all the data files and extracting the PRRCs took over 100 hours. From this source dataset, two subsets were derived: a threads dataset and a comments dataset.
Description
Both datasets are stored in the JSON Lines format, as was the source dataset from gharchive.
For PRRC events, the source dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`.
The threads dataset contains the fields `url` and `body`, which hold the same kind of information as described above, except that `body` is the concatenation of all the PRRCs in a pull request thread. The comments dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`, unchanged from the source dataset.
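Since the files are JSON Lines, they can be processed one object at a time. A minimal Python sketch for reading the comments subset (the file name `comments.jsonl` is an assumption; the dataset description does not name the files):
```python
# Minimal sketch: stream one subset a line at a time.
# The file name "comments.jsonl" is an assumption.
import json

with open("comments.jsonl", encoding="utf-8") as f:
    for line in f:
        comment = json.loads(line)
        print(comment["comment_id"], comment["author"], comment["created_at"])
```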
Construction
We used the fastText language-identification model published by Facebook to detect the language of each PRRC; only PRRCs in English were kept. We also removed any PRRC or thread larger than 128 KB.
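A rough sketch of this filtering step is below. The model file `lid.176.bin` is fastText's published language-identification model, and the input/output file names are assumptions; the exact model file and thresholds used by the authors are not stated.
```python
# Sketch of the construction filter: keep English PRRCs no larger than 128 KB.
# "lid.176.bin" and the file names are assumptions, not stated in the source.
import json

import fasttext

MAX_BYTES = 128 * 1024
model = fasttext.load_model("lid.176.bin")

def keep(body: str) -> bool:
    if len(body.encode("utf-8")) > MAX_BYTES:
        return False
    # fastText predicts one line at a time, so strip newlines first.
    labels, _ = model.predict(body.replace("\n", " "))
    return labels[0] == "__label__en"

with open("comments.jsonl", encoding="utf-8") as src, \
     open("comments.en.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if keep(json.loads(line)["body"]):
            dst.write(line)
```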
This workflow was created by seyeonk on Wed, 15 Dec 2021 00:34:12 GMT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
## Data collection and processing
The dataset is mainly collected from existing datasets. We used data from:
- the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
- the archive of Hacker News on Google's BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
- the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
- the [GHTorrent](http://ghtorrent.org) project
- the [GH Archive](https://www.gharchive.org)
The dataset currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently include data only from 2015-01-01 onward.
We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
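As an illustration, the same pattern in Python (the dataset description does not state matching flags; case-insensitive matching is an assumption):
```python
# Sketch: the mention-detection regex above, applied case-insensitively
# (the case-insensitivity flag is an assumption, not stated in the source).
import re

TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

for text in [
    "We need to pay down our Technical Debt.",   # matches
    "The tech-debt backlog keeps growing.",      # matches
    "TD is too short and too ambiguous.",        # no match by design
]:
    print(bool(TD_PATTERN.search(text)), repr(text))
```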
## Data Format
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys:
- `id`: the id used in the original source. We use the URL path to identify Medium posts.
- `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
- `created_utc`: the time the item was posted in seconds since epoch in UTC.
- `author`: the author of the item. We use the username or userid from the source.
- `source`: where the item was posted. Valid sources are:
  - HackerNews Comment
  - HackerNews Job
  - HackerNews Submission
  - Reddit Comment
  - Reddit Submission
  - StackExchange Answer
  - StackExchange Comment
  - StackExchange Question
  - Medium Post
- `meta`: additional information about the item, specific to the source. This includes, for example, the subreddit a Reddit submission or comment was posted to, its score, and so on. We try to use the same key names, e.g., `score` and `num_comments`, for keys that carry the same meaning across multiple sources.
This is a sample item from Reddit:
```JSON
{
  "id": "ab8auf",
  "body": "Technical Debt Explained (x-post r/Eve)",
  "created_utc": 1546271789,
  "author": "totally_100_human",
  "source": "Reddit Submission",
  "meta": {
    "title": "Technical Debt Explained (x-post r/Eve)",
    "score": 1,
    "num_comments": 0,
    "url": "http://jestertrek.com/eve/technical-debt-2.png",
    "subreddit": "RCBRedditBot"
  }
}
```
## Sample Analyses
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.
### How many items are there for each source?
```
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
```
### How many submissions that mentioned technical debt were posted each month?
```
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c
```
### What are the titles of items that link (`meta.url`) to PDF documents?
```
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
```
### Please, I want CSV!
```
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
```
Note that you need to specify which keys to include in the CSV, so it is easiest to either ignore the meta information or process each source separately.
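If you want meta fields in the CSV, a short Python sketch may be easier (same file as in the examples above; the chosen columns are only an illustration):
```python
# Sketch: flatten Reddit submissions to CSV, including two meta keys.
import bz2
import csv
import json
import sys

writer = csv.writer(sys.stdout)
writer.writerow(["id", "author", "created_utc", "score", "num_comments", "body"])

with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        if item["source"] != "Reddit Submission":
            continue
        meta = item.get("meta", {})
        writer.writerow([item["id"], item["author"], item["created_utc"],
                         meta.get("score"), meta.get("num_comments"), item["body"]])
```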
Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.
## Limitations and future updates
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Commits effectués par des personnes avec des adresses en *.gouv.fr sur GitHub’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/5cab15d28b4c415b457e8629 on 17 January 2022.
--- Dataset description provided by original source is as follows ---
A list of commits published on GitHub whose authors provided an email address ending in *.gouv.fr.
The data schema is available online in Table Schema format.
A blog post describes this initiative.
--- Original source retains full ownership of the source dataset ---
The table repoDetails is part of the dataset GHArchive, available at https://stanford.redivis.com/datasets/3frr-0829sqaf7. It contains 3,834,562,753 rows across 4 variables.
The table organizations is part of the dataset GHArchive, available at https://stanford.redivis.com/datasets/3frr-0829sqaf7. It contains 1,175,630,161 rows across 6 variables.
The table actors is part of the dataset GHArchive, available at https://stanfordgsb.redivis.com/datasets/3frr-0829sqaf7. It contains 3,839,081,663 rows across 6 variables.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A list of commits published on GitHub whose authors provided an email address ending in *.gouv.fr.
The data schema is available online in Table Schema format. A blog post describes this initiative.
Licence Ouverte / Open Licence: https://www.etalab.gouv.fr/licence-ouverte-open-licence