6 datasets found

Stack Overflow Data
kaggle.com
zip
Updated Mar 20, 2019
Cite
Stack Overflow (2019). Stack Overflow Data [Dataset]. https://www.kaggle.com/datasets/stackoverflow/stackoverflow
Dataset authored and provided by
Stack Overflow (http://stackoverflow.com/)
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Context

Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.

Content

Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: https://archive.org/download/stackexchange

https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow

https://cloud.google.com/bigquery/public-data/stackoverflow

Banner Photo by Caspar Rubin from Unsplash.

Inspiration

What is the percentage of questions that have been answered over the years?

What is the reputation and badge count of users across different tenures on Stack Overflow?

What are 10 of the “easier” gold badges to earn?

Which day of the week has the most questions answered within an hour?
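
As a rough illustration of the first question above, the share of answered questions per year can be computed with a simple tally. This is only a sketch on made-up data: the `(year, answer_count)` tuples and the function name `answered_percentage_by_year` are hypothetical stand-ins, not the dataset's actual schema or any official analysis.

```python
from collections import defaultdict


def answered_percentage_by_year(questions):
    """Return {year: percentage of questions with at least one answer}."""
    totals = defaultdict(int)
    answered = defaultdict(int)
    for year, answer_count in questions:
        totals[year] += 1
        if answer_count > 0:
            answered[year] += 1
    return {year: 100.0 * answered[year] / totals[year] for year in sorted(totals)}


# Hypothetical toy records: (creation year, number of answers).
sample = [(2017, 2), (2017, 0), (2018, 1), (2018, 3), (2018, 0), (2018, 1)]
print(answered_percentage_by_year(sample))
```

On the real data the same tally would more likely be expressed as a GROUP BY in BigQuery, since the tables are too large to pull into memory casually.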
Data from: Stack Overflow
console.cloud.google.com
Updated Aug 13, 2024
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Stack%20Exchange&hl=id (2024). Stack Overflow [Dataset]. https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow?hl=id
Dataset provided by
Google (http://google.com/)
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier of 1 TB of processing per month: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
SOTorrent 2018-12-09
kaggle.com
zip
Updated Dec 18, 2018
Cite
SOTorrent (2018). SOTorrent 2018-12-09 [Dataset]. https://www.kaggle.com/datasets/sotorrent/2018-12-09
Dataset authored and provided by
SOTorrent
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Please note

Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.

Context

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.

Content

SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.

This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.

Inspiration

The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:

How are code snippets on Stack Overflow maintained?

How many clones of code snippets exist inside Stack Overflow?

How can we detect buggy versions of Stack Overflow code snippets and find them in GitHub projects?

How frequently are code snippets copied from external sources into Stack Overflow and then co-evolve there?

How do snippets copied from Stack Overflow to GitHub co-evolve?

Does the evolution of Stack Overflow code snippets follow patterns?

Do these patterns differ between programming languages?

Are the licenses of external sources compatible with Stack Overflow’s license (CC BY-SA 3.0)?

How many code blocks on Stack Overflow do not contain source code (and are only used for markup)?

Can we reliably predict bug-fixing edits to code on Stack Overflow?

Can we reliably predict popularity of Stack Overflow code snippets on GitHub?

These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.
stackexchange
huggingface.co
Cite
Albert Gong, stackexchange [Dataset]. https://huggingface.co/datasets/ag2435/stackexchange
Authors
Albert Gong
Description
StackExchange Dataset

Working doc: https://docs.google.com/document/d/1h585bH5sYcQW4pkHzqWyQqA4ape2Bq6o1Cya0TkMOQc/edit?usp=sharing

BigQuery query (see so_bigquery.ipynb):

CREATE TEMP TABLE answers AS
SELECT * FROM bigquery-public-data.stackoverflow.posts_answers
WHERE LOWER(Body) LIKE '%arxiv%';

CREATE TEMPORARY TABLE questions AS
SELECT * FROM bigquery-public-data.stackoverflow.posts_questions;

SELECT * FROM answers JOIN questions ON questions.id = answers.parent_id;
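
The filter-then-join logic of the query above can be tried locally without BigQuery access. The following is a minimal sketch using Python's built-in sqlite3 module on invented toy rows; the two-table layout only loosely mirrors the public tables, the sample titles and bodies are made up, and SQLite is a stand-in engine rather than what the dataset author used.

```python
import sqlite3

# In-memory database with two toy tables (hypothetical, heavily simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts_questions (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE posts_answers (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT);
INSERT INTO posts_questions VALUES
    (1, 'Where is this result published?'),
    (2, 'How do I sort a list?');
INSERT INTO posts_answers VALUES
    (10, 1, 'There is a preprint on arXiv that covers this.'),
    (11, 2, 'Use sorted().');
""")

# Keep only answers whose body mentions "arxiv" (case-insensitive),
# then attach each one to its parent question.
rows = conn.execute("""
    SELECT q.id, q.title, a.id
    FROM posts_answers AS a
    JOIN posts_questions AS q ON q.id = a.parent_id
    WHERE LOWER(a.body) LIKE '%arxiv%'
""").fetchall()
print(rows)
```

Selecting explicit columns instead of `*` avoids the two `id` columns colliding in the joined result.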

NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ag2435/stackexchange.
SOTorrent Data Set 2017-07-25
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Sebastian Baltes (2020). SOTorrent Data Set 2017-07-25 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_834571
Dataset provided by
University of Trier, Germany
Authors
Sebastian Baltes
Description
Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge number of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Besides possible licensing issues, maintenance issues may arise, because the snippets evolve on SO, but the developers who copied the code are not aware of these changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we built SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.
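
To give a feel for step (1), extracting linked URLs from a text block, here is a minimal sketch based on a regular expression. It is not SOTorrent's actual extraction logic: the pattern, the function name `extract_urls`, and the sample text are all assumptions made for illustration.

```python
import re

# Simplified pattern: http(s) URLs up to whitespace or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


def extract_urls(text_block):
    """Return the URLs found in a text block, minus trailing punctuation."""
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text_block)]


block = "See https://example.com/docs, or the repo at http://example.org/repo."
print(extract_urls(block))
```

A production extractor would also need to handle Markdown link syntax and URLs that legitimately contain parentheses, which this sketch ignores.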
GPT vs Stack Overflow: data collection (A2I2 T2 2023)
data.niaid.nih.gov
Updated Oct 6, 2023
Cite
Heath, Mark (2023). GPT vs Stack Overflow: data collection (A2I2 T2 2023) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8403467
Dataset provided by
Deakin University
Authors
Heath, Mark
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
About

This dataset contains the components produced by the linked repo. Please see the documentation there for more information.

Each CSV has been individually zipped so that you only have to download the specific file(s) that you want.

Overview of Files

Using the Stack Exchange Data Dump as the data source (these zip files have a DD_ prefix):

Raw dataset before processing: saved_dataset.csv (DD_saved_dataset.zip)

Completed tag count: tag_count.csv (DD_tag_count.zip)

Processed dataset with completed evaluations: dataset_results.csv (DD_dataset_results.zip)

Using Google BigQuery as the data source (these zip files have a BQ_ prefix):

Raw dataset before processing: saved_dataset.csv (BQ_saved_dataset.zip)

Completed tag count: tag_count.csv (BQ_tag_count.zip)

No large-scale evaluation was completed when using BigQuery as a data source.

As noted in the linked repo, the use of Google BigQuery as a data source is not recommended for this work, but the working code and dataset have nonetheless been provided for completeness.
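
Because each CSV is zipped individually, a single file can be read straight out of its archive without unpacking it to disk. The sketch below assumes each zip holds one CSV under the name listed above; the in-memory archive and its `tag` and `count` columns are invented stand-ins, since the real column names are not given here.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, csv_name):
    """Read one CSV member of a zip archive into a list of dict rows."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a stand-in archive in memory (hypothetical contents and columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("tag_count.csv", "tag,count\npython,3\njava,2\n")
buffer.seek(0)

rows = read_csv_from_zip(buffer, "tag_count.csv")
print(rows)
```

With a downloaded file, `zip_source` would simply be a path such as DD_tag_count.zip instead of the in-memory buffer.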

License

This dataset is licensed under the CC BY-SA 4.0 license, the same license used by the Stack Exchange Data Dump.