Facebook
TwitterAttribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.
Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.
Fork this kernel to get started with this dataset.
Dataset Source: https://archive.org/download/stackexchange
https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow
https://cloud.google.com/bigquery/public-data/stackoverflow
Banner Photo by Caspar Rubin from Unplash.
What is the percentage of questions that have been answered over the years?
What is the reputation and badge count of users across different tenures on StackOverflow?
What are 10 of the “easier” gold badges to earn?
Which day of the week has most questions answered within an hour?
Facebook
TwitterAttribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.
SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.
This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.
The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:
These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.
Facebook
TwitterStackExchange Dataset
Working doc: https://docs.google.com/document/d/1h585bH5sYcQW4pkHzqWyQqA4ape2Bq6o1Cya0TkMOQc/edit?usp=sharing
BigQuery query (see so_bigquery.ipynb): CREATE TEMP TABLE answers AS SELECT * FROM bigquery-public-data.stackoverflow.posts_answers WHERE LOWER(Body) LIKE '%arxiv%';
CREATE TEMPORARY TABLE questions AS SELECT * FROM bigquery-public-data.stackoverflow.posts_questions;
SELECT * FROM answers JOIN questions ON questions.id = answers.parent_id;
NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ag2435/stackexchange.
Facebook
TwitterStack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Beside possible licensing issues, maintenance issues may arise, because the snippets evolve on SO, but the developers who copied the code are not aware of these changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we build SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content on the level of whole posts and individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
About
The dataset components produced by this repo. Please see the documentation there for more information.
Each CSV has been individually zipped so that you only have to download the specific file(s) that you want.
Overview of Files
From using the Stack Exchange Data Dump as the data source (these zip files have a DD_ prefix):
Raw dataset before processing: saved_dataset.csv (DD_saved_dataset.zip)
Completed tag count: tag_count.csv (DD_tag_count.zip)
Processed dataset with completed evaluations: dataset_results.csv (DD_dataset_results.zip)
From using Google BigQuery as the data source (these zip files have a BQ_ prefix):
Raw dataset before processing: saved_dataset.csv (BQ_saved_dataset.zip)
Completed tag count: tag_count.csv (BQ_tag_count.zip)
No large-scale evaluation was completed when using BigQuery as a data source.
As noted in the linked repo, the use of Google BigQuery as a data source is not recommended for this work, but the working code and dataset have nonetheless been provided for completeness.
License
This dataset is licensed under the CC BY-SA 4.0 license, the same license used by the Stack Exchange Data Dump.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Facebook
TwitterAttribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.
Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.
Fork this kernel to get started with this dataset.
Dataset Source: https://archive.org/download/stackexchange
https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow
https://cloud.google.com/bigquery/public-data/stackoverflow
Banner Photo by Caspar Rubin from Unplash.
What is the percentage of questions that have been answered over the years?
What is the reputation and badge count of users across different tenures on StackOverflow?
What are 10 of the “easier” gold badges to earn?
Which day of the week has most questions answered within an hour?