6 datasets found
  1. Stack Overflow Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stack Overflow (2019). Stack Overflow Data [Dataset]. https://www.kaggle.com/datasets/stackoverflow/stackoverflow
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    Stack Overflowhttp://stackoverflow.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Context

    Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.

    Content

    Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    Dataset Source: https://archive.org/download/stackexchange

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow

    https://cloud.google.com/bigquery/public-data/stackoverflow

    Banner Photo by Caspar Rubin from Unplash.

    Inspiration

    What is the percentage of questions that have been answered over the years?

    What is the reputation and badge count of users across different tenures on StackOverflow?

    What are 10 of the “easier” gold badges to earn?

    Which day of the week has most questions answered within an hour?

  2. Data from: Stack Overflow

    • console.cloud.google.com
    Updated Aug 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Stack%20Exchange&hl=id (2024). Stack Overflow [Dataset]. https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow?hl=id
    Explore at:
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Googlehttp://google.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

  3. SOTorrent 2018-12-09

    • kaggle.com
    zip
    Updated Dec 18, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SOTorrent (2018). SOTorrent 2018-12-09 [Dataset]. https://www.kaggle.com/datasets/sotorrent/2018-12-09
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Dec 18, 2018
    Dataset authored and provided by
    SOTorrent
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Please notice

    Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.

    Context

    Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.

    Content

    SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.

    This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.

    Inspiration

    The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:

    • How are code snippets on Stack Overflow maintained?
    • How many clones of code snippets exist inside Stack Overflow?
    • How can we detect buggy versions of Stack Overflow code snippets and find them in GitHub projects?
    • How frequently are code snippets copied from external sources into Stack Overflow and then co-evolve there?
    • How do snippets copied from Stack Overflow to GitHub co-evolve?
    • Does the evolution of Stack Overflow code snippets follow patterns?
    • Do these patterns differ between programming languages?
    • Are the licenses of external sources compatible with Stack Overflow’s license (CC BY-SA 3.0)?
    • How many code blocks on Stack Overflow do not contain source code (and are only used for markup)?
    • Can we reliably predict bug-fixing edits to code on Stack Overflow?
    • Can we reliably predict popularity of Stack Overflow code snippets on GitHub?

    These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.

  4. h

    stackexchange

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Albert Gong, stackexchange [Dataset]. https://huggingface.co/datasets/ag2435/stackexchange
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Albert Gong
    Description

    StackExchange Dataset

    Working doc: https://docs.google.com/document/d/1h585bH5sYcQW4pkHzqWyQqA4ape2Bq6o1Cya0TkMOQc/edit?usp=sharing

    BigQuery query (see so_bigquery.ipynb): CREATE TEMP TABLE answers AS SELECT * FROM bigquery-public-data.stackoverflow.posts_answers WHERE LOWER(Body) LIKE '%arxiv%';

    CREATE TEMPORARY TABLE questions AS SELECT * FROM bigquery-public-data.stackoverflow.posts_questions;

    SELECT * FROM answers JOIN questions ON questions.id = answers.parent_id;

    NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ag2435/stackexchange.

  5. Z

    SOTorrent Data Set 2017-07-25

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sebastian Baltes (2020). SOTorrent Data Set 2017-07-25 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_834571
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    University of Trier, Germany
    Authors
    Sebastian Baltes
    Description

    Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Beside possible licensing issues, maintenance issues may arise, because the snippets evolve on SO, but the developers who copied the code are not aware of these changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we build SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content on the level of whole posts and individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.

  6. Z

    GPT vs Stack Overflow: data collection (A2I2 T2 2023)

    • data.niaid.nih.gov
    Updated Oct 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Heath, Mark (2023). GPT vs Stack Overflow: data collection (A2I2 T2 2023) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8403467
    Explore at:
    Dataset updated
    Oct 6, 2023
    Dataset provided by
    Deakin University
    Authors
    Heath, Mark
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    About

    The dataset components produced by this repo. Please see the documentation there for more information.

    Each CSV has been individually zipped so that you only have to download the specific file(s) that you want.

    Overview of Files

    From using the Stack Exchange Data Dump as the data source (these zip files have a DD_ prefix):

    Raw dataset before processing: saved_dataset.csv (DD_saved_dataset.zip)

    Completed tag count: tag_count.csv (DD_tag_count.zip)

    Processed dataset with completed evaluations: dataset_results.csv (DD_dataset_results.zip)

    From using Google BigQuery as the data source (these zip files have a BQ_ prefix):

    Raw dataset before processing: saved_dataset.csv (BQ_saved_dataset.zip)

    Completed tag count: tag_count.csv (BQ_tag_count.zip)

    No large-scale evaluation was completed when using BigQuery as a data source.

    As noted in the linked repo, the use of Google BigQuery as a data source is not recommended for this work, but the working code and dataset have nonetheless been provided for completeness.

    License

    This dataset is licensed under the CC BY-SA 4.0 license, the same license used by the Stack Exchange Data Dump.

  7. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Stack Overflow (2019). Stack Overflow Data [Dataset]. https://www.kaggle.com/datasets/stackoverflow/stackoverflow
Organization logo

Stack Overflow Data

Stack Overflow Data (BigQuery Dataset)

Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset authored and provided by
Stack Overflowhttp://stackoverflow.com/
License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

Context

Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.

Content

Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: https://archive.org/download/stackexchange

https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow

https://cloud.google.com/bigquery/public-data/stackoverflow

Banner Photo by Caspar Rubin from Unplash.

Inspiration

What is the percentage of questions that have been answered over the years?

What is the reputation and badge count of users across different tenures on StackOverflow?

What are 10 of the “easier” gold badges to earn?

Which day of the week has most questions answered within an hour?

Search
Clear search
Close search
Google apps
Main menu