100+ datasets found
  1. Stack Overflow Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Stack Overflow (2019). Stack Overflow Data [Dataset]. https://www.kaggle.com/datasets/stackoverflow/stackoverflow
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    Stack Overflow (http://stackoverflow.com/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Context

    Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.

    Content

    Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    Dataset Source: https://archive.org/download/stackexchange

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow

    https://cloud.google.com/bigquery/public-data/stackoverflow

    Banner Photo by Caspar Rubin from Unsplash.

    Inspiration

    What is the percentage of questions that have been answered over the years?

    What is the reputation and badge count of users across different tenures on StackOverflow?

    What are 10 of the “easier” gold badges to earn?

    Which day of the week has most questions answered within an hour?
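    The first question above can be sketched in plain Python. This is a minimal illustration, not the dataset's own tooling: it assumes each question row carries a `creation_date` string in ISO format and an integer `answer_count`, and the sample rows are hypothetical, not real data.

```python
from collections import defaultdict


def answered_share_by_year(rows):
    """Return {year: percent of questions with at least one answer}."""
    totals = defaultdict(int)
    answered = defaultdict(int)
    for row in rows:
        # Assumption: creation_date is an ISO string such as "2018-05-01".
        year = int(row["creation_date"][:4])
        totals[year] += 1
        if row["answer_count"] > 0:
            answered[year] += 1
    return {y: round(100.0 * answered[y] / totals[y], 1) for y in sorted(totals)}


# Hypothetical sample rows for illustration only.
sample = [
    {"creation_date": "2017-03-02", "answer_count": 2},
    {"creation_date": "2017-08-19", "answer_count": 0},
    {"creation_date": "2018-01-05", "answer_count": 1},
]
print(answered_share_by_year(sample))
```

    With real data, the rows would come from a query or export of the questions table rather than an in-memory list.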

  2. stackoverflow-posts

    • huggingface.co
    Updated Jun 14, 2023
    Cite
    Mike (2023). stackoverflow-posts [Dataset]. https://huggingface.co/datasets/mikex86/stackoverflow-posts
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 14, 2023
    Authors
    Mike
    License

    https://choosealicense.com/licenses/other/

    Description

    StackOverflow Posts Markdown

      Dataset Summary
    

    This dataset contains all posts submitted to StackOverflow before the 14th of June 2023 formatted as Markdown text. The dataset contains ~60 Million posts, totaling ~35GB in size and ~65 billion characters of text. The data is sourced from Internet Archive StackExchange Data Dump.

      Dataset Structure
    

    Each record corresponds to one post of a particular type. Original ordering from the data dump is not exactly preserved… See the full description on the dataset page: https://huggingface.co/datasets/mikex86/stackoverflow-posts.

  3. Stack Overflow Annual Developer Survey 2024

    • kaggle.com
    zip
    Updated Aug 10, 2024
    Cite
    Berkay Alan (2024). Stack Overflow Annual Developer Survey 2024 [Dataset]. https://www.kaggle.com/datasets/berkayalan/stack-overflow-annual-developer-survey-2024
    Explore at:
    Available download formats: zip (18374043 bytes)
    Dataset updated
    Aug 10, 2024
    Authors
    Berkay Alan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    In May 2024 over 65,000 developers responded to Stack Overflow's annual survey about coding, working, AI and how they feel about all of those topics and more.

    There were seven sections in this survey. The 2nd, 3rd, 4th, and 5th sections appeared in a random order.

    1. Basic Information
    2. Education, Work, and Career
    3. Technology and Tech Culture
    4. Stack Overflow Usage + Community
    5. Artificial Intelligence
    6. Professional Developer Series (Optional)
    7. Final Questions
  4. stackoverflow-dataset

    • huggingface.co
    Updated Oct 24, 2024
    Cite
    Sunny Bhaveen Chandra (2024). stackoverflow-dataset [Dataset]. https://huggingface.co/datasets/c17hawke/stackoverflow-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 24, 2024
    Authors
    Sunny Bhaveen Chandra
    Description

    c17hawke/stackoverflow-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Stack Overflow 2023 survey dataset

    • kaggle.com
    zip
    Updated Dec 1, 2023
    Cite
    AashiDutt (2023). Stack Overflow 2023 survey dataset [Dataset]. https://www.kaggle.com/datasets/aashidutt3/stack-overflow-2023-survey-dataset
    Explore at:
    Available download formats: zip (21448761 bytes)
    Dataset updated
    Dec 1, 2023
    Authors
    AashiDutt
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    In May 2023 over 90,000 developers responded to the Stack Overflow annual survey about how they learn and level up, which tools they're using, and which ones they want.

    The dataset is a collection of two CSV files and the original survey questionnaire, as described below:

    • survey_results_schema.csv - contains 78 rows and 6 columns covering the basic schema of the survey.
    • survey_results_public.csv - contains 89184 rows and 84 columns covering different questions related to users and tools they want to explore.

    This dataset is suitable for exploration and basic data analysis. Kagglers can also use it to try out some NLP problems.

  6. Data from: stack-overflow

    • huggingface.co
    Updated Oct 4, 2024
    + more versions
    Cite
    TPP-LLM (2024). stack-overflow [Dataset]. https://huggingface.co/datasets/tppllm/stack-overflow
    Explore at:
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    TPP-LLM
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Stack Overflow Dataset

    This dataset contains badge awards earned by users on Stack Overflow between January 1, 2022, and December 31, 2023. It includes 3,336 sequences with 187,836 events and 25 badge types, derived from the Stack Exchange Data Dump under the CC BY-SA 4.0 license. The detailed data preprocessing steps used to create this dataset can be found in the TPP-LLM paper and TPP-Embedding paper. Update (2025-10-28): Added three timestamp fields (timestamp_event… See the full description on the dataset page: https://huggingface.co/datasets/tppllm/stack-overflow.

  7. stackoverflow-chat-dutch

    • huggingface.co
    Updated Jan 24, 2024
    + more versions
    Cite
    Bram Vanroy (2024). stackoverflow-chat-dutch [Dataset]. http://doi.org/10.57967/hf/0529
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 24, 2024
    Authors
    Bram Vanroy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Stack Overflow Chat Dutch

      Dataset Summary
    

    This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset. ☕ Want to help me out? Translating the data with the OpenAI API, and prompt testing, cost me 💸$133.60💸. If you like this dataset, please consider buying… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch.

  8. Replication package for the paper "What do Developers Discuss about Code...

    • data.niaid.nih.gov
    Updated Jun 30, 2021
    Cite
    Anonymous (2021). Replication package for the paper "What do Developers Discuss about Code Comments" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4470125
    Explore at:
    Dataset updated
    Jun 30, 2021
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RP-commenting-practices-multiple-sources

    Replication package for the paper "What do Developers Discuss about Code Comments?"

    Structure

    Appendix.pdf
    Tags-topics.md
    Stack-exchange-query.md
    
    RQ1/
      LDA_input/
        combined-so-quora-mallet-metadata.csv
        topic-input.mallet
    
      LDA_output/
        Mallet/
          output_csv/
            docs-in-topics.csv
            topic-words.csv
            topics-in-docs.csv
            topics-metadata.csv
          output_html/
            all_topics.html
            Docs/
            Topics/
    
    RQ2/
      datasource_rawdata/
        quora.csv
        stackoverflow.csv
      manual_analysis_output/
        stackoverflow_quora_taxonomy.xlsx
    

    Contents of the Replication Package

    • Appendix.pdf - appendix of the paper containing supplementary tables

    • Tags-topics.md - tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2)

    • Stack-exchange-query.md - the query interface used to extract the posts from the Stack Exchange Data Explorer

    • RQ1/ - contains the data used to answer RQ1

      • LDA_input/ - input data used for LDA analysis
        • combined-so-quora-mallet-metadata.csv - Stack Overflow and Quora questions used to perform LDA analysis
        • topic-input.mallet - input file to the MALLET tool
      • LDA_output/
        • Mallet/ - contains the LDA output generated by the MALLET tool
          • output_csv/
            • docs-in-topics.csv - documents per topic
            • topic-words.csv - most relevant topic words
            • topics-in-docs.csv - topic probability per document
            • topics-metadata.csv - metadata per document and topic probability
          • output_html/ - browsable results of the MALLET output
            • all_topics.html
            • Docs/
            • Topics/

    • RQ2/ - contains the data used to answer RQ2

      • datasource_rawdata/ - contains the raw data for each source
        • quora.csv - contains the processed Quora dataset (for example, with HTML tags removed). For the preprocessing steps, please refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
        • stackoverflow.csv - contains the processed Stack Overflow dataset. For the preprocessing steps, please refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
      • manual_analysis_output/
        • stackoverflow_quora_taxonomy.xlsx - contains the classified Stack Overflow and Quora dataset and a description of the taxonomy.
          • Taxonomy - contains the description of the first dimension and second dimension categories. Second dimension categories are further divided into levels, separated by the | symbol.
          • stackoverflow-posts - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories.
          • quora-posts - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories.

  9. Stack Overflow Developer Survey Dataset

    • kaggle.com
    zip
    Updated Jan 8, 2024
    Cite
    Palvinder (2024). Stack Overflow Developer Survey Dataset [Dataset]. https://www.kaggle.com/datasets/palvinder2006/stackoverflow
    Explore at:
    Available download formats: zip (9459089 bytes)
    Dataset updated
    Jan 8, 2024
    Authors
    Palvinder
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    The Stack Overflow Developer Survey Dataset represents one of the most trusted and comprehensive sources of information about the global developer community. Collected by Stack Overflow through its annual survey, the dataset provides insights into the demographics, preferences, habits, and career paths of developers.

    This dataset is frequently used for:

    • Analyzing trends in programming languages, tools, and technologies.
    • Understanding developer job satisfaction, compensation, and work environments.
    • Studying global and regional differences in developer demographics and experience.

    The data consists of two CSV files: "survey_results_public", which contains the responses, and "survey_results_schema", which describes each column in detail.

    Data Dictionary: All the details are in "survey_results_schema.csv"

    Features of the Stack Overflow Developer Survey Dataset

    Demographic & Background Information

    • Respondent: A unique identifier for each survey participant.
    • MainBranch: Describes whether the respondent is a professional developer, student, hobbyist, etc.
    • Country: The country where the respondent lives.
    • Age: The respondent's age.
    • Gender: The gender identity of the respondent.
    • Ethnicity: Ethnic background (when available).
    • EdLevel: The highest level of formal education completed.
    • UndergradMajor: The respondent's undergraduate major.
    • Hobbyist: Indicates whether the person codes as a hobby (Yes/No).

    Employment & Professional Experience

    • Employment: Employment status (full-time, part-time, unemployed, student, etc.).
    • DevType: Types of developer roles the respondent identifies with (e.g., Web Developer, Data Scientist).
    • YearsCode: Number of years the respondent has been coding.
    • YearsCodePro: Number of years coding professionally.
    • JobSat: Job satisfaction level.
    • CareerSat: Career satisfaction level.
    • WorkWeekHrs: Approximate hours worked per week.
    • RemoteWork: Whether the respondent works remotely and how frequently.

    Compensation

    • CompTotal: Total compensation in USD (including salary, bonuses, etc.).
    • CompFreq: Frequency of compensation (e.g., yearly, monthly).

    Learning & Education

    • LearnCode: How the respondent first learned to code (e.g., online courses, university).
    • LearnCodeOnline: Online resources used (e.g., YouTube, freeCodeCamp).
    • LearnCodeCoursesCert: Whether the respondent has taken online courses or earned certifications.

    Technology & Tools

    • LanguageHaveWorkedWith: Programming languages the respondent has used.
    • LanguageWantToWorkWith: Languages the respondent is interested in learning or using more.
    • DatabaseHaveWorkedWith: Databases the respondent has experience with.
    • PlatformHaveWorkedWith: Platforms used (e.g., Linux, AWS, Android).
    • OpSys: The operating system used most often.
    • NEWCollabToolsHaveWorkedWith: Collaboration tools used (e.g., Slack, Teams, Zoom).
    • NEWStuck: How often the respondent feels stuck when coding.
    • ToolsTechHaveWorkedWith: Frameworks and technologies respondents have worked with.

    Online Presence & Community

    • SOAccount: Whether the respondent has a Stack Overflow account.
    • SOPartFreq: How often the respondent participates on Stack Overflow.
    • SOVisitFreq: Frequency of visiting Stack Overflow.
    • SOComm: Whether the respondent feels welcome in the Stack Overflow community.
    • OpenSourcer: Level of involvement in open-source contributions.

    Opinions & Preferences

    • WorkChallenge: Challenges faced at work (e.g., unclear requirements, unrealistic expectations).
    • JobFactors: Important job factors (e.g., salary, work-life balance, technologies used).
    • MentalHealth: Questions on how mental health affects or is affected by their job.
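    Columns such as LanguageHaveWorkedWith can hold several choices per respondent. The sketch below shows one way to tally them, assuming the choices are packed into a single cell separated by semicolons and that missing answers appear as an empty string or "NA"; both are assumptions about the file layout, and the sample values are hypothetical.

```python
from collections import Counter


def count_choices(values, sep=";"):
    """Count individual choices in a multi-select column.

    Assumption: each cell holds choices joined by `sep`, and missing
    answers are stored as "" or "NA".
    """
    counts = Counter()
    for value in values:
        if not value or value == "NA":
            continue
        counts.update(part.strip() for part in value.split(sep) if part.strip())
    return counts


# Hypothetical cell values for a LanguageHaveWorkedWith-style column.
sample = ["Python;SQL", "Python;JavaScript", "NA", ""]
print(count_choices(sample).most_common(1))
```

    If the real file uses a different separator or missing-value marker, only the `sep` argument and the skip condition need to change.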

  10. Stack Overflow BigQuery Dataset

    • live.european-language-grid.eu
    Updated Dec 30, 2018
    Cite
    Stack Overflow (2018). Stack Overflow BigQuery Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5094
    Explore at:
    Dataset updated
    Dec 30, 2018
    Dataset authored and provided by
    Stack Overflow (http://stackoverflow.com/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges.

  11. stackoverflow-qa

    • huggingface.co
    Updated Aug 5, 2024
    Cite
    Massive Text Embedding Benchmark (2024). stackoverflow-qa [Dataset]. https://huggingface.co/datasets/mteb/stackoverflow-qa
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 5, 2024
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    Description

    mteb/stackoverflow-qa dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. Data from: stackoverflow

    • huggingface.co
    Updated Dec 8, 2024
    + more versions
    Cite
    ML Foundations Development (2024). stackoverflow [Dataset]. https://huggingface.co/datasets/mlfoundations-dev/stackoverflow
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 8, 2024
    Dataset authored and provided by
    ML Foundations Development
    Description

    mlfoundations-dev/stackoverflow dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. Data from: Stack Overflow

    • console.cloud.google.com
    Updated Aug 13, 2024
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Stack%20Exchange&hl=id (2024). Stack Overflow [Dataset]. https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow?hl=id
    Explore at:
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

  14. Data from: Stack Overflow's Hidden Nuances: How Does Zip Code Define User...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 23, 2024
    + more versions
    Cite
    Anonymised (2024). Stack Overflow's Hidden Nuances: How Does Zip Code Define User Contribution? [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_11044295
    Explore at:
    Dataset updated
    Apr 23, 2024
    Authors
    Anonymised
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Collective intelligence constitutes a foundational element within online community question-and-answering (CQA) platforms, such as Stack Overflow, being the source of most programming-related issues. Despite this relevance, concerns remain regarding issues surrounding user participation. Precedent research tends to focus on simple numerical measurements to analyse participation, which may sideline the inherent, subtler aspects.

    The proposed study aims to bridge this gap by operationalising 11 distinct metrics to represent user participation, behaviour, and community value across different regions of the USA. The study also conducts inductive content analysis to understand the impact of regional contextual factors on users' knowledge sharing patterns.

    This replication package is provided for those interested in further examining our research methodology.

  15. Stack Overflow Developer Survey, 2017 A look into the lives of over 64,000...

    • dataandsons.com
    csv, zip
    Updated Jun 28, 2018
    Cite
    Verka Bicic (2018). Stack Overflow Developer Survey, 2017 A look into the lives of over 64,000 Stack Overflow developers [Dataset]. https://www.dataandsons.com/categories/surveys/stack-overflow-developer-survey-2017-a-look-into-the-lives-of-over-64-000-stack-overflow-developers
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Jun 28, 2018
    Dataset provided by
    Authors
    Verka Bicic
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2017 - Nov 5, 2017
    Description

    About this Dataset

    Every year, Stack Overflow conducts a massive survey of people on the site, covering all sorts of information like programming languages, salary, code style and various other information. This year, they amassed more than 64,000 responses fielded from 213 countries.

    Data

    The data is made up of two files:

    1. survey_results_public.csv - CSV file with main survey results, one respondent per row and one column per answer
    2. survey_results_schema.csv - CSV file with survey schema, i.e., the questions that correspond to each column name

    Acknowledgements

    Data is directly taken from StackOverflow and licensed under the ODbL license.
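    The schema file described above can serve as a lookup from column name to question text. The following is a minimal sketch using only the standard library; the header names `Column` and `Question` and the sample rows are assumptions for illustration and should be checked against the real file.

```python
import csv
import io


def load_schema(schema_file, name_col="Column", question_col="Question"):
    """Map each results column name to its survey question text.

    Assumption: the schema CSV has one header for the column name and
    one for the question; the default header names are hypothetical.
    """
    reader = csv.DictReader(schema_file)
    return {row[name_col]: row[question_col] for row in reader}


# Hypothetical stand-in for survey_results_schema.csv.
schema_csv = io.StringIO(
    "Column,Question\n"
    "Respondent,Respondent ID number\n"
    "Country,In which country do you currently live?\n"
)
schema = load_schema(schema_csv)
print(schema["Country"])
```

    For the real file, `open("survey_results_schema.csv", newline="")` would replace the in-memory `io.StringIO` object.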

    Category

    Surveys

    Keywords

    internet,Information Technology,coding

    Row Count

    51248

    Price

    Free

  16. 2024 - Stack Overflow Annual Developer Survey

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    IJnskje T (2024). 2024 - Stack Overflow Annual Developer Survey [Dataset]. https://www.kaggle.com/datasets/ijnskjet/2024-stack-overflow-annual-developer-survey
    Explore at:
    Available download formats: zip (17710263 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    IJnskje T
    License

    https://ec.europa.eu/info/legal-notice_en

    Description

    This dataset comes from a Stack Overflow survey held in May 2024. Over 65,000 people responded, answering questions about work, AI, coding, programming languages, experience, and more.

    Image from: https://www.passionateinmarketing.com/fastest-growing-programming-languages/

    CSV downloaded from https://survey.stackoverflow.co/ (copy and paste the URL into a new tab; otherwise the correct page does not seem to open).

    There are 114 columns with a lot of data, some more useful than others. I completed and published the notebook first, and am now making this dataset public as part of my learning process, to see whether it works as it should.

    I have finished three notebooks and this dataset. I am sure you will be able to get more out of it, but please review mine and share feedback.

  17. Pylint Results for Python Code Snippets on Stack Overflow

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Nikolaos Bafatakis; Niels Boecker; Wenjie Boon; Martin Cabello Salazar; Gazi Oznacar; Jens Krinke; Robert White (2020). Pylint Results for Python Code Snippets on Stack Overflow [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_2558543
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    University College London
    Authors
    Nikolaos Bafatakis; Niels Boecker; Wenjie Boon; Martin Cabello Salazar; Gazi Oznacar; Jens Krinke; Robert White
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains valid pylint results for all Stack Overflow code snippets from SOTorrent that meet the following criteria:

    • Tagged with 'python'
    • 6 lines and above
    • Contains basic python syntax (i.e. 'print', 'import', '(', '=')
    • Produces a result when processed by Pylint
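    The first three criteria can be expressed as a simple filter. This is a hedged sketch, not the authors' pipeline: the function name is hypothetical, it assumes a snippet qualifies if it contains any one of the listed tokens (the description does not say whether one or all are required), and it does not run Pylint, so the fourth criterion is left out.

```python
# Tokens taken from the criteria listed in the description.
BASIC_TOKENS = ("print", "import", "(", "=")


def is_candidate(snippet, tags, min_lines=6):
    """Check the tag, length, and basic-syntax criteria for one snippet.

    Assumption: matching ANY token in BASIC_TOKENS is enough.
    The Pylint step is not performed here.
    """
    if "python" not in tags:
        return False
    if len(snippet.splitlines()) < min_lines:
        return False
    return any(token in snippet for token in BASIC_TOKENS)


# Hypothetical snippets for illustration.
short_snippet = "print('hi')"
long_snippet = "\n".join(f"x{i} = {i}" for i in range(6))
print(is_candidate(long_snippet, ["python"]))
```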

  18. The SciSO Dataset: Mining Science in Stack Overflow

    • figshare.com
    json
    Updated Dec 5, 2024
    Cite
    Run "Ryan" Huang; Hrushikesh Vaidya; Souti Chattopadhyay (2024). The SciSO Dataset: Mining Science in Stack Overflow [Dataset]. http://doi.org/10.6084/m9.figshare.27967092.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Run "Ryan" Huang; Hrushikesh Vaidya; Souti Chattopadhyay
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Stack Overflow, often recognized as the go-to resource for software practitioners, has seen a growing trend of discussions around academic research. Yet, little is known about these academic references and how they intersect developers’ needs and interests. To bridge this gap, we presented a novel approach for identifying academic reference links and curated the first comprehensive dataset of academic references on Stack Overflow. This dataset contains 19,582 academic references mined from Stack Overflow posts, answers, comments, etc. as of October 7, 2024.

  19. Reddit and StackOverflow dataset (Programming languages)

    • zenodo.org
    zip
    Updated Mar 7, 2023
    Cite
    Daniele De Vinco; Alessia Antelmi (2023). Reddit and StackOverflow dataset (Programming languages) [Dataset]. http://doi.org/10.5281/zenodo.7685062
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele De Vinco; Alessia Antelmi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains anonymized data collected from Reddit (via the Pushshift API) and StackOverflow (from Kaggle's dataset).

    Each folder includes the data split by trimester. The schema of StackOverflow and Reddit-related files follows:

    • Fields from StackOverflow
      • question_id
      • answer_id
      • creation_date - answer creation_date
      • score - score of the question/answer
      • tags - all tags flagged for a question
      • answer_count - number of answers for a question
      • start_question - question's time of creation
      • last_activity_date - last update on the question
      • new_id - hashed id of the answerer
      • q_new_id - hashed id of the questioner
    • Fields from Reddit
      • comment_id
      • submission_id
      • score - score of the question/submission
      • subreddit
      • created_utc - time of creation (unrelated to last modified comments)
      • new_id - hashed id

    The .txt files represent the structure of the corresponding hypergraphs.
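    As an illustration of how the StackOverflow fields listed above relate, the sketch below checks whether an answer arrived within a given time of its question, using `start_question` (question creation time) and `creation_date` (answer creation time). The timestamp format is an assumption (ISO 8601 strings), and the sample record is hypothetical.

```python
from datetime import datetime


def answered_within(record, hours=1.0):
    """Return True if the answer was posted within `hours` of the question.

    Assumption: both timestamps are ISO 8601 strings; the real files
    may store them differently.
    """
    asked = datetime.fromisoformat(record["start_question"])
    answered = datetime.fromisoformat(record["creation_date"])
    delay = (answered - asked).total_seconds()
    return 0 <= delay <= hours * 3600


# Hypothetical record for illustration only.
record = {
    "start_question": "2020-01-06T10:00:00",
    "creation_date": "2020-01-06T10:45:00",
}
print(answered_within(record))
```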

  20. stackoverflow-kubernetes-questions

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    AI Lover, stackoverflow-kubernetes-questions [Dataset]. https://huggingface.co/datasets/mcipriano/stackoverflow-kubernetes-questions
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    AI Lover
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The purpose of this dataset is to provide the opportunity to perform any training, fine-tuning, etc. for any Language Model. In the 'data' folder, you will find the dataset in Parquet format, which is one of the formats used for these processes. In case it may be useful for other purposes, I have also included the dataset in CSV format. All data in this dataset were retrieved from the Stack Exchange network using the Stack Exchange Data explorer tool… See the full description on the dataset page: https://huggingface.co/datasets/mcipriano/stackoverflow-kubernetes-questions.
