100+ datasets found
  1. Codeforces Competitive Programming Dataset

    • kaggle.com
    zip
    Updated Jul 4, 2025
    Cite
    Dinu Ion George (2025). Codeforces Competitive Programming Dataset [Dataset]. https://www.kaggle.com/datasets/dinuiongeorge/codeforces-competitive-programming-dataset
    Explore at:
    Available download formats: zip (538337548 bytes)
    Dataset updated
    Jul 4, 2025
    Authors
    Dinu Ion George
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The first revision of the dataset was published in the paper "Matching Problem Statements to Editorials in Competitive Programming" (ICALT 2024): https://ieeexplore.ieee.org/abstract/document/10645920

    The second revision, which is seven times larger, was published in the paper "Domain Adaptation for Automated Tag Prediction in Competitive Programming" (AIAI 2025).

    If you use this dataset in your research, please cite one of the papers.

    The repositories for the papers can be found at:
    1. https://github.com/DinuGeorge0019/MatchingProblemStatementsToEditorialsInCP
    2. https://github.com/DinuGeorge0019/MLCP

    Competitive programming is a challenging task that demands proficiency in computer science concepts and strong problem-solving skills.

    A significant limitation in the field of competitive programming, in the context of machine learning, is the lack of available datasets that include the problem statement, the editorial, and the source code for research purposes. This limitation hinders the development of new algorithms and techniques to improve the efficiency and accuracy of selecting or creating suitable editorials for given problems.

    To address this problem, we introduce a comprehensive collection of over 7,000 competitive programming problems, each with editorial solutions, source code, and other metadata.

    Note: The PSG-named datasets in the 01_TASK_DATASETS directory come from the paper https://arxiv.org/abs/2310.05791, whose public repository is https://github.com/sronger/PSG_Predicting_Algorithm_Tags_and_Difficulty

  2. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.

  3. ProgrammingDataset

    • kaggle.com
    • huggingface.co
    zip
    Updated Jul 20, 2025
    Cite
    Aryan Rathod (2025). ProgrammingDataset [Dataset]. https://www.kaggle.com/datasets/kaiiddo/programmingdataset
    Explore at:
    Available download formats: zip (13665 bytes)
    Dataset updated
    Jul 20, 2025
    Authors
    Aryan Rathod
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    license: mit
    language:
    - en
    tags:
    - programming
    - code
    - dataset
    - snippets
    - software-development
    pretty_name: Programming Dataset
    size_categories:
    - n<1K

    🧠 ProgrammingDataset

    A high-quality, production-grade dataset of programming code snippets across multiple languages, collected and curated manually to support research in code generation, analysis, and educational tools.

    📌 Dataset Summary

    Field | Description
    Rows | 100+ code samples
    Languages | Python, JavaScript, C++, Java, etc.
    Tasks | Data structures, algorithms, system utilities
    Format | Excel (.xlsx) and CSV
    License | MIT

    Each entry includes:
    - id: Unique identifier
    - language: The programming language used
    - task_type: What kind of task the snippet solves (e.g., sorting, API call)
    - description: Human-readable explanation
    - code: The actual working code (formatted for readability)
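
    The five fields listed above map naturally onto a CSV with one column per field. As a minimal sketch, assuming the .csv export uses exactly those column names (an assumption, not confirmed by the dataset page), the rows could be grouped by language with the standard library alone. The sample rows below are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows that only mimic the documented columns
# (id, language, task_type, description, code); not real dataset content.
SAMPLE_CSV = """id,language,task_type,description,code
1,Python,sorting,Sort a list of numbers,"sorted(nums)"
2,JavaScript,API call,Fetch JSON from a URL,"fetch(url).then(r => r.json())"
3,Python,string utility,Reverse a string,"s[::-1]"
"""


def group_by_language(csv_text):
    """Return {language: [row dicts]} for rows with the documented columns."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["language"]].append(row)
    return dict(groups)


snippets = group_by_language(SAMPLE_CSV)
for language, rows in sorted(snippets.items()):
    # Show which task types are available for each language.
    print(language, [row["task_type"] for row in rows])
```

    To run this against the real file, the text of the downloaded .csv would be passed in place of SAMPLE_CSV.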

    🏷️ Tags

    • code-generation
    • multi-language
    • algorithms
    • educational
    • dataset

    📐 Size Categories

    • Small (<1k samples)
    • Easily extensible to thousands with our VBA macro (included in Excel).

    ✨ How to Use

    You can open ProgrammingDataset.xlsx in Excel or load the .csv in Python:

    from datasets import load_dataset
    
    # Login using e.g. `huggingface-cli login` to access this dataset
    ds = load_dataset("kaiiddo/ProgrammingDataset")
    

    🧪 Use Cases

    • Train code generation models (e.g., CodeT5, Codex)
    • Build coding assistants and tutors
    • Analyze cross-language patterns
    • Educate students on common tasks

    📄 License

    This dataset is licensed under the MIT License — feel free to use, modify, and share with attribution.

    📚 Citation

    If you use this dataset in your research, applications, or publications, please cite it as:

    @misc{kaiiddo2025programmingdataset,
     title    = {ProgrammingDataset: A Curated Collection of Code Snippets Across Languages},
     author    = {Kaiiddo Team},
     year     = {2025},
     howpublished = {\url{https://huggingface.co/datasets/kaiiddo/ProgrammingDataset}},
     note     = {Accessed July 2025}
    }
    

    Or:

    "We gratefully acknowledge the use of the ProgrammingDataset curated by the Kaiiddo Team (2025) for training and evaluating programming-related models."

    Want to contribute more snippets or suggest improvements? Submit a PR or reach out to the maintainers!

  4. Most used programming languages among developers worldwide 2025

    • statista.com
    Updated Nov 19, 2025
    Cite
    Statista (2025). Most used programming languages among developers worldwide 2025 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 29, 2025 - Jun 23, 2025
    Area covered
    Worldwide
    Description

    As of 2025, JavaScript and HTML/CSS are the most commonly used programming languages among software developers around the world, with more than 66 percent of respondents stating that they used JavaScript and just around 61.9 percent using HTML/CSS. Python, SQL, and Bash/Shell rounded out the top five most widely used programming languages around the world.

    Programming languages

    At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to learn to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills and likely assist in their search for employment.

  5. Programming Languages

    • kaggle.com
    zip
    Updated Sep 16, 2023
    Cite
    Sujay Kapadnis (2023). Programming Languages [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/programming-languages
    Explore at:
    Available download formats: zip (879324 bytes)
    Dataset updated
    Sep 16, 2023
    Authors
    Sujay Kapadnis
    Description

    The Dataset comes from Programming Languages Database

    languages.csv

    The full data dictionary is available from PLDB.com.

    variable | class | description
    pldb_id | character | A standardized, uniquified version of the language name, used as an ID on the PLDB site.
    title | character | The official title of the language.
    description | character | Description of the repo on GitHub.
    type | character | Which category in PLDB's subjective ontology does this entity fit into.
    appeared | double | What year was the language publicly released and/or announced?
    creators | character | Name(s) of the original creators of the language delimited by " and "
    website | character | URL of the official homepage for the language project.
    domain_name | character | If the project website is on its own domain.
    domain_name_registered | double | When was this domain first registered?
    reference | character | A link to more info about this entity.
    isbndb | double | Books about this language from ISBNdb.
    book_count | double | Computed; the number of books found for this language at isbndb.com
    semantic_scholar | integer | Papers about this language from Semantic Scholar.
    language_rank | double | Computed; a rank for the language, taking into account various online rankings. The computation for this column is not currently clear.
    github_repo | character | URL of the official GitHub repo for the project if it is hosted there.
    github_repo_stars | double | How many stars does the repo have?
    github_repo_forks | double | How many forks does the repo have?
    github_repo_updated | double | What year was the last commit made?
    github_repo_subscribers | double | How many subscribers does the repo have?
    github_repo_created | double | When was the GitHub repo for this entity created?
    github_repo_description | character | Description of the repo on GitHub.
    github_repo_issues | double | How many issues are on the repo?
    github_repo_first_commit | double | What year was the first commit made in this git repo?
    github_language | character | GitHub has a set of supported languages as defined here
    github_language_tm_scope | character | The TextMate scope that represents this programming language.
    github_language_type | character | Either data, programming, markup, prose, or nil.
    github_language_ace_mode | character | A String name of the Ace Mode used for highlighting whenever a file is edited. This must match one of the filenames in http://git.io/3XO_Cg. Use "text" if a mode does not exist.
    github_language_file_extensions | character | An Array of associated extensions (the first one is considered the primary extension, the others should be listed alphabetically).
    github_language_repos | double | How many repos for this language does GitHub report?
    wikipedia | character | URL of the entity on Wikipedia, if and only if it has a page dedicated to it.
    wikipedia_daily_page_views | double | How many page views per day does this Wikipedia page get? Useful as a signal for rankings. Available via WP api.
    wikipedia_backlinks_count | double | How many pages on WP link to this page?
    wikipedia_summary | character | What is the text summary of the language from the Wikipedia page?
    wikipedia_page_id | double | What is the internal ID for this entity on WP?
    wikipedia_appeared | double | When does Wikipedia claim this entity first appeared?
    wikipedia_created | double | When was the Wikipedia page for this entity created?
    wikipedia_revision_count | double | How many revisions does this page have?
    wikipedia_related | character | What languages does Wikipedia have as related?
    features_has_comments | logical | Does this language have a comment character?
    features_has_semantic_indentation | logical | Does indentation have semantic meaning in this language?
    features_has_line_comments | logical | Does this language support inline comments (as opposed to comments that must span an entire line)?
    line_comment_token | character | ...

  6. python-coding-dataset

    • huggingface.co
    Updated Jan 19, 2026
    Cite
    Hoglet (2026). python-coding-dataset [Dataset]. https://huggingface.co/datasets/Hoglet-33/python-coding-dataset
    Explore at:
    Dataset updated
    Jan 19, 2026
    Authors
    Hoglet
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Python Coding Dataset

    This dataset contains high-quality Python function examples designed for fine-tuning coding-focused language models. It includes carefully curated samples covering common programming tasks, bug fixes, refactoring, and code completions.

      Dataset Details
    

    Number of samples: 732 (and growing)
    Purpose: Fine-tuning LLMs to generate accurate and idiomatic Python code
    Content: Functions, bug fixes, refactors, completions
    License: MIT License… See the full description on the dataset page: https://huggingface.co/datasets/Hoglet-33/python-coding-dataset.

  7. Block-based programming dataset - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Block-based programming dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/block-based-programming-dataset
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    The dataset is a block-based programming dataset used to train a code classification model to predict students' success on a given problem.

  8. Programming languages used for software development worldwide 2024

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Programming languages used for software development worldwide 2024 [Dataset]. https://www.statista.com/statistics/869092/worldwide-software-developer-survey-languages-used/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2024
    Area covered
    Worldwide
    Description

    The most popular programming language used in the past 12 months by software developers worldwide is JavaScript as of 2024, according to ** percent of the software developers surveyed. This is followed by Python at ** percent of the respondents surveyed.

  9. Programming Language Ecosystem Project TU Wien

    • test.researchdata.tuwien.at
    • test.researchdata.tuwien.ac.at
    csv, text/markdown
    Updated Jun 25, 2024
    Cite
    Valentin Futterer (2024). Programming Language Ecosystem Project TU Wien [Dataset]. http://doi.org/10.70124/gnbse-ts649
    Explore at:
    Available download formats: text/markdown, csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Valentin Futterer
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Time period covered
    Dec 12, 2023
    Area covered
    Vienna
    Description

    About Dataset

    This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.

    The centerpiece of this repository is the usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate on the popularity of programming languages, this dataset was created using 3 vastly different sources.

    About Data collection methodology

    The dataset was created using the GitHub repository above. As input data, three public datasets were used.

    github_metadata

    Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all github repositories with more than 5 stars.

    PYPL_survey_2004-2023

    Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows from 2004 to 2023 for each month the share of programming related google searches per language.

    stack_overflow_developer_survey

    Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows from 2011 to 2023 the results of the yearly stackoverflow developer survey.

    All these datasets were downloaded on 12.12.2023. The datasets are all in the GitHub repository above.

    Description of the data

    The dataset contains a column for the year, followed by one column per language giving its usage in percent. Additionally, vertical bar charts and pie charts for each year, plus a line graph for each language over the whole timespan, are provided as PNG files.
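
    The layout described above (one year column plus one percentage column per language) lends itself to simple trend extraction. The following is a minimal sketch under stated assumptions: the column name "year" and all percentages in the excerpt are invented for illustration and are not taken from usage_of_programming_languages_2011-2023.csv.

```python
import csv
import io

# Hypothetical excerpt: the header names and percentages are invented to
# mirror the described layout (a year column plus one column per language).
SAMPLE_CSV = """year,Python,Java
2021,20.0,15.0
2022,22.5,14.0
2023,25.0,13.5
"""


def language_trend(csv_text, language):
    """Return [(year, percent)] for one language column, ordered by year."""
    rows = csv.DictReader(io.StringIO(csv_text))
    trend = [(int(row["year"]), float(row[language])) for row in rows]
    return sorted(trend)


def change_over_period(trend):
    """Percentage-point change between the first and the last year."""
    return trend[-1][1] - trend[0][1]


python_trend = language_trend(SAMPLE_CSV, "Python")
print(python_trend)
print(change_over_period(python_trend))
```

    If the real header names differ, only the "year" key and the language argument would need adjusting.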

    The languages that are going to be considered for the project can be seen here:

    - Python

    - C

    - C++

    - Java

    - C#

    - JavaScript

    - PHP

    - SQL

    - Assembly

    - Scratch

    - Fortran

    - Go

    - Kotlin

    - Delphi

    - Swift

    - Rust

    - Ruby

    - R

    - COBOL

    - F#

    - Perl

    - TypeScript

    - Haskell

    - Scala

    License

    This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0: https://opendatacommons.org/licenses/odbl/1-0/

    TLDR: You are free to share, adapt, and create derivative works from this dataset as long as you attribute me, keep the database open (if you redistribute it), and share-alike any adapted database under the ODbL.

    Acknowledgments

    Thanks go out to

    - stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.

    - the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.

    - Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.

  10. SWE-Bench coding dataset 8,712 files

    • kaggle.com
    zip
    Updated Oct 3, 2025
    + more versions
    Cite
    simon graves (2025). SWE-Bench coding dataset 8,712 files [Dataset]. https://www.kaggle.com/datasets/simongraves/swe-bench-coding-tasks
    Explore at:
    Available download formats: zip (146556 bytes)
    Dataset updated
    Oct 3, 2025
    Authors
    simon graves
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    SWE-Bench Dataset - 8,712 files

    The dataset comprises 8,712 files across 6 programming languages, featuring verified tasks and benchmarks for evaluating coding agents and language models. It supports coding agents, language models, and developer tools with verified benchmark scores and multi-language test sets.

    Dataset characteristics:

    Characteristic | Data
    Description | An extended benchmark of real-world software engineering tasks with enhanced artifacts and broader language coverage
    Data types | Text
    Tasks | Bug fixing, code completion, pull request generation, automated code review
    Total number of files | 8,712
    Total number of people | 30
    Labeling | Annotated with golden patches, test patches, post-patch reference states, and metadata stored in parquet files (e.g., repository name, issue/PR identifier, diffs, test results)
    Programming languages | C#, Go, PHP, Rust, Kotlin, Ruby

    Here's a sample dataset to check out. For full access, go here

    Dataset structure

    • Go - Files in Go
    • Scala - Files in Scala

    Similar Datasets:

    1. LLM Text Generation Dataset
    2. Synthetic Printed USA Passports Dataset
    3. DeepFake Videos Dataset

  11. Programming Language Learning Platform Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jan 20, 2026
    Cite
    Market Research Forecast (2026). Programming Language Learning Platform Report [Dataset]. https://www.marketresearchforecast.com/reports/programming-language-learning-platform-531543
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jan 20, 2026
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming market for online programming language learning platforms! This in-depth analysis reveals market size, growth projections (CAGR 15%), key players (Coursera, Udemy, Udacity), and future trends, helping you understand this rapidly expanding sector.

  12. Most popular programming languages worldwide 2024

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Most popular programming languages worldwide 2024 [Dataset]. https://www.statista.com/statistics/1292294/popular-it-skills-worldwide/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 1, 2024 - Jun 30, 2024
    Area covered
    Worldwide
    Description

    JavaScript and Java were some of the most tested programming languages on the DevSkiller platform as of 2024. SQL and Python ranked second and fourth, with ** percent and ** percent of respondents testing this language in 2024, respectively. Nevertheless, the tech skill developers wanted to learn the most in 2024 was related to artificial intelligence, machine learning, and deep learning. At the same time, the fastest growing IT skills among DevSkiller customers were C/C++ and data science, while cybersecurity ranked third.

    Software skills

    When it came to the most used programming language among developers worldwide, JavaScript took the top spot, chosen by 62 percent of surveyed respondents. Most software developers learn how to code between 11 and 17 years old, with some of them writing their first line of code by the age of 5. Moreover, seven out of 10 developers learned how to program by accessing online resources such as videos and blogs.

    Software skills pay

    In 2024, the average annual software developer’s salary in the U.S. amounted to nearly ** thousand U.S. dollars, while in Germany, it totaled above ** thousand U.S. dollars. The programming languages associated with the highest salaries worldwide in 2024 were Clojure and Erlang.

  13. tiny-codes

    • huggingface.co
    Updated Sep 10, 2023
    Cite
    Nam Pham (2023). tiny-codes [Dataset]. http://doi.org/10.57967/hf/0937
    Explore at:
    Dataset updated
    Sep 10, 2023
    Authors
    Nam Pham
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Reasoning with Language and Code

    This synthetic dataset is a collection of 1.6 million short and clear code snippets that can help LLMs learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.

  14. Most Popular Programming Languages Statistics

    • enterpriseappstoday.com
    Updated Jan 5, 2023
    Cite
    EnterpriseAppsToday (2023). Most Popular Programming Languages Statistics [Dataset]. https://www.enterpriseappstoday.com/stats/programming-languages-statistics.html
    Explore at:
    Dataset updated
    Jan 5, 2023
    Dataset authored and provided by
    EnterpriseAppsToday
    License

    https://www.enterpriseappstoday.com/privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Programming languages statistics: The tech market, which is booming alongside digital marketing, can be a good source of income. Programming languages are a core part of that market: they are the basis for building websites, games, software, mobile applications, and more. There are nearly 9,000 programming languages around the world, each with its own features. These statistics cover general information about the programming languages in use worldwide.

    Programming Languages Statistics (Editor’s Choice)

    There are 8,945 programming languages, according to these statistics. As of 2022, JavaScript is one of the most popular programming languages, with around 47.86% of recruiters asking for JavaScript skills. A basic Python developer earns between $70,000 to $1,00,00 a year. According to these statistics, Python ranks number 1 in the United States of America, India, Germany, France, and the United Kingdom.

  15. java-coding-dataset

    • huggingface.co
    Updated Jan 25, 2026
    Cite
    Hoglet (2026). java-coding-dataset [Dataset]. https://huggingface.co/datasets/Hoglet-33/java-coding-dataset
    Explore at:
    Dataset updated
    Jan 25, 2026
    Authors
    Hoglet
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Java Coding Dataset

    This dataset contains high-quality Java code samples designed for fine-tuning coding-focused language models. It includes a diverse set of examples such as utility functions, class definitions, interface implementations, and exception handling.

      Dataset Details
    

    Number of samples: 520 (and growing)
    Purpose: Fine-tuning LLMs to generate accurate and idiomatic Java code
    Content: Functions, classes, interfaces, exception handling
    License: MIT License… See the full description on the dataset page: https://huggingface.co/datasets/Hoglet-33/java-coding-dataset.

  16. MCMD | Multi-programming-language Commit Message Dataset

    • zenodo.org
    application/gzip
    Updated Aug 3, 2023
    + more versions
    Cite
    Anonymous (2023). MCMD | Multi-programming-language Commit Message Dataset [Dataset]. http://doi.org/10.5281/zenodo.4727704
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A large-scale dataset of commit messages across multiple programming languages, with rich information.

  17. Youtube programming videos sample dataset

    • crawlfeeds.com
    json, zip
    Updated Apr 27, 2025
    Cite
    Crawl Feeds (2025). Youtube programming videos sample dataset [Dataset]. https://crawlfeeds.com/datasets/youtube-programming-videos-sample-dataset
    Explore at:
    Available download formats: zip, json
    Dataset updated
    Apr 27, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Area covered
    YouTube
    Description

    Dataset of programming videos on YouTube. More than 300 records were extracted in total; the last extraction was on 24 Jan 2022.

    Get in touch with the Crawl Feeds team for large datasets and customized YouTube datasets.

  18. Programming Languages

    • datasetcatalog.nlm.nih.gov
    Updated Apr 8, 2023
    Cite
    Gagniuc, Paul A. (2023). Programming Languages [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001046864
    Explore at:
    Dataset updated
    Apr 8, 2023
    Authors
    Gagniuc, Paul A.
    Description

    These files accompany the book entitled: An Introduction to Programming Languages: Simultaneous Learning in Multiple Coding Environments. This work is an introductory textbook in several computer languages. It describes the most well-known and popular programming environments such as: C#, C++, Java, JavaScript, PERL, PHP, Python, Ruby, and Visual Basic (VB) or Visual Basic for Applications (VBA). Therefore, the main objective of this unique guide is to provide code examples reflected in these nine computer languages. Readers can easily understand the connection and universality between the syntax of different environments and be adept at translating code. This learning experience can be ideal for upper-undergraduate introductory courses, researchers, doctoral students, and sociologists or engineers charged with implementing data analysis. Graphical illustrations are used for technical details about the computation examples to aid in an in-depth understanding of their inner workings. Moreover, the book contains original material that has been class-tested by the author and numerous cases are examined.

    Readers will also benefit from the inclusion of:
    a) Historical and philosophical perspectives on the past, present and future of computer languages.
    b) A total of 448 additional files freely available online, from which a total of 44 files are poster presentations (i.e. PowerPoint and PDF files).
    c) A total of 404 code examples reflected in nine computer languages, namely: C#, C++, Java, JavaScript, PERL, PHP, Python, Ruby and VB.

    This work first begins with a general introduction to history and presents the natural inevitable pathway from mechanical automatons to present electronic computers. Following this historical introduction, an in-detail look is made on philosophical questions, implementations, entropy and life. More often than not, there is a genuine amazement of the younger generations regarding the advancement of computer technology. Historical events that led to the development of technologies have been distilled down to the essence. However, the essence of any story is made with massive loss of detailed information. The essence of essences even more so. Over time, the lack of detail leads to a collective amnesia that can prevent us from understanding the naturalness by which technology has evolved. Thus, new constructs are always built upon older constructs to fit the evolutionary chain of technological progress, which boils down to the same fundamental rules as biological evolution.

    In the first stage, this book discusses the natural path of programming constructs by starting from time immemorial and ending with examples up to the present times. In the end, naturally driven constructs of all kinds also drive our society today.

    In the second part, the emphasis is made on the technical side where a total of nine computer languages are used simultaneously for mirrored examples. Simultaneous learning of multiple computer languages can be regarded as an asset in the world of science and technology. Thus, the reader can get used to the majority of known programming or scripting languages. Moreover, a basic knowledge of software implementation in several computer languages, even in an introductory way, helps the versatility and adaptability of the reader to new situations that may arise in industry, education, or research. Thus, this work is meant to bring a more concrete understanding of the similarities and differences between computer languages.

    Paul A. Gagniuc. An Introduction to Programming Languages: Simultaneous Learning in Multiple Coding Environments. Synthesis Lectures on Computer Science. Springer International Publishing, 2023, pp. 1-280.

  19. Programming Language Learning Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 29, 2025
    Cite
    Data Insights Market (2025). Programming Language Learning Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/programming-language-learning-platform-1391013
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming market for online programming language learning platforms. Explore market size, growth trends, key players (Coursera, Udemy, Udacity), and future projections in this comprehensive analysis. Learn how companies are capitalizing on the rising demand for coding skills.

  20. Programming Language Training Market Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Jun 7, 2025
    Cite
    Market Report Analytics (2025). Programming Language Training Market Report [Dataset]. https://www.marketreportanalytics.com/reports/programming-language-training-market-4003
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Jun 7, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Programming Language Training Market was valued at USD 6.02 billion in 2024 and is projected to reach USD 20.72 billion by 2033, with an expected CAGR of 19.31% during the forecast period. Key drivers for this market are increasing demand for skilled programmers, growing adoption of online learning, and government initiatives to promote coding education. Potential restraints include the availability of free online resources, a lack of standardized training programs, and economic downturns. Notable trends: the rising adoption of digital technologies across industries has created high demand for skilled programmers who can develop and maintain software applications; the convenience and affordability of online learning platforms have made programming skills accessible to individuals; and the advent of new programming languages and technologies, such as artificial intelligence and machine learning, is driving demand for training in these areas.

Cite
Dinu Ion George (2025). Codeforces Competitive Programming Dataset [Dataset]. https://www.kaggle.com/datasets/dinuiongeorge/codeforces-competitive-programming-dataset

Codeforces Competitive Programming Dataset

NLP-Enhanced Competitive Programming Dataset: Codeforces Problem Collection

Explore at:
25 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (538337548 bytes)
Dataset updated
Jul 4, 2025
Authors
Dinu Ion George
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The first revision of the dataset was published in the paper "Matching Problem Statements to Editorials in Competitive Programming" - ICALT 2024 https://ieeexplore.ieee.org/abstract/document/10645920

The second revision, which is 7 times bigger, was published in the paper "Domain Adaptation for Automated Tag Prediction in Competitive Programming" - AIAI 2025

If you use this dataset, cite one of the papers in your research.

The repositories for the papers can be found at:
1. https://github.com/DinuGeorge0019/MatchingProblemStatementsToEditorialsInCP
2. https://github.com/DinuGeorge0019/MLCP

Competitive programming is a challenging task that demands proficiency in computer science concepts and strong problem-solving skills.

A significant limitation in the field of competitive programming, in the context of machine learning, is the lack of available datasets that include the problem statement, the editorial, and the source code for research purposes. This limitation hinders the development of new algorithms and techniques to improve the efficiency and accuracy of selecting or creating suitable editorials for given problems.

To address this problem, we have introduced a comprehensive series of over 7000 competitive programming problems that encompass editorial solutions, source code and other metadata.

Note: The PSG-named datasets in the 01_TASK_DATASETS directory come from the paper https://arxiv.org/abs/2310.05791, whose public repository is https://github.com/sronger/PSG_Predicting_Algorithm_Tags_and_Difficulty
