Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first revision of the dataset was published in the paper "Matching Problem Statements to Editorials in Competitive Programming" - ICALT 2024 https://ieeexplore.ieee.org/abstract/document/10645920
The second revision, which is 7 times larger, was published in the paper "Domain Adaptation for Automated Tag Prediction in Competitive Programming" - AIAI 2025
If you use this dataset, please cite one of the papers in your research.
The repositories for the papers can be found at 1. https://github.com/DinuGeorge0019/MatchingProblemStatementsToEditorialsInCP 2. https://github.com/DinuGeorge0019/MLCP
Competitive programming is a challenging task that demands proficiency in computer science concepts and strong problem-solving skills.
A significant limitation in the field of competitive programming, in the context of machine learning, is the lack of available datasets that include the problem statement, the editorial, and the source code for research purposes. This limitation hinders the development of new algorithms and techniques to improve the efficiency and accuracy of selecting or creating suitable editorials for given problems.
To address this problem, we have introduced a comprehensive series of over 7000 competitive programming problems that encompass editorial solutions, source code and other metadata.
Note: The PSG-named datasets in the 01_TASK_DATASETS directory are taken from the paper https://arxiv.org/abs/2310.05791 and its public repository https://github.com/sronger/PSG_Predicting_Algorithm_Tags_and_Difficulty
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
| Release | Description |
|---|---|
| v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size. |
| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… |

See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
license: mit
language:
- en
tags:
- programming
- code
- dataset
- snippets
- software-development
pretty_name: Programming Dataset
size_categories:
A high-quality, production-grade dataset of programming code snippets across multiple languages, collected and curated manually to support research in code generation, analysis, and educational tools.
| Field | Description |
|---|---|
| Rows | 100+ code samples |
| Languages | Python, JavaScript, C++, Java, etc. |
| Tasks | Data structures, algorithms, system utilities |
| Format | Excel (.xlsx) and CSV |
| License | MIT |
Each entry includes:
- id: Unique identifier
- language: The programming language used
- task_type: What kind of task the snippet solves (e.g., sorting, API call)
- description: Human-readable explanation
- code: The actual working code (formatted for readability)
Tags: code-generation, multi-language, algorithms, educational, dataset
You can open ProgrammingDataset.xlsx in Excel or load the .csv in Python:
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("kaiiddo/ProgrammingDataset")
This dataset is licensed under the MIT License — feel free to use, modify, and share with attribution.
If you use this dataset in your research, applications, or publications, please cite it as:
@misc{kaiiddo2025programmingdataset,
title = {ProgrammingDataset: A Curated Collection of Code Snippets Across Languages},
author = {Kaiiddo Team},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/kaiiddo/ProgrammingDataset}},
note = {Accessed July 2025}
}
Or:
"We gratefully acknowledge the use of the ProgrammingDataset curated by the Kaiiddo Team (2025) for training and evaluating programming-related models."
Want to contribute more snippets or suggest improvements? Submit a PR or reach out to the maintainers!
As of 2025, JavaScript and HTML/CSS are the most commonly used programming languages among software developers around the world, with more than 66 percent of respondents stating that they used JavaScript and around 61.9 percent using HTML/CSS. Python, SQL, and Bash/Shell rounded out the top five most widely used programming languages around the world.
Programming languages
At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to learn to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills and likely to assist them in their search for employment.
The dataset comes from the Programming Languages Database.
languages.csv
The full data dictionary is available from PLDB.com.
| variable | class | description |
|---|---|---|
| pldb_id | character | A standardized, uniquified version of the language name, used as an ID on the PLDB site. |
| title | character | The official title of the language. |
| description | character | Description of the repo on GitHub. |
| type | character | Which category in PLDB's subjective ontology does this entity fit into. |
| appeared | double | What year was the language publicly released and/or announced? |
| creators | character | Name(s) of the original creators of the language delimited by " and " |
| website | character | URL of the official homepage for the language project. |
| domain_name | character | If the project website is on its own domain. |
| domain_name_registered | double | When was this domain first registered? |
| reference | character | A link to more info about this entity. |
| isbndb | double | Books about this language from ISBNdb. |
| book_count | double | Computed; the number of books found for this language at isbndb.com |
| semantic_scholar | integer | Papers about this language from Semantic Scholar. |
| language_rank | double | Computed; A rank for the language, taking into account various online rankings. The computation for this column is not currently clear. |
| github_repo | character | URL of the official GitHub repo for the project if it is hosted there. |
| github_repo_stars | double | How many stars does the repo have? |
| github_repo_forks | double | How many forks does the repo have? |
| github_repo_updated | double | What year was the last commit made? |
| github_repo_subscribers | double | How many subscribers to the repo? |
| github_repo_created | double | When was the Github repo for this entity created? |
| github_repo_description | character | Description of the repo on GitHub. |
| github_repo_issues | double | How many issues on the repo? |
| github_repo_first_commit | double | What year was the first commit made in this git repo? |
| github_language | character | GitHub has a set of supported languages as defined here |
| github_language_tm_scope | character | The TextMate scope that represents this programming language. |
| github_language_type | character | Either data, programming, markup, prose, or nil. |
| github_language_ace_mode | character | A String name of the Ace Mode used for highlighting whenever a file is edited. This must match one of the filenames in http://git.io/3XO_Cg. Use "text" if a mode does not exist. |
| github_language_file_extensions | character | An Array of associated extensions (the first one is considered the primary extension, the others should be listed alphabetically). |
| github_language_repos | double | How many repos for this language does GitHub report? |
| wikipedia | character | URL of the entity on Wikipedia, if and only if it has a page dedicated to it. |
| wikipedia_daily_page_views | double | How many page views per day does this Wikipedia page get? Useful as a signal for rankings. Available via WP api. |
| wikipedia_backlinks_count | double | How many pages on WP link to this page? |
| wikipedia_summary | character | What is the text summary of the language from the Wikipedia page? |
| wikipedia_page_id | double | What is the internal ID for this entity on WP? |
| wikipedia_appeared | double | When does Wikipedia claim this entity first appeared? |
| wikipedia_created | double | When was the Wikipedia page for this entity created? |
| wikipedia_revision_count | double | How many revisions does this page have? |
| wikipedia_related | character | What languages does Wikipedia have as related? |
| features_has_comments | logical | Does this language have a comment character? |
| features_has_semantic_indentation | logical | Does indentation have semantic meaning in this language? |
| features_has_line_comments | logical | Does this language support inline comments (as opposed to comments that must span an entire line)? |
| line_comment_token | character | ... |
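
As a rough illustration of how the data dictionary above might be applied, the sketch below reads rows with Python's standard `csv` module and converts a few columns to the types listed in the table. The two sample rows are made-up placeholders rather than real PLDB records, and the assumption that logical columns are serialized as `TRUE`/`FALSE` with `NA` for missing values is ours, not something the source states.

```python
import csv
import io

# Made-up placeholder rows (NOT real PLDB data); only the column names
# come from the data dictionary above.
SAMPLE_CSV = """pldb_id,title,appeared,creators,features_has_comments
examplelang,ExampleLang,1999,Ada Example and Bob Sample,TRUE
otherlang,OtherLang,2005,NA,NA
"""


def parse_logical(value):
    """Map an assumed TRUE/FALSE/NA encoding to True/False/None."""
    return {"TRUE": True, "FALSE": False}.get(value)


def parse_row(row):
    """Convert selected string fields of one CSV row to typed values."""
    creators = row["creators"]
    appeared = row["appeared"]
    return {
        "pldb_id": row["pldb_id"],
        "title": row["title"],
        # `appeared` is documented as a double holding a year.
        "appeared": None if appeared == "NA" else int(float(appeared)),
        # `creators` is documented as delimited by " and ".
        "creators": [] if creators == "NA" else creators.split(" and "),
        "features_has_comments": parse_logical(row["features_has_comments"]),
    }


def load_languages(text):
    """Parse CSV text into a list of typed row dictionaries."""
    return [parse_row(row) for row in csv.DictReader(io.StringIO(text))]


languages = load_languages(SAMPLE_CSV)
```

With the real languages.csv, the same functions could be applied to the file contents after checking how missing and logical values are actually encoded.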
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Python Coding Dataset
This dataset contains high-quality Python function examples designed for fine-tuning coding-focused language models. It includes carefully curated samples covering common programming tasks, bug fixes, refactoring, and code completions.
Dataset Details
Number of samples: 732 (and growing)
Purpose: Fine-tuning LLMs to generate accurate and idiomatic Python code
Content: Functions, bug fixes, refactors, completions
License: MIT License… See the full description on the dataset page: https://huggingface.co/datasets/Hoglet-33/python-coding-dataset.
The dataset is a block-based programming dataset used to train a code classification model to predict students' success on a given problem.
As of 2024, JavaScript is the most popular programming language used in the past 12 months by software developers worldwide, according to ** percent of the software developers surveyed. It is followed by Python, at ** percent of the respondents surveyed.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.
The centerpiece of this repository is the usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate on the popularity of programming languages, this dataset was created using 3 vastly different sources.
The dataset was created using the GitHub repository above. As input data, three public datasets were used.
Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all GitHub repositories with more than 5 stars.
Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows, for each month from 2004 to 2023, the share of programming-related Google searches per language.
Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows the results of the yearly Stack Overflow developer survey from 2011 to 2023.
All these datasets were downloaded on 12.12.2023. The datasets are all in the GitHub repository above.
The dataset contains a column for the year and then many columns for the different languages, denoting their usage in percent. Additionally, vertical bar charts and pie charts for each year, plus a line graph for each language over the whole timespan, are provided as PNG files.
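
The layout described above (one year column plus one percentage column per language) can be sketched as follows. The column name `year`, the language column names, and all numbers in the sample are illustrative placeholders, not values taken from usage_of_programming_languages_2011-2023.csv.

```python
import csv
import io

# Placeholder data in the described layout; the header names and the
# percentages are assumptions for illustration only.
SAMPLE_CSV = """year,Python,Java,C
2011,10.0,30.0,20.0
2012,15.0,28.0,19.5
"""


def top_language_per_year(text, year_column="year"):
    """Return {year: (language, percent)} for the highest-usage language."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        year = int(row[year_column])
        # Every column except the year is treated as a language percentage.
        usage = {
            name: float(value)
            for name, value in row.items()
            if name != year_column and value != ""
        }
        best = max(usage, key=usage.get)
        result[year] = (best, usage[best])
    return result


top = top_language_per_year(SAMPLE_CSV)
```

With the real file, the same function could be called on the text read from disk, after checking the actual header names.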
The languages considered for the project are:
- Python
- C
- C++
- Java
- C#
- JavaScript
- PHP
- SQL
- Assembly
- Scratch
- Fortran
- Go
- Kotlin
- Delphi
- Swift
- Rust
- Ruby
- R
- COBOL
- F#
- Perl
- TypeScript
- Haskell
- Scala
This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/.
TLDR: You are free to share, adapt, and create derivative works from this dataset as long as you attribute me, keep the database open (if you redistribute it), and continue to share-alike any adapted database under the ODbL.
Thanks go out to
- stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.
- the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.
- Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 8,712 files across 6 programming languages, featuring verified tasks and benchmarks for evaluating coding agents and language models. It supports coding agents, language models, and developer tools with verified benchmark scores and multi-language test sets.
| Characteristic | Data |
|---|---|
| Description | An extended benchmark of real-world software engineering tasks with enhanced artifacts and broader language coverage |
| Data types | Text |
| Tasks | Bug fixing, code completion, pull request generation, automated code review |
| Total number of files | 8,712 |
| Total number of people | 30 |
| Labeling | Annotated with golden patches, test patches, post-patch reference states, and metadata stored in parquet files (e.g., repository name, issue/PR identifier, diffs, test results) |
| Programming languages | C#, Go, PHP, Rust, Kotlin, Ruby |
https://www.marketresearchforecast.com/privacy-policy
Discover the booming market for online programming language learning platforms! This in-depth analysis reveals market size, growth projections (CAGR 15%), key players (Coursera, Udemy, Udacity), and future trends, helping you understand this rapidly expanding sector.
JavaScript and Java were some of the most tested programming languages on the DevSkiller platform as of 2024. SQL and Python ranked second and fourth, with ** percent and ** percent of respondents testing these languages in 2024, respectively. Nevertheless, the tech skill developers wanted to learn the most in 2024 was related to artificial intelligence, machine learning, and deep learning. At the same time, the fastest growing IT skills among DevSkiller customers were C/C++ and data science, while cybersecurity ranked third.
Software skills
When it came to the most used programming language among developers worldwide, JavaScript took the top spot, chosen by 62 percent of surveyed respondents. Most software developers learn how to code between 11 and 17 years old, with some of them writing their first line of code by the age of 5. Moreover, seven out of 10 developers learned how to program by accessing online resources such as videos and blogs.
Software skills pay
In 2024, the average annual software developer’s salary in the U.S. amounted to nearly ** thousand U.S. dollars, while in Germany, it totaled above ** thousand U.S. dollars. The programming languages associated with the highest salaries worldwide in 2024 were Clojure and Erlang.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Reasoning with Language and Code
This synthetic dataset is a collection of 1.6 million short and clear code snippets that can help LLMs learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.
https://www.enterpriseappstoday.com/privacy-policy
Programming languages statistics: The tech market, which is booming along with digital marketing, can be a good source of income. It covers many things, including programming languages, which are the basis for building websites, games, software, mobile applications, and more. There are nearly 9,000 programming languages around the world, each with its own features. These statistics give an overview of facts and general knowledge about the programming languages available worldwide.
Programming Languages Statistics (Editor’s Choice)
- There are 8,945 programming languages, according to the most popular programming languages statistics.
- As of 2022, JavaScript is one of the most popular programming languages, with around 47.86% of recruiters demanding JavaScript skills.
- A basic Python developer earns between $70,000 and $1,00,00 a year.
- According to the most popular programming languages statistics, Python ranks number 1 in the United States of America, India, Germany, France, and the United Kingdom.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Java Coding Dataset
This dataset contains high-quality Java code samples designed for fine-tuning coding-focused language models. It includes a diverse set of examples such as utility functions, class definitions, interface implementations, and exception handling.
Dataset Details
Number of samples: 520 (and growing)
Purpose: Fine-tuning LLMs to generate accurate and idiomatic Java code
Content: Functions, classes, interfaces, exception handling
License: MIT License… See the full description on the dataset page: https://huggingface.co/datasets/Hoglet-33/java-coding-dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A large-scale dataset covering multiple programming languages, with rich information.
https://crawlfeeds.com/privacy_policy
Programming YouTube videos dataset. More than 300 records extracted. Last extracted on 24 Jan 2022.
Get in touch with the Crawlfeeds team for large datasets and customized YouTube datasets.
These files accompany the book entitled: An Introduction to Programming Languages: Simultaneous Learning in Multiple Coding Environments. This work is an introductory textbook in several computer languages. It describes the most well-known and popular programming environments such as: C#, C++, Java, JavaScript, PERL, PHP, Python, Ruby, and Visual Basic (VB) or Visual Basic for Applications (VBA). Therefore, the main objective of this unique guide is to provide code examples reflected in these nine computer languages. Readers can easily understand the connection and universality between the syntax of different environments and be adept at translating code. This learning experience can be ideal for upper-undergraduate introductory courses, researchers, doctoral students, and sociologists or engineers charged with implementing data analysis. Graphical illustrations are used for technical details about the computation examples to aid in an in-depth understanding of their inner workings. Moreover, the book contains original material that has been class-tested by the author and numerous cases are examined. Readers will also benefit from the inclusion of:
a) Historical and philosophical perspectives on the past, present and future of computer languages.
b) A total of 448 additional files freely available online, from which a total of 44 files are poster presentations (i.e. PowerPoint and PDF files).
c) A total of 404 code examples reflected in nine computer languages, namely: C#, C++, Java, JavaScript, PERL, PHP, Python, Ruby and VB.
This work first begins with a general introduction to history and presents the natural inevitable pathway from mechanical automatons to present electronic computers. Following this historical introduction, an in-detail look is made on philosophical questions, implementations, entropy and life. More often than not, there is a genuine amazement of the younger generations regarding the advancement of computer technology.
Historical events that led to the development of technologies have been distilled down to the essence. However, the essence of any story is made with massive loss of detailed information. The essence of essences even more so. Over time, the lack of detail leads to a collective amnesia that can prevent us from understanding the naturalness by which technology has evolved. Thus, new constructs are always built upon older constructs to fit the evolutionary chain of technological progress, which boils down to the same fundamental rules as biological evolution. In the first stage, this book discusses the natural path of programming constructs by starting from time immemorial and ending with examples up to the present times. In the end, naturally driven constructs of all kinds also drive our society today. In the second part, the emphasis is made on the technical side where a total of nine computer languages are used simultaneously for mirrored examples. Simultaneous learning of multiple computer languages can be regarded as an asset in the world of science and technology. Thus, the reader can get used to the majority of known programming or scripting languages. Moreover, a basic knowledge of software implementation in several computer languages, even in an introductory way, helps the versatility and adaptability of the reader to new situations that may arise in industry, education, or research. Thus, this work is meant to bring a more concrete understanding of the similarities and differences between computer languages. Paul A. Gagniuc. An Introduction to Programming Languages: Simultaneous Learning in Multiple Coding Environments. Synthesis Lectures on Computer Science. Springer International Publishing, 2023, pp. 1-280.
https://www.datainsightsmarket.com/privacy-policy
Discover the booming market for online programming language learning platforms. Explore market size, growth trends, key players (Coursera, Udemy, Udacity), and future projections in this comprehensive analysis. Learn how companies are capitalizing on the rising demand for coding skills.
https://www.marketreportanalytics.com/privacy-policy
The Programming Language Training Market was valued at USD 6.02 billion in 2024 and is projected to reach USD 20.72 billion by 2033, with an expected CAGR of 19.31% during the forecast period. Key drivers for this market are: increasing demand for skilled programmers, growing adoption of online learning, and government initiatives to promote coding education. Potential restraints include: availability of free online resources, lack of standardized training programs, and economic downturns. Notable trends are: The rising adoption of digital technologies across various industries has created a high demand for skilled programmers who can develop and maintain software applications. The convenience and affordability of online learning platforms have made it accessible for individuals to acquire programming skills. The advent of new programming languages and technologies, such as artificial intelligence and machine learning, is driving the demand for training in these areas.