https://creativecommons.org/publicdomain/zero/1.0/
Welcome to an exceptional dataset meticulously crafted for training state-of-the-art language models such as Gemma, Llama 2, Orca, and more.
Dataset Highlights
- Challenging Questions: Immerse your language models in various Python programming questions designed to stimulate cognitive growth.
- Real-world Inputs: Provide your models with authentic input scenarios, ensuring they are well-equipped to handle practical coding challenges.
- Accurate Answers: Sharpen the precision of your language models by exposing them to meticulously crafted Python code solutions.
How to Get Started
- Download: Grab a copy of the dataset and inject new life into your language models.
- Build Brilliance: Watch your LLMs evolve as they engage with the challenging questions and nuanced coding scenarios.
- Share & Collaborate: Join the Kaggle community to discuss, share insights, and collaborate with fellow enthusiasts.
Unleash the full potential of your language models with this dataset. Elevate your LLM training experience and witness unprecedented growth in language understanding and coding prowess. Happy coding!
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
🧠ProgrammingDataset
A high-quality, production-grade dataset of programming code snippets across multiple languages, collected and curated manually to support research in code generation, analysis, and educational tools.
📌 Dataset Summary
| Field | Description |
|---|---|
| Rows | 100+ code samples |
| Languages | Python, JavaScript, C++, Java, etc. |
| Tasks | Data structures, algorithms, system utilities |
| Format | Excel (.xlsx) and CSV |
| License | MIT |
Each entry includes:
id:…
See the full description on the dataset page: https://huggingface.co/datasets/kaiiddo/ProgrammingDataset.
As of 2025, JavaScript and HTML/CSS are the most commonly used programming languages among software developers worldwide: more than 66 percent of respondents reported using JavaScript and about 61.9 percent HTML/CSS. Python, SQL, and Bash/Shell rounded out the top five most widely used programming languages.

Programming languages
At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most desirable data science skills, which will likely assist them in their search for employment.
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
| Release | Description |
|---|---|
| v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size. |
| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… |

See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Vulnerable Programming Dataset
Overview
The Vulnerable Programming Dataset is a comprehensive collection of 550 unique code vulnerabilities across 10 programming languages: Python, JavaScript, PHP, Java, Ruby, Go, TypeScript, C++, SQL, and C. Designed for cybersecurity professionals, red teamers, pentesters, and developers, this dataset highlights unconventional vulnerabilities such as insecure interprocess communication, misconfigured rate limiting, insecure dependency pinning, and logic…
See the full description on the dataset page: https://huggingface.co/datasets/darkknight25/Vulnerable_Programming_Dataset.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains code snippets from 22 different programming languages, with 100 code files per language (total 2,200 samples). Each entry includes metadata such as file name, number of lines, number of characters, a code preview, and the source URLs from GitHub repositories.
Columns Descriptor:
id: Unique identifier
language: Programming language name
file_name: Name of the file
num_lines: Number of lines in the code
num_chars: Total number of characters
code_preview: Partial preview of the code
repo_url: GitHub repository link
source_url: Raw file link from GitHub
This dataset is useful for programming language detection and related code analysis tasks.
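As a rough illustration of how the columns above might be used, the sketch below parses a few rows with those column names and tallies samples per language. The CSV layout, the sample rows, and the helper name `count_by_language` are assumptions made for illustration; the description does not state the actual file name or delimiter.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description.
COLUMNS = ["id", "language", "file_name", "num_lines", "num_chars",
           "code_preview", "repo_url", "source_url"]

# Hypothetical rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """id,language,file_name,num_lines,num_chars,code_preview,repo_url,source_url
1,Python,a.py,10,200,print('hi'),https://example.com/r1,https://example.com/r1/a.py
2,Python,b.py,20,450,import os,https://example.com/r2,https://example.com/r2/b.py
3,Go,main.go,30,700,package main,https://example.com/r3,https://example.com/r3/main.go
"""

def count_by_language(csv_text):
    """Return a Counter mapping language name to number of samples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Fail early if any column named in the description is absent.
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return Counter(row["language"] for row in reader)

counts = count_by_language(SAMPLE_CSV)
print(counts)
```

On the full dataset, a tally like this would be one way to check the stated balance of 100 files for each of the 22 languages.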
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 8,712 files across 6 programming languages, featuring verified tasks and benchmarks for evaluating coding agents and language models. It supports coding agents, language models, and developer tools with verified benchmark scores and multi-language test sets.
| Characteristic | Data |
|---|---|
| Description | An extended benchmark of real-world software engineering tasks with enhanced artifacts and broader language coverage |
| Data types | Text |
| Tasks | Bug fixing, code completion, pull request generation, automated code review |
| Total number of files | 8,712 |
| Total number of people | 30 |
| Labeling | Annotated with golden patches, test patches, post-patch reference states, and metadata stored in parquet files (e.g., repository name, issue/PR identifier, diffs, test results) |
| Programming languages | C#, Go, PHP, Rust, Kotlin, Ruby |
A table listing common programming languages used in data science, their purpose, and key capabilities.
As of 2024, JavaScript was the programming language most used by software developers worldwide in the past 12 months, reported by ** percent of the developers surveyed. Python followed at ** percent of respondents.
JavaScript and Java were some of the most tested programming languages on the DevSkiller platform as of 2024. SQL and Python ranked second and fourth, with ** percent and ** percent of respondents testing these languages in 2024, respectively. Nevertheless, the tech skill developers wanted to learn the most in 2024 was related to artificial intelligence, machine learning, and deep learning. At the same time, the fastest growing IT skills among DevSkiller customers were C/C++ and data science, while cybersecurity ranked third.

Software skills
When it came to the most used programming language among developers worldwide, JavaScript took the top spot, chosen by 62 percent of surveyed respondents. Most software developers learn how to code between 11 and 17 years old, with some of them writing their first line of code by the age of 5. Moreover, seven out of 10 developers learned how to program by accessing online resources such as videos and blogs.

Software skills pay
In 2024, the average annual software developer’s salary in the U.S. amounted to nearly ** thousand U.S. dollars, while in Germany, it totaled above ** thousand U.S. dollars. The programming languages associated with the highest salaries worldwide in 2024 were Clojure and Erlang.
SomAnon/awesome-programming-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.
The centerpiece of this repository is usage_of_programming_languages_2011-2023.csv. This CSV file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate of the popularity of programming languages, this dataset was created using three vastly different sources.
The dataset was created using the GitHub repository above. As input data, three public datasets were used.
Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It contains metadata (no code) for all GitHub repositories with more than 5 stars.
Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows the monthly share of programming-related Google searches per language from 2004 to 2023.
Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows the results of the yearly Stack Overflow developer survey from 2011 to 2023.
All these datasets were downloaded on 12.12.2023. They are all available in the GitHub repository above.
The dataset contains a column for the year followed by one column per language, giving that language's usage in percent. Additionally, vertical bar charts and pie charts for each year, plus a line graph for each language over the whole timespan, are provided as PNG files.
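Given the layout described above (one year column plus one percentage column per language), a minimal sketch for finding the most-used language in each year might look like the following. The column name `year`, the helper name `top_language_per_year`, and all percentages in the sample are assumptions for illustration; they are placeholder values, not figures from the actual CSV.

```python
import csv
import io

# Hypothetical excerpt with placeholder percentages, not real dataset values.
SAMPLE_CSV = """year,Python,Java,JavaScript
2011,10.0,20.0,15.0
2012,21.0,19.0,16.0
"""

def top_language_per_year(csv_text, year_column="year"):
    """Map each year to the language column with the highest usage share."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row.pop(year_column))
        # Skip empty cells so languages without data for a year are ignored.
        shares = {lang: float(value)
                  for lang, value in row.items()
                  if value not in (None, "")}
        result[year] = max(shares, key=shares.get)
    return result

top = top_language_per_year(SAMPLE_CSV)
print(top)
```

To run this against the real file, the in-memory string would be replaced by the contents of usage_of_programming_languages_2011-2023.csv, after checking that the year column is actually named as assumed here.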
The languages considered for the project are:
- Python
- C
- C++
- Java
- C#
- JavaScript
- PHP
- SQL
- Assembly
- Scratch
- Fortran
- Go
- Kotlin
- Delphi
- Swift
- Rust
- Ruby
- R
- COBOL
- F#
- Perl
- TypeScript
- Haskell
- Scala
This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/.
TLDR: You are free to share, adapt, and create derivative works from this dataset as long as you attribute me, keep the database open (if you redistribute it), and continue to share-alike any adapted database under the ODbL.
Thanks go out to
- Stack Overflow, https://insights.stackoverflow.com/survey, for providing the data from the yearly Stack Overflow developer survey.
- the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master, for providing Google search data.
- Peter Elmers, for crawling metadata on GitHub repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.
https://choosealicense.com/licenses/other/
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
https://www.technavio.com/content/privacy-notice
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Python Coding Dataset
This dataset contains high-quality Python function examples designed for fine-tuning coding-focused language models. It includes carefully curated samples covering common programming tasks, bug fixes, refactoring, and code completions.
Dataset Details
Number of samples: 732 (and growing)
Purpose: Fine-tuning LLMs to generate accurate and idiomatic Python code
Content: Functions, bug fixes, refactors, completions
License: MIT License…
See the full description on the dataset page: https://huggingface.co/datasets/Hoglet-33/python-coding-dataset.
https://www.instarank.com/terms-conditions
Comprehensive programming language and framework data including use cases, job market demand, learning difficulty, and comparisons. Perfect for developer blogs, coding bootcamp pages, and tech comparison content.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A huge dataset of 500+ programming questions with step-by-step AI-generated solutions across multiple domains. Each entry includes:
✅ Question (Problem statement)
✅ Difficulty Level (Easy, Medium, Hard)
✅ Programming Language (Python, Java, C++, etc.)
✅ AI-Generated Solution (Optimized code)
✅ Time Complexity & Explanation
✅ Topic Tag (DP, Graphs, OOP, etc.)
🔹 Ideal for developers, coding interview prep, and machine learning models for auto code generation!
JavaScript was the most frequently used coding language in Russia, used by around ********** of the surveyed software companies in 2024. Furthermore, over ******** of the companies reported using Python and Java.
https://www.marketresearchforecast.com/privacy-policy
Discover the booming market for online programming language learning platforms! This in-depth analysis reveals market size, growth projections (CAGR 15%), key players (Coursera, Udemy, Udacity), and future trends, helping you understand this rapidly expanding sector.
The most demanded programming languages by recruiters in 2025 were Python, JavaScript, and Java, with around ** percent of recruiters looking to hire people with these programming skills.