CodeContests is a competitive programming dataset for machine-learning. This dataset was used when training AlphaCode.
It consists of programming problems, from a variety of sources.
Problems include test cases in the form of paired inputs and outputs, as well as both correct and incorrect human solutions in a variety of languages.
As of 2022, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with 63.6 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world.
Programming languages
At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find that these are among the most highly desirable data science skills and are likely to help in their search for employment.
This repository contains programming data collected from 15 students during November and December of 2019 at Bielefeld University. Students were asked to implement gradient descent. Note that this data set contains only source code snapshots and neither timestamps nor personal information. All students programmed in a web environment, which is also contained in this repository.
The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.
The Automated Programming Progress Standard, abbreviated APPS, consists of 10,000 coding problems in total, with 131,836 test cases for checking solutions and 232,444 ground-truth solutions written by humans. Problems can be complicated, as the average length of a problem is 293.2 words. The data are split evenly into training and test sets, with 5,000 problems each. In the test set, every problem has multiple test cases, and the average number of test cases is 21.2. Each test case is specifically designed for the corresponding problem, enabling us to rigorously evaluate program functionality.
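The idea of checking a solution against paired inputs and outputs can be sketched in Python as follows. This is only a minimal illustration, not the evaluation harness used by APPS or CodeContests: the function names, the sample problem, and the whitespace-insensitive comparison are all assumptions made for the example, and running untrusted code in-process like this is not safe outside a sandbox.

```python
import contextlib
import io
import sys


def run_solution(source, stdin_text):
    """Execute Python source with the given stdin text and return what it printed."""
    captured = io.StringIO()
    original_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    try:
        with contextlib.redirect_stdout(captured):
            # Assumption: the solution is trusted; real harnesses isolate this step.
            exec(source, {"__name__": "__main__"})
    finally:
        sys.stdin = original_stdin
    return captured.getvalue()


def passes_all_tests(source, test_cases):
    """Return True only if every (input, expected output) pair is reproduced."""
    for stdin_text, expected in test_cases:
        try:
            actual = run_solution(source, stdin_text)
        except Exception:
            return False  # a crash counts as a failed test
        # Compare whitespace-separated tokens so trailing newlines do not matter.
        if actual.split() != expected.split():
            return False
    return True


# Hypothetical problem: read two integers and print their sum.
tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
correct = "a, b = map(int, input().split())\nprint(a + b)"
incorrect = "a, b = map(int, input().split())\nprint(a - b)"

print(passes_all_tests(correct, tests))
print(passes_all_tests(incorrect, tests))
```

A solution counts as correct here only if it passes every test case, which is why a larger number of test cases per problem gives a stricter check.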
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for CodeContests
Dataset Summary
CodeContests is a competitive programming dataset for machine-learning. This dataset was used when training AlphaCode. It consists of programming problems, from a variety of sources:
Site        URL                          Source
Aizu        https://judge.u-aizu.ac.jp   CodeNet
AtCoder     https://atcoder.jp           CodeNet
CodeChef    https://www.codechef.com     description2code
Codeforces  https://codeforces.com       description2code and Codeforces
HackerEarth… See the full description on the dataset page: https://huggingface.co/datasets/deepmind/code_contests.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files accompany the book An Introduction to Programming Languages: Simultaneous Learning in Multiple Coding Environments, an introductory textbook covering several computer languages. It describes the most well-known and popular programming environments: C#, C++, Java, JavaScript, PERL, PHP, Python, Ruby, and Visual Basic (VB) or Visual Basic for Applications (VBA). The main objective of the guide is to provide code examples mirrored across these nine computer languages, so that readers can understand the connection and universality between the syntax of different environments and become adept at translating code. This learning experience can be ideal for upper-undergraduate introductory courses, researchers, doctoral students, and sociologists or engineers charged with implementing data analysis. Graphical illustrations accompany the technical details of the computation examples to aid an in-depth understanding of their inner workings. Moreover, the book contains original material that has been class-tested by the author, and numerous cases are examined. Readers will also benefit from the inclusion of: a) Historical and philosophical perspectives on the past, present and future of computer languages. b) A total of 448 additional files freely available online, of which 44 are poster presentations (i.e. PowerPoint and PDF files). c) A total of 404 code examples mirrored in nine computer languages, namely: C#, C++, Java, JavaScript, PERL, PHP, Python, Ruby and VB. The work begins with a general introduction to history and presents the natural, inevitable pathway from mechanical automatons to present-day electronic computers. Following this historical introduction, it takes a detailed look at philosophical questions, implementations, entropy and life. More often than not, younger generations are genuinely amazed by the advancement of computer technology.
Historical events that led to the development of technologies have been distilled down to their essence. However, the essence of any story comes at the cost of a massive loss of detail, and the essence of essences even more so. Over time, this lack of detail leads to a collective amnesia that can prevent us from understanding how naturally technology has evolved. New constructs are always built upon older constructs to fit the evolutionary chain of technological progress, which boils down to the same fundamental rules as biological evolution. In its first part, the book discusses the natural path of programming constructs, starting from time immemorial and ending with examples from the present. In the end, naturally driven constructs of all kinds also drive our society today. In the second part, the emphasis is on the technical side, where nine computer languages are used simultaneously for mirrored examples. Simultaneous learning of multiple computer languages can be regarded as an asset in the world of science and technology, as the reader becomes familiar with the majority of known programming or scripting languages. Moreover, a basic knowledge of software implementation in several computer languages, even at an introductory level, helps the reader stay versatile and adapt to new situations that may arise in industry, education, or research. Thus, this work is meant to bring a more concrete understanding of the similarities and differences between computer languages.
Paul A. Gagniuc. An Introduction to Programming Languages: Simultaneous Learning in Multiple Coding Environments. Synthesis Lectures on Computer Science. Springer International Publishing, 2023, pp. 1-280.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently displayed promising results in Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, our focus is on studying the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a corpus of code written in multiple languages achieves higher performance than using a corpus of code written in just one programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with the others, e.g., Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks: Code Summarization and Code Search, 2) the strategy (to select programming languages) that works well for fine-tuning multilingual PLMs for Ruby, and 3) the performance of the fine-tuned PLMs on Ruby given different code lengths. Here, we bin the Ruby code based on its number of tokens; understanding the performance on different code lengths will enable developers to make more informed decisions on the use of PLMs based on their code.
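Binning code by its number of tokens can be sketched as below. This is an illustrative sketch only: the bin boundaries (10 and 50 tokens) and the whitespace-based tokenizer are assumptions, since the description does not state which tokenizer or bin edges the study used.

```python
from collections import defaultdict


def token_count(code):
    # Assumption: whitespace splitting stands in for a real code tokenizer.
    return len(code.split())


def bin_by_length(snippets, upper_bounds=(10, 50)):
    """Group snippets into length bins given inclusive, ascending upper bounds."""
    bins = defaultdict(list)
    for code in snippets:
        n = token_count(code)
        label = f">{upper_bounds[-1]}"  # default: longer than the largest bound
        for bound in upper_bounds:
            if n <= bound:
                label = f"<={bound}"
                break
        bins[label].append(code)
    return dict(bins)


# Hypothetical Ruby snippets of different lengths.
ruby_snippets = [
    "puts 'hi'",
    "def add(a, b)\n  a + b\nend",
    " ".join(["x"] * 60),
]

binned = bin_by_length(ruby_snippets)
for label, items in binned.items():
    print(label, len(items))
```

Per-bin scores can then be computed by evaluating a model separately on each group.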
This dataset, containing the PLMs and their fine-tuned models (there are over a hundred trained and fine-tuned models), was generated by the researchers at the University of British Columbia, Singapore Management University and JetBrains.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information on over 4000 programming languages, including facts such as the year each language was created, its rank, and other parameters that you can discover by exploring the dataset.
Credits. https://github.com/breck7/pldb
Python Programming Puzzles (P3) is an open-source dataset where each puzzle is defined by a short Python program, and the goal is to find an input which makes the program output "True". The puzzles are objective in that each one is specified entirely by the source code of its verifier, so running the verifier is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding.
The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping.
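A puzzle in this style can be sketched as a verifier function plus a candidate input. The puzzle below is invented for illustration and is not taken from the P3 dataset; the function names are likewise assumptions.

```python
def puzzle(s: str) -> bool:
    """Verifier: accept a 5-character string that contains 'abc' and ends in 'z'."""
    return len(s) == 5 and "abc" in s and s.endswith("z")


def candidate_solution() -> str:
    """A proposed input; any string accepted by the verifier would do."""
    return "abcyz"


def is_solved(verifier, solver) -> bool:
    # Testing a candidate needs nothing beyond running the verifier on it.
    return verifier(solver()) is True


print(is_solved(puzzle, candidate_solution))
```

Because the verifier alone decides correctness, a system that proposes inputs can check its own attempts without labeled answers, which is what makes self-supervised bootstrapping possible.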
As of 2022, JavaScript was the most popular programming language among software developers worldwide, with 65 percent of the developers surveyed having used it in the previous 12 months. Four percent of software developers were also planning to adopt or migrate to JavaScript.
About this webinar
Programming is becoming more and more popular, with many researchers using programming to perform data cleaning, data manipulation, and data analytics, as well as to create publication-quality plots. Programming can be really beneficial for automating processes and workflows. In this webinar, we are exploring four of the most popular programming languages that are widely used in academia, namely Python, R, MATLAB, and Julia.
Webinar Topics
Why use Programming
An overview of Python, R, MATLAB, and Julia
Code comparison of the four programming languages
Popularity and job opportunities
Intersect’s comparison
General guidelines on how to choose the best programming language for your research
Licence
Copyright © 2021 Intersect Australia Ltd. All rights reserved.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Competitive programming is a challenging task that demands proficiency in computer science concepts and strong problem-solving skills.
A significant limitation in the field of competitive programming, in the context of machine learning, is the lack of available datasets that include the problem statement, the editorial, and the source code for research purposes. This limitation hinders the development of new algorithms and techniques to improve the efficiency and accuracy of selecting or creating suitable editorials for given problems.
To address this problem, we have introduced a comprehensive series of 1550 competitive programming problems that encompass both editorial solutions and source code.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Software development is a continuous decision-making process that mainly relies on the software engineer's experience and intuition. One of the essential decisions in the early stages of the process is selecting the best fitting programming language based on the project requirements. A significant number of criteria, such as developer availability and consistent documentation, besides potential programming languages in the market, lead to a challenging decision-making process. A decision model is required to analyze the selection problem using systematic identification and evaluation of potential alternatives for a development project. Method: Recently, we introduced a framework to build decision models for technology selection problems in software production. Furthermore, we designed and implemented a decision support system that uses such decision models to support software engineers with their decision-making problems. This study presents a decision model based on the framework for the programming language selection problem. Results: The decision model has been evaluated through seven real-world case studies at seven software development companies. The case study participants declared that the approach provides significantly more insight into the programming language selection process and decreases the decision-making process's time and cost. Conclusion: With the knowledge available through the decision model, software engineers can more rapidly evaluate programming languages. Having this knowledge readily available supports software engineers in making more efficient and effective decisions that meet their requirements and priorities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A large dataset that contains the eye movements of N=216 programmers of different experience levels, captured during two code comprehension tasks, is presented. Data are grouped in terms of programming expertise (from none to high) and other demographic descriptors. Data were collected through an international collaborative effort that involved eleven research teams across eight countries on four continents. The same eye tracking apparatus and software were used for the data collection. The Eye Movements in Programming (EMIP) dataset is freely available for download. The varied metadata in the EMIP dataset provides fertile ground for the analysis of gaze behavior and may be used to gain novel insights about code comprehension.
Bednarik, Roman, et al. "EMIP: The eye movements in programming dataset." Science of Computer Programming 198 (2020): 102520.
The community behind R is built by inspired scientists who share their tools and knowledge freely to encourage equal access for all aspiring researchers and to champion academic integrity. The tools available through R aid in every step of data analysis, including creating experiments, cataloging and organizing data, analyzing the results, and visualizing findings, all in one software environment. The power of programming also increases the flexibility and automation of these tasks, saving an abundance of time and ensuring each step can be accurately reproduced. Often, courses that use R to demonstrate statistical concepts face the dual challenge of introducing two distinct and equally intricate topics at once: programming and statistics. In most cases, the focus must be shifted away from programming because of constraints on time and breadth, to the potential confusion and dismay (repeated appearance of error messages) of novice learners in statistics. This workshop aims to provide a solid foundation of programming concepts so that attendees can confidently approach more advanced statistical courses or independently improve their statistical skills. Many of the ideas covered apply to other programming languages as well, even though R is the main tool. Online recordings. Part 1: https://youtu.be/3zUkPvYTePo Part 2: https://youtu.be/Knjbu6JwNI0
CodeQA is a free-form question answering dataset for the purpose of source code comprehension: given a code snippet and a question, a textual answer is required to be generated. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs.
Description from: CodeQA: A Question Answering Dataset for Source Code Comprehension
Programming Software Market size was valued at USD 30.9 Billion in 2024 and is projected to reach USD 147.8 Billion by 2031, growing at a CAGR of 23.4% during the forecast period 2024-2031.
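The relationship between a start value, an end value, and a compound annual growth rate (CAGR) can be sketched as below. The choice of 7 compounding periods (2024 to 2031) is an assumption; the report may use a different base year or rounding, so the quoted figures need not reproduce exactly under this convention.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by a start value, end value, and period count."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


def project(start_value, rate, periods):
    """Value reached after compounding start_value at a fixed rate for the given periods."""
    return start_value * (1.0 + rate) ** periods


# Figures quoted in the text (USD billion); the period count of 7 is an assumption.
implied_rate = cagr(30.9, 147.8, 7)
projected_value = project(30.9, 0.234, 7)

print(f"Implied CAGR over 7 periods: {implied_rate:.1%}")
print(f"Value after 7 periods at 23.4%: {projected_value:.1f}")
```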
Global Programming Software Market Drivers
Technological Innovation: The market for programming software is primarily driven by technological advancements. The landscape is always changing due to advancements in programming languages, frameworks, and tools, which empower developers to produce increasingly complex and effective software solutions. The need for programming software that supports cutting-edge technologies like cloud computing, AI, machine learning, and the Internet of Things (IoT) is increasing.
Growing Need for Customized Solutions: Companies in a variety of sectors are depending more and more on software solutions made to meet their unique requirements. This drives the need for programming tools that make it possible for developers to quickly and effectively create highly customized apps. The market is becoming more and more competitive, and this is driving up demand for programming tools that are both versatile and scalable.
Move Towards Open Source Software: Due to its affordability, adaptability, and collaborative nature, open source software has seen a sharp increase in popularity in recent years. Because of its accessibility and active community support, open source programming software is preferred by many developers and organizations. As a result, open source tools and frameworks are becoming more popular in the programming software market.
Adoption of DevOps Practices: The use of DevOps principles, which prioritize cooperation between development and operations teams to expedite software delivery, is on the rise. These practices are being embraced by enterprises looking to increase their efficiency and agility. Programming software that enables smooth integration, automation, and continuous delivery inside the DevOps pipeline is in high demand due to this trend.
A Growing Focus on Security: As a result of the increase in cyberattacks and data leaks, security is now the top priority for businesses creating software solutions. Because of this, there is an increasing need for programming tools that support safe coding techniques and have strong security features. Programming frameworks and tools with a security focus are necessary to fix vulnerabilities and guarantee the integrity of software programs.
Transition to No-Code/Low-Code Development: Because low-code/no-code development platforms make it possible for users with different degrees of technical expertise to construct apps quickly, they are democratizing software development. The demand for increased agility, lower development costs, and a quicker time to market is what’s driving this trend. Consequently, low-code/no-code tools are becoming more and more popular in the programming software market alongside conventional programming languages and frameworks.
Industry-Specific Requirements: The selection of programming software is influenced by the particular requirements and regulatory norms of various industries. Sectors such as finance, healthcare, and automotive, which have strict compliance requirements, need programming tools that make it easier to meet industry-specific standards and regulations.
Global Economic Variables: The market for programming software is also impacted by economic variables like GDP growth, investment trends, and geopolitical developments. While economic expansion can lead to higher investment in software development activities, economic downturns may result in reduced IT budgets and slower adoption of new technology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains over 97,000,000 snippets of code from various GitHub repositories with more than 10,000 stars.
The repositories included in this dataset were the results of searching for repositories with greater than 10,000 stars for each of the following languages:
Bash
C
C++
CSV
DOTFILE
Go
HTML
JSON
Java
JavaScript
Jupyter
Markdown
PowerShell
Python
Ruby
Rust
Shell
TSV
Text
UNKNOWN
YAML
For each repository, we created snippets from the default branch by going through each text file and extracting 5-line chunks of text every 5 lines.
We used file extensions to associate snippets with the programming language they most likely represent. For snippets for which we could not infer the language from the file extension, we use the value UNKNOWN in the language column.
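The snippet extraction and language inference described above can be sketched roughly as follows. This is not the dataset's actual pipeline: the extension mapping is a small hypothetical subset, and keeping a final chunk shorter than 5 lines is an assumption the text does not confirm.

```python
import os

# Hypothetical mapping for illustration; the dataset's real mapping is not given in the text.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".rs": "Rust",
    ".go": "Go",
    ".rb": "Ruby",
}


def infer_language(path):
    """Guess the language from the file extension, falling back to UNKNOWN."""
    _, extension = os.path.splitext(path)
    return EXTENSION_TO_LANGUAGE.get(extension.lower(), "UNKNOWN")


def make_snippets(text, chunk_size=5):
    """Split text into consecutive, non-overlapping chunks of chunk_size lines."""
    lines = text.splitlines()
    # Assumption: a trailing chunk with fewer than chunk_size lines is kept.
    return [
        "\n".join(lines[start:start + chunk_size])
        for start in range(0, len(lines), chunk_size)
    ]


sample_text = "\n".join(f"line {i}" for i in range(1, 13))  # 12 lines
snippets = make_snippets(sample_text)

print(len(snippets))
print(infer_language("src/main.rs"), infer_language("Makefile"))
```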
This dataset does not contain code from any GitHub repository without a license. The following is the list of possible licenses a snippet can be associated with:
AGPL-3.0
Apache-2.0
BSD-2-Clause
BSD-3-Clause
BSL-1.0
CC-BY-4.0
CC-BY-SA-4.0
CC0-1.0
GPL-2.0
GPL-3.0
ISC
LGPL-2.1
LGPL-3.0
MIT
MPL-2.0
MS-PL
NOASSERTION
OFL-1.1
Unlicense
WTFPL
Zlib
These are SPDX License Identifiers.
Note that Unlicense refers to the Unlicense. It does not mean that the snippet is unlicensed.
This dataset is built and maintained by Bugout.dev. To report an issue with the data, or to request changes in future versions of the dataset, please open a discussion thread.
As this dataset can be difficult to work with in Kaggle notebooks, we have also made a smaller version of the dataset available: GitHub Code Snippets - Development sample.