95 datasets found
  1. Developers population worldwide 2018-2024

    • statista.com
    Updated Nov 26, 2024
    Cite
    Statista (2024). Developers population worldwide 2018-2024 [Dataset]. https://www.statista.com/statistics/627312/worldwide-developer-population/
    Explore at:
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global developer population is expected to reach 28.7 million people by 2024, an increase of 3.2 million from the number seen in 2020. According to the source, much of this growth is expected to occur in China, where the growth rate is between six and eight percent heading up to 2023.

    How much do software developers earn in the U.S.? Software developers work within a wide array of specialties, honing their skills in different programming languages, techniques, or disciplines such as design. The average salary of U.S.-based designers working in software development reached 108 thousand U.S. dollars as of June 2021, while this figure climbs to 165 thousand U.S. dollars for engineering managers. Salaries are highly dependent on location, however, with an entry-level developer working in the San Francisco Bay Area earning on average 44.79 percent more than their counterparts starting out in Austin.

    JavaScript and HTML/CSS still the most widely used languages. While programming languages continue to emerge or fall out of favor, JavaScript and HTML/CSS are mainstays of the coding landscape. In a global survey of software developers, over 60 percent of respondents reported using JavaScript and HTML/CSS. SQL, Python, and Java rounded out the top five.

  2. Most used programming languages among developers worldwide 2024

    • statista.com
    Updated Feb 6, 2025
    Cite
    Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Explore at:
    Dataset updated
    Feb 6, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 19, 2024 - Jun 20, 2024
    Area covered
    Worldwide
    Description

    As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages.

    Programming languages. At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today's society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge among the most highly desirable data science skills, which will likely assist in their search for employment.

  3. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    json, csv (available download formats)
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible, privacy-aware and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We cover six programming languages: Java, Python, Go, JavaScript, PHP, and Ruby. The commit messages in natural language are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were generated using extensive quality-focused filtering techniques (e.g., excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with …

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to cover several programming languages so that differences in the task caused by a language's degree of hardware-relatedness are represented. The dataset is provided as a large CSV file containing all samples, with the following fields: Diff, Commit Message, Hash, Project, Split.
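
    For orientation, a minimal loading sketch in Python follows; the file name commitbench.csv is a placeholder and the split label value is an assumption, not a documented constant.

        import pandas as pd

        # Load the full CommitBench CSV (placeholder file name) and inspect the provided fields.
        df = pd.read_csv("commitbench.csv")
        print(df.columns.tolist())          # expected: Diff, Commit Message, Hash, Project, Split
        train = df[df["Split"] == "train"]  # split label value is an assumption
        print(train[["Project", "Commit Message"]].head())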

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

    Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. The number of samples per programming language is shown in the table below:

    Language      Number of Samples
    Java          153,119
    Ruby          233,710
    Go            137,998
    JavaScript    373,598
    Python        472,469
    PHP           294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

    While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down and someone with a software project in the dataset deletes their repository, there can be instances that are non-reproducible.

    LIMITATIONS

    While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples; some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more) as well as our focus on English commit messages. There might be some people who only write commit messages in their native languages, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax better aligns with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

    This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman and can be found at https://techpolicylab.uw.edu/data-statements/; it was updated from the community Version 1 Markdown template by Leon Derczynski.

  4. Global sought-after database skills for developers 2021

    • statista.com
    Updated Nov 22, 2023
    Cite
    Statista (2023). Global sought-after database skills for developers 2021 [Dataset]. https://www.statista.com/statistics/793854/worldwide-developer-survey-most-wanted-database/
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 25, 2021 - Jun 15, 2021
    Area covered
    Worldwide
    Description

    According to the survey, just under 18 percent of respondents identified PostgreSQL as one of the most-wanted database skills. MongoDB ranked second, with 17.89 percent of respondents stating that they are not currently developing with it but want to.

  5. VIIRS Nighttime Day/Night Band Composites Version 1

    • developers.google.com
    Updated May 31, 2017
    + more versions
    Cite
    Earth Observation Group, Payne Institute for Public Policy, Colorado School of Mines (2017). VIIRS Nighttime Day/Night Band Composites Version 1 [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/NOAA_VIIRS_DNB_MONTHLY_V1_VCMCFG
    Explore at:
    Dataset updated
    May 31, 2017
    Dataset provided by
    Earth Observation Group, Payne Institute for Public Policy, Colorado School of Mines
    Time period covered
    Apr 1, 2012 - Mar 1, 2025
    Area covered
    Description

    Monthly average radiance composite images using nighttime data from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB). As these data are composited monthly, there are many areas of the globe where it is impossible to get good quality data coverage for that month. This can be due to …
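
    As an illustration of how this collection can be queried (not part of the dataset description), here is a sketch using the Earth Engine Python API; the asset id mirrors the catalog URL above, and the 'avg_rad' band name is an assumption to verify against the catalog entry.

        import ee

        ee.Initialize()
        # Monthly DNB composites; asset id taken from the catalog URL above.
        dnb = (ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMCFG")
                 .filterDate("2023-01-01", "2024-01-01")
                 .select("avg_rad"))  # assumed band name for average radiance
        point = ee.Geometry.Point(139.69, 35.69)  # example location
        print(dnb.mean().reduceRegion(ee.Reducer.mean(), point, 500).getInfo())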

  6. Coding Interview Platform Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Coding Interview Platform Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/coding-interview-platform-market
    Explore at:
    pptx, csv, pdf (available download formats)
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Coding Interview Platform Market Outlook



    The global coding interview platform market size was valued at approximately USD 500 million in 2023 and is expected to reach USD 1.8 billion by 2032, growing at a CAGR of 15% from 2024 to 2032. The significant growth factors driving this market include the increasing demand for skilled software developers and the rise of remote hiring practices, which have created the need for robust, scalable, and efficient coding interview platforms.



    One of the prominent growth factors in the coding interview platform market is the ever-increasing demand for software developers across various industries. As technology continues to evolve and permeate every aspect of business operations, the need for proficient coders has surged. Companies are investing heavily in recruiting top-tier coding talent to stay competitive in the digital age. This demand creates a substantial market for platforms that offer coding interview services, enabling employers to assess and hire the best talent efficiently.



    Another crucial driver is the transformation of the hiring process due to the COVID-19 pandemic, which has accelerated the adoption of remote work and virtual hiring practices. Organizations are now more inclined to use coding interview platforms to evaluate candidates remotely. These platforms provide an effective way to conduct technical assessments, reducing the need for in-person interviews and allowing companies to expand their talent pool beyond geographical constraints. This shift towards remote hiring is expected to have a lasting impact on the demand for coding interview platforms.



    Additionally, the integration of advanced technologies such as artificial intelligence and machine learning in coding interview platforms is enhancing their capabilities and making them more attractive to employers. These technologies enable the automation of candidate evaluation processes, provide deeper insights into coding skills, and offer personalized feedback to candidates. As a result, coding interview platforms are becoming more sophisticated and efficient, further driving their adoption across various industries.



    From a regional perspective, North America holds the largest share in the coding interview platform market. This dominance can be attributed to the presence of a significant number of tech giants and startups in the region, which are constantly on the lookout for skilled coders. Additionally, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The burgeoning tech industry in countries like India and China, coupled with a growing number of coding bootcamps and educational institutions, is contributing to the increasing demand for coding interview platforms in the region.



    Product Type Analysis



    The coding interview platform market can be segmented by product type into web-based and mobile-based platforms. Web-based coding interview platforms dominate the market, primarily due to their ease of accessibility and comprehensive features. These platforms are widely used by enterprises and individuals alike, thanks to their ability to offer real-time coding assessments, video interviews, and collaborative coding environments. With continuous advancements in web technologies, these platforms are becoming increasingly robust, scalable, and user-friendly, making them a preferred choice for many.



    Mobile-based coding interview platforms, on the other hand, are gaining traction, especially among younger, tech-savvy candidates and startups that emphasize flexibility and mobility. These platforms allow users to participate in coding interviews from their smartphones or tablets, offering a convenient alternative to traditional web-based platforms. The rise of mobile app development and the proliferation of smartphones are key factors driving the growth of mobile-based coding interview platforms. While not as comprehensive as their web-based counterparts, mobile-based platforms often offer streamlined, user-friendly interfaces tailored to on-the-go use.



    Both web-based and mobile-based platforms are continually evolving to meet the diverse needs of their users. For instance, web-based platforms are integrating advanced analytics and reporting features, allowing recruiters to make data-driven hiring decisions. Similarly, mobile-based platforms are incorporating features such as offline coding capabilities and push notifications to enhance user engagement and experience. The competition between these two product types is fostering innovation, resulting in more sophisticated

  7. Global AI Code Tools Market Report 2025 Edition, Market Size, Share, CAGR,...

    • cognitivemarketresearch.com
    pdf, excel, csv, ppt
    Cite
    Cognitive Market Research, Global AI Code Tools Market Report 2025 Edition, Market Size, Share, CAGR, Forecast, Revenue [Dataset]. https://www.cognitivemarketresearch.com/ai-code-tools-market-report
    Explore at:
    pdf, excel, csv, ppt (available download formats)
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global AI Code Tools Market size is USD 4.0 billion in 2024 and will expand at a compound annual growth rate (CAGR) of 22.6% from 2024 to 2031.

    Market Dynamics of AI Code Tools Market

    Key Drivers for AI Code Tools Market

    Need to Help Developers with Tough Coding Tasks - AI code tools are invaluable to software developers when dealing with complex coding assignments, and this has emerged as a powerful motivator for their proliferation in the software development scene. One key component of this support is the capacity of AI coding tools to simplify code transition, which is especially useful when dealing with legacy source code or different programming languages. A study presented at the 2021 International Conference on Intelligent User Interfaces described how generative AI provided developers with a skeletal framework for translating source code into Python. A 2022 study published in the Proceedings of the Association for Computing Machinery on Programming Languages (PACMPL) found that various tools, such as GitHub Copilot, sped up coding by providing end-of-line suggestions for function calls and argument completions.

    Increasing Adoption of Low-Code/No-Code Platforms

    Key Restraints for AI Code Tools Market

    - Complex and specialized applications
    - Overreliance on AI code tools can hamper problem-solving abilities

    Introduction of the AI Code Tools Market

    Machine learning and artificial intelligence-powered code tools are revolutionizing software development by boosting productivity and streamlining workflows. These tools aim to automate, optimize, and streamline many parts of software engineering, hence enhancing developer efficiency and accessibility. AI coding tools offer a variety of capabilities and functions. They can give developers intelligent code suggestions, allowing them to write faster and with fewer mistakes. They examine the code context and suggest suitable code snippets, function names, and variable names. In addition, growing investment in AI code tools companies is moving the AI code tools sector forward rapidly. This funding enables startups to innovate, create cutting-edge technologies, and improve their existing capabilities. With appropriate funding, these companies can expand their research, speed up product development, and provide more advanced solutions to developers. However, these technologies usually demand access to sensitive codebases and private information, which raises data and intellectual property security risks. AI code tools must adhere to strong data privacy guidelines and be safeguarded against unauthorized access. To secure their intellectual assets, developers and organizations demand strong encryption mechanisms and access controls. Therefore, data privacy and security must be considered while adopting AI code tools.

  8. Data from: CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java...

    • zenodo.org
    application/gzip, bin
    Updated Apr 28, 2025
    Cite
    Kaihang Jiang; Jin Bihui; Nie Pengyu (2025). CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories [Dataset]. http://doi.org/10.5281/zenodo.15293313
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaihang Jiang; Jin Bihui; Nie Pengyu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
    In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.

  9. Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-1C (TOA)

    • developers.google.com
    Updated Feb 15, 2024
    + more versions
    Cite
    European Union/ESA/Copernicus (2024). Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-1C (TOA) [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_HARMONIZED
    Explore at:
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    European Space Agency (http://www.esa.int/)
    Time period covered
    Jun 27, 2015 - Jul 31, 2025
    Area covered
    Description

    After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes.

    Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas. The Sentinel-2 data contain 13 UINT16 spectral bands representing TOA reflectance scaled by 10000. See the Sentinel-2 User Handbook for details.

    QA60 is a bitmask band that contained rasterized cloud mask polygons until Feb 2022, when these polygons stopped being produced. Starting in February 2024, legacy-consistent QA60 bands are constructed from the MSK_CLASSI cloud classification bands. For more details, see the full explanation of how cloud masks are computed.

    Each Sentinel-2 product (zip archive) may contain multiple granules, and each granule becomes a separate Earth Engine asset. EE asset ids for Sentinel-2 assets have the following format: COPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the first numeric part represents the sensing date and time, the second numeric part represents the product generation date and time, and the final 6-character string is a unique granule identifier indicating its UTM grid reference (see MGRS).

    The Level-2 data produced by ESA can be found in the collection COPERNICUS/S2_SR. For datasets to assist with cloud and/or cloud shadow detection, see COPERNICUS/S2_CLOUD_PROBABILITY and GOOGLE/CLOUD_SCORE_PLUS/V1/S2_HARMONIZED. For more details on Sentinel-2 radiometric resolution, see this page.
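
    A hedged usage sketch with the Earth Engine Python API follows; the QA60 bit positions (10 for opaque clouds, 11 for cirrus) follow the commonly documented convention and should be checked against the band description, and the example location is arbitrary.

        import ee

        ee.Initialize()

        def mask_s2_clouds(image):
            # QA60: bit 10 = opaque clouds, bit 11 = cirrus (assumed per the usual convention).
            qa = image.select("QA60")
            clear = qa.bitwiseAnd(1 << 10).eq(0).And(qa.bitwiseAnd(1 << 11).eq(0))
            return image.updateMask(clear).divide(10000)  # DN -> TOA reflectance (scale stated above)

        s2 = (ee.ImageCollection("COPERNICUS/S2_HARMONIZED")
                .filterDate("2024-06-01", "2024-07-01")
                .filterBounds(ee.Geometry.Point(7.45, 46.95))  # example location
                .map(mask_s2_clouds))
        print(s2.size().getInfo())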

  10. LAIL

    • figshare.com
    zip
    Updated Jul 30, 2024
    Cite
    Jia Li (2024). LAIL [Dataset]. http://doi.org/10.6084/m9.figshare.22014596.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAIL is a Large language model-Aware selection approach for In-Context-Learning-based code generation. LAIL uses LLMs themselves to select examples: it asks the LLM to label a candidate example as a positive or a negative example for a given requirement.

    Requirements
    - openai
    - tqdm
    - java

    We also provide a script (/Evaluation/evaluation_setup.sh) to help set up the programming language dependencies used in evaluation:

        bash evaluation_setup.sh

    Dataset

    The datasets contain DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories, and it aligns with real-world repositories in multiple dimensions. We therefore take DevEval as the example to demonstrate how to process the dataset (see ../Dataset/DevEval).

    train.jsonl and test.jsonl:
    (1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain.
    (2) We randomly split the tasks of the two domains into a training set and a test set, which yields 101 examples in the training set and 49 examples in the test set.
    (3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all functions of the repository.
    (4) We treat the functions contained in the repository as the candidate pool. LAIL and the baselines then retrieve a few functions from the candidate pool as demonstration examples.

    The source data and test_source data folders consist of the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts used to estimate candidate examples. The generation_prompt folder contains the constructed prompts where the demonstration examples are selected by LAIL and the different baselines. For example:
    (1) The ICL_LAIL folder provides the selected examples' ids in LAIL_id. Developers can directly use these provided prompts through codellama_completion.py to generate programs.
    (2) After generating programs, developers need to process them with process_generation.py.
    (3) Finally, developers evaluate the generated programs with the source code in the Evaluation folder.

    LAIL

    Estimate candidate examples by LLMs themselves
    We leverage LLMs themselves to estimate candidate examples. The code is stored in the LAIL/estimate_examples package. Take DevEval as an example:
    (1) The /Dataset/DevEval/estimate_prompt folder contains the constructed prompts to estimate candidate examples.
    (2) Developers run the following command to estimate candidate examples with CodeLlama-7B:

        bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt

    (3) According to the probability feedback of the LLMs, we acquire the positive and negative examples.

    Train a neural retriever
    (1) We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder:

        export CUDA_VISIBLE_DEVICES=0
        nohup python run.py \
            --output_dir=/saved_models \
            --model_type=roberta \
            --config_name=microsoft/graphcodebert-base \
            --model_name_or_path=microsoft/graphcodebert-base \
            --tokenizer_name=microsoft/graphcodebert-base \
            --do_train \
            --train_data_file=/id.jsonl \
            --epoch 100 \
            --block_size 128 \
            --train_batch_size 16 \
            --learning_rate 1e-4 \
            --max_grad_norm 1.0 \
            --seed 123456 > mbpp.txt 2>&1 &

    Select a few demonstration examples using the trained retriever
    (2) Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the /LAIL/LAIL/retriever/train folder:

        bash run_inference.sh ../Dataset/DevEval

    Code Generation
    (1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into LLMs and acquire the desired programs. For example, developers use CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py) to generate programs:

        export CUDA_VISIBLE_DEVICES=0
        torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False

    (2) After generating programs, developers need to process them with ../LAIL/ICL_LAIL/process_generation.py:

        python process_generation.py

    Baselines
    This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation.
    (1) The source code is in the baselines folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running:

        python baselines.py

    (2) Then, developers use /baselines/make_prompt.py to construct a prompt context from the selected candidate examples:

        python make_prompt.py ICLCoder ICLCoder -1

    Evaluation
    In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in LAIL/Evaluation. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the provided pipeline in /LAIL/Evaluation/ to evaluate the different approaches.

    Citation
    If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.

  11. Data from: Google Earth Engine (GEE)

    • data.amerigeoss.org
    esri rest, html
    Updated Nov 28, 2018
    + more versions
    Cite
    AmeriGEO ArcGIS (2018). Google Earth Engine (GEE) [Dataset]. https://data.amerigeoss.org/dataset/google-earth-engine-gee
    Explore at:
    esri rest, html (available download formats)
    Dataset updated
    Nov 28, 2018
    Dataset provided by
    AmeriGEO ArcGIS
    Description

    Meet Earth Engine

    Google Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth's surface.

    Satellite imagery + your algorithms + real world applications

    GLOBAL-SCALE INSIGHT

    Explore our interactive timelapse viewer to travel back in time and see how the world has changed over the past twenty-nine years. Timelapse is one example of how Earth Engine can help gain insight into petabyte-scale datasets.

    READY-TO-USE DATASETS

    The public data archive includes more than thirty years of historical imagery and scientific datasets, updated and expanded daily. It contains over twenty petabytes of geospatial data instantly available for analysis.

    SIMPLE, YET POWERFUL API

    The Earth Engine API is available in Python and JavaScript, making it easy to harness the power of Google’s cloud for your own geospatial analysis.

    "Google Earth Engine has made it possible for the first time in history to rapidly and accurately process vast amounts of satellite imagery, identifying where and when tree cover change has occurred at high resolution. Global Forest Watch would not exist without it. For those who care about the future of the planet Google Earth Engine is a great blessing!" - Dr. Andrew Steer, President and CEO of the World Resources Institute

    CONVENIENT TOOLS

    Use our web-based code editor for fast, interactive algorithm development with instant access to petabytes of data.

    SCIENTIFIC AND HUMANITARIAN IMPACT

    Scientists and non-profits use Earth Engine for remote sensing research, predicting disease outbreaks, natural resource management, and more.

  12. Data from: Embracing the Future: Novice Software Engineers’ Perspective on...

    • figshare.com
    zip
    Updated Mar 3, 2024
    Cite
    emre ilgin; ESRA KIDIMAN; Murat YILMAZ; Filiz Mumcu (2024). Embracing the Future: Novice Software Engineers’ Perspective on the Rise of Hybrid Work Models in a Post-Pandemic World [Dataset]. http://doi.org/10.6084/m9.figshare.25331593.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 3, 2024
    Dataset provided by
    figshare
    Authors
    emre ilgin; ESRA KIDIMAN; Murat YILMAZ; Filiz Mumcu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    This dataset captures the perspectives of novice software engineers (NSEs) regarding hybrid work, examining their views on hybrid work conditions and their experiences with hybrid tools.

  13. Real-world Wireless Communication Dataset

    • kaggle.com
    Updated Apr 28, 2024
    + more versions
    Cite
    SiddSS (2024). Real-world Wireless Communication Dataset [Dataset]. https://www.kaggle.com/datasets/siddss/real-world-wireless-communication-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SiddSS
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset presents a collection of real-world RF signals encompassing three prominent wireless communication technologies: Wi-Fi (IEEE 802.11ax), LTE, and 5G. The data aims to facilitate advanced research in spectrum analysis, interference identification, and wireless communication optimization. The signals were meticulously captured under varying conditions to ensure a broad representation of real-world scenarios, including different modulation schemes, channel conditions, and data rates. This diverse collection serves as a benchmark for developers, researchers, and industry professionals striving to understand, compare, and innovate within the domains of Wi-Fi, LTE, and 5G. Potential applications range from algorithm development for signal processing, interference mitigation, signal classification, and so on.

    **Instructions:**

    Data is stored in numpy.int16 format. The Python code to read the data is included in the .rar file.
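
    As a starting point, a short sketch for reading such a file with NumPy is shown below; the file name and the assumption of interleaved I/Q samples are illustrative, so defer to the Python code shipped in the .rar archive.

        import numpy as np

        # Read raw int16 samples; "capture.bin" is a placeholder for an extracted file name.
        raw = np.fromfile("capture.bin", dtype=np.int16)
        # If the samples are interleaved I/Q pairs (an assumption), recombine them as complex values.
        iq = raw[0::2].astype(np.float32) + 1j * raw[1::2].astype(np.float32)
        print(iq.shape)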

  14. A Study of Real-world Data Races in Golang (Artifact)

    • data.niaid.nih.gov
    Updated Mar 5, 2022
    Cite
    Ramanathan, Murali Krishna (2022). A Study of Real-world Data Races in Golang (Artifact) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6329163
    Explore at:
    Dataset updated
    Mar 5, 2022
    Dataset provided by
    Ramanathan, Murali Krishna
    Chabbi, Milind
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The concurrent programming literature is rich with tools and techniques for data race detection. Less, however, has been known about real-world, industry-scale deployment, experience, and insights about data races. Golang (Go for short) is a modern programming language that makes concurrency a first-class citizen. Go offers both message passing and shared memory for communicating among concurrent threads. Go is gaining popularity in modern microservice-based systems. Data races in Go stand in the face of its emerging popularity.

    In this paper, using our industrial codebase as an example, we demonstrate that Go developers embrace concurrency and show how the abundance of concurrency alongside language idioms and nuances make Go programs highly susceptible to data races. Google's Go distribution ships with a built-in dynamic data race detector based on ThreadSanitizer. Dynamic race detectors pose scalability and flakiness challenges; we discuss various software engineering trade-offs to scale this detector to work effectively at scale.

    We have deployed this detector in our 50-million lines of Go codebase hosting 2100 distinct microservices, found over 2000 data races, fixed over 1000 data races, spanning 790 distinct code patches submitted by 210 unique developers over a six-month period. Based on a detailed investigation of these data race patterns in Go, we make seven high-level observations relating to the complex interplay between the Go language paradigm and data races.

  15. BETA-FOR_SP3_EnvironmentalAttributes_DLM/ESAWC/SRTM_2023

    • zenodo.org
    Updated Feb 18, 2025
    Cite
    Patrick Kacic; Patrick Kacic (2025). BETA-FOR_SP3_EnvironmentalAttributes_DLM/ESAWC/SRTM_2023 [Dataset]. http://doi.org/10.5281/zenodo.14850688
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Kacic; Patrick Kacic
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Oct 24, 2023
    Description
    This dataset provides additional information on environmental attributes (minimum distance to land cover classes, topographic information) based on the dataset "BETA-FOR_SPZ_Patches_2022/2023" (https://zenodo.org/records/14748236) (centroid coordinates: decimalLongitude, decimalLatitude).
    The information on environmental attributes was derived from the following three geospatial datasets:
    - DLM250 = Digital landscape model (Vector data; see the note on its spatial coverage below)
    - ESA Worldcover = Global product on land cover (Raster data, https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v100?hl=en)
    - SRTM = Global Digital Elevation Model (Raster data, https://developers.google.com/earth-engine/datasets/catalog/USGS_SRTMGL1_003?hl=en)
    The following attributes were added to the "BETA-FOR_SPZ_Patches_2022/2023" table and exported as .csv file (tabular data):
    DLM250:
    - min_dist_sie01_p = minimum distance to urban areas [m]
    - min_dist_ver01_l = minimum distance to technical infrastructure (roads) [m]
    - min_dist_veg01_f = minimum distance to agricultural areas [m]
    - min_dist_gew01_l = minimum distance to waterbodies [m]
    Please consider that the DLM250 is spatially discontinuous vector data where e.g. agricultural areas are incompletely assessed.
    ESA WorldCover (ESAWC):
    - min_dist_esawc_30 = minimum distance to grasslands (land cover class value = 30) [m]
    - min_dist_esawc_40 = minimum distance to cropland (land cover class value = 40) [m]
    SRTM:
    - SRTM_elevation = elevation [m]
    - SRTM_slope = slope [°]
    - SRTM_aspect = aspect; 90° = E, 180° = S, 270° = W, 360°/0° = N [°]
    The original vector and raster data can be made available upon request, e.g. to inspect benefits and limitations of DLM250 and ESA WorldCover.
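
    For context, a sketch of how topographic attributes like SRTM_slope and SRTM_aspect can be derived from the linked SRTM asset with the Earth Engine Python API is shown below; this is illustrative and not the authors' processing code, and the sample coordinates are arbitrary.

        import ee

        ee.Initialize()
        dem = ee.Image("USGS/SRTMGL1_003")       # SRTM asset linked above
        slope = ee.Terrain.slope(dem)            # slope [degrees]
        aspect = ee.Terrain.aspect(dem)          # aspect [degrees], 360/0 = N
        point = ee.Geometry.Point(10.45, 51.16)  # example coordinates
        vals = dem.addBands(slope).addBands(aspect).reduceRegion(ee.Reducer.first(), point, 30)
        print(vals.getInfo())
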
  16. Hello, salut!

    • kaggle.com
    Updated Mar 10, 2019
    Cite
    Stefan Bohacek (2019). Hello, salut! [Dataset]. https://www.kaggle.com/datasets/fourtonfish/hello-salut/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 10, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stefan Bohacek
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Context

    I created the original Hello, salut! API as a free service that translates the word "hello" based on a provided IP address or browser language setting. This is the dataset that powers my API.

    Example uses:

    [Image: "Hello around the world" animation, https://cdn.glitch.com/a2518d3c-4005-4f7c-997e-35c746b866e0%2Fhello-world.gif?1548593398618]

    Content

    This dataset contains the following columns (a short loading sketch follows the list):

    • code: The ISO code of a country.
    • country: The name of a country.
    • language: The ISO language code. In case of countries with more than one official language, I tried to determine the most dominant one.
    • hello: The translation of the word "hello". This is saved as an HTML-encoded string due to the original use being as a web API.

    Example HTML decoding in Python:

    import html
    print(html.unescape('&#x4F60;&#x597D;'))
    # prints 你好
    
    • lat, long: Latitude and longitude of the center of the country. (Source.)
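
    The loading sketch referenced above, with hello-salut.csv as a placeholder file name:

        import html
        import pandas as pd

        # Load the dataset (placeholder file name) and decode the HTML-encoded greetings.
        df = pd.read_csv("hello-salut.csv")
        df["hello_decoded"] = df["hello"].apply(html.unescape)
        print(df[["country", "language", "hello_decoded"]].head())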

    Acknowledgements

    This is currently a one-person effort, but I would love for others to join in!

    Inspiration

    I made this as a fun way to learn about PHP and SQL :-)

  17. MCD12Q1.061 MODIS Land Cover Type Yearly Global 500m

    • developers.google.com
    Updated Jan 1, 2023
    + more versions
    Cite
    NASA LP DAAC at the USGS EROS Center (2023). MCD12Q1.061 MODIS Land Cover Type Yearly Global 500m [Dataset]. http://doi.org/10.5067/MODIS/MCD12Q1.061
    Explore at:
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    NASA LP DAAC at the USGS EROS Center
    Time period covered
    Jan 1, 2001 - Jan 1, 2023
    Area covered
    Earth
    Description

    The Terra and Aqua combined Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type (MCD12Q1) Version 6.1 data product provides global land cover types at yearly intervals. The MCD12Q1 Version 6.1 data product is derived using supervised classifications of MODIS Terra and Aqua reflectance data. Land cover types are derived from the International Geosphere-Biosphere Programme (IGBP), University of Maryland (UMD), Leaf Area Index (LAI), BIOME-Biogeochemical Cycles (BGC), and Plant Functional Types (PFT) classification schemes. The supervised classifications then underwent additional post-processing that incorporates prior knowledge and ancillary information to further refine specific classes. Additional land cover property assessment layers are provided by the Food and Agriculture Organization (FAO) Land Cover Classification System (LCCS) for land cover, land use, and surface hydrology. Layers for Land Cover Type 1-5, Land Cover Property 1-3, Land Cover Property Assessment 1-3, Land Cover Quality Control (QC), and a Land Water Mask are also provided. Documentation: User's Guide; Algorithm Theoretical Basis Document (ATBD); General Documentation.
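
    A hedged sketch of sampling a land cover class with the Earth Engine Python API follows; the 'MODIS/061/MCD12Q1' asset id and the 'LC_Type1' band name are assumptions to verify against the catalog entry.

        import ee

        ee.Initialize()
        lc = (ee.ImageCollection("MODIS/061/MCD12Q1")  # assumed asset id
                .filterDate("2022-01-01", "2023-01-01")
                .first()
                .select("LC_Type1"))                   # assumed IGBP classification band
        point = ee.Geometry.Point(-122.45, 37.75)      # example location
        print(lc.reduceRegion(ee.Reducer.first(), point, 500).getInfo())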

  18. Global Offline Programmer Market Research and Development Focus 2025-2032

    • statsndata.org
    excel, pdf
    Updated Jul 2025
    Cite
    Stats N Data (2025). Global Offline Programmer Market Research and Development Focus 2025-2032 [Dataset]. https://www.statsndata.org/report/offline-programmer-market-369122
    Explore at:
    pdf, excel (available download formats)
    Dataset updated
    Jul 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Offline Programmer market is increasingly gaining traction as industries look for efficient and cost-effective solutions to streamline their manufacturing processes. Offline programming refers to the use of advanced software to create and simulate programming instructions for robots and automated systems without

  19. MOD11A1.061 Terra Land Surface Temperature and Emissivity Daily Global 1km

    • developers.google.com
    Updated May 1, 2018
    Cite
    NASA LP DAAC at the USGS EROS Center (2018). MOD11A1.061 Terra Land Surface Temperature and Emissivity Daily Global 1km [Dataset]. http://doi.org/10.5067/MODIS/MOD11A1.061
    Explore at:
    Dataset updated
    May 1, 2018
    Dataset provided by
    NASA LP DAAC at the USGS EROS Center
    Time period covered
    Feb 24, 2000 - Jul 29, 2025
    Area covered
    Earth
    Description

    The MOD11A1 V6.1 product provides daily land surface temperature (LST) and emissivity values in a 1200 x 1200 kilometer grid. The temperature value is derived from the MOD11_L2 swath product. Above 30 degrees latitude, some pixels may have multiple observations where the criteria for clear-sky are met. When this occurs, the pixel value is the average of all qualifying observations. Provided along with both the day-time and night-time surface temperature bands and their quality indicator layers are MODIS bands 31 and 32 and six observation layers. Documentation: User's Guide; Algorithm Theoretical Basis Document (ATBD); General Documentation.
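
    A hedged sketch of retrieving daytime LST with the Earth Engine Python API follows; the asset id, the 'LST_Day_1km' band name, and the 0.02 scale factor to Kelvin are assumptions to verify against the product documentation.

        import ee

        ee.Initialize()
        lst = (ee.ImageCollection("MODIS/061/MOD11A1")  # assumed asset id
                 .filterDate("2024-07-01", "2024-08-01")
                 .select("LST_Day_1km")                 # assumed daytime LST band
                 .mean()
                 .multiply(0.02))                       # assumed scale factor -> Kelvin
        point = ee.Geometry.Point(2.35, 48.86)          # example location
        print(lst.reduceRegion(ee.Reducer.mean(), point, 1000).getInfo())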

  20. Global IC Programmer Market Technological Advancements 2025-2032

    • statsndata.org
    excel, pdf
    Updated Jun 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stats N Data (2025). Global IC Programmer Market Technological Advancements 2025-2032 [Dataset]. https://www.statsndata.org/report/ic-programmer-market-69416
    Explore at:
    pdf, excel (available download formats)
    Dataset updated
    Jun 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Integrated Circuit (IC) Programmer market is a vital component of the electronics manufacturing and design ecosystem, serving as the bridge between innovative semiconductor designs and their practical implementation in a myriad of applications. IC Programmers are primarily used to load software onto chips, enabl
