95 datasets found
  1. Developers population worldwide 2018-2024

    • statista.com
    Updated Nov 26, 2024
    Cite
    Statista (2024). Developers population worldwide 2018-2024 [Dataset]. https://www.statista.com/statistics/627312/worldwide-developer-population/
    Explore at:
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global developer population is expected to reach 28.7 million people by 2024, an increase of 3.2 million from the number seen in 2020. According to the source, much of this growth is expected to occur in China, where the growth rate is between six and eight percent heading up to 2023.

    How much do software developers earn in the U.S.? Software developers work within a wide array of specialties, honing their skills in different programming languages, techniques, or disciplines such as design. The average salary of U.S.-based designers working in software development reached 108 thousand U.S. dollars as of June 2021, while this figure climbs to 165 thousand U.S. dollars for engineering managers. Salaries are highly dependent on location, however, with an entry-level developer working in the San Francisco Bay Area earning on average 44.79 percent more than their counterparts starting out in Austin.

    JavaScript and HTML/CSS still the most widely used languages. While programming languages continue to emerge or fall out of favor, JavaScript and HTML/CSS are mainstays of the coding landscape. In a global survey of software developers, over 60 percent of respondents reported using JavaScript and HTML/CSS. SQL, Python, and Java rounded out the top five.

  2. Most used programming languages among developers worldwide 2024

    • statista.com
    Updated Feb 6, 2025
    Cite
    Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Explore at:
    Dataset updated
    Feb 6, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 19, 2024 - Jun 20, 2024
    Area covered
    Worldwide
    Description

    As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages.

    Programming languages. At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today's society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge among the most highly desirable data science skills, which will likely assist in their search for employment.

  3. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    json, csv (available download formats)
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible, privacy-aware and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We cover six programming languages: Java, Python, Go, JavaScript, PHP, and Ruby. The commit messages in natural language are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were generated using extensive quality-focused filtering techniques (e.g., excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with …

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to cover several programming languages so that differences in the task caused by a language's degree of hardware-relatedness are represented. The dataset is provided as a large CSV file containing all samples, with the following fields: Diff, Commit Message, Hash, Project, Split.
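
    For orientation, a minimal loading sketch in Python follows; the file name commitbench.csv is a placeholder and the split label value is an assumption, not a documented constant.

        import pandas as pd

        # Load the full CommitBench CSV (placeholder file name) and inspect the provided fields.
        df = pd.read_csv("commitbench.csv")
        print(df.columns.tolist())          # expected: Diff, Commit Message, Hash, Project, Split
        train = df[df["Split"] == "train"]  # split label value is an assumption
        print(train[["Project", "Commit Message"]].head())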

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

    Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. The number of samples per programming language is shown in the table below:

    Language      Number of Samples
    Java          153,119
    Ruby          233,710
    Go            137,998
    JavaScript    373,598
    Python        472,469
    PHP           294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

    While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down and someone with a software project in the dataset deletes their repository, there can be instances that are non-reproducible.

    LIMITATIONS

    While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples; some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more) as well as our focus on English commit messages. There might be some people who only write commit messages in their native languages, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax better aligns with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

    This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman and can be found at https://techpolicylab.uw.edu/data-statements/; it was updated from the community Version 1 Markdown template by Leon Derczynski.

  4. Global sought-after database skills for developers 2021

    • statista.com
    Updated Nov 22, 2023
    Cite
    Statista (2023). Global sought-after database skills for developers 2021 [Dataset]. https://www.statista.com/statistics/793854/worldwide-developer-survey-most-wanted-database/
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 25, 2021 - Jun 15, 2021
    Area covered
    Worldwide
    Description

    According to the survey, just under 18 percent of respondents identified PostgreSQL as one of the most-wanted database skills. MongoDB ranked second, with 17.89 percent of respondents stating that they are not currently developing with it but want to.

  5. VIIRS Nighttime Day/Night Band Composites Version 1

    • developers.google.com
    Updated May 31, 2017
    + more versions
    Cite
    Earth Observation Group, Payne Institute for Public Policy, Colorado School of Mines (2017). VIIRS Nighttime Day/Night Band Composites Version 1 [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/NOAA_VIIRS_DNB_MONTHLY_V1_VCMCFG
    Explore at:
    Dataset updated
    May 31, 2017
    Dataset provided by
    Earth Observation Group, Payne Institute for Public Policy, Colorado School of Mines
    Time period covered
    Apr 1, 2012 - Mar 1, 2025
    Area covered
    Description

    Monthly average radiance composite images using nighttime data from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB). As these data are composited monthly, there are many areas of the globe where it is impossible to get good quality data coverage for that month. This can be due to …
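
    As an illustration of how this collection can be queried (not part of the dataset description), here is a sketch using the Earth Engine Python API; the asset id mirrors the catalog URL above, and the 'avg_rad' band name is an assumption to verify against the catalog entry.

        import ee

        ee.Initialize()
        # Monthly DNB composites; asset id taken from the catalog URL above.
        dnb = (ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMCFG")
                 .filterDate("2023-01-01", "2024-01-01")
                 .select("avg_rad"))  # assumed band name for average radiance
        point = ee.Geometry.Point(139.69, 35.69)  # example location
        print(dnb.mean().reduceRegion(ee.Reducer.mean(), point, 500).getInfo())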

  6. Coding Interview Platform Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Coding Interview Platform Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/coding-interview-platform-market
    Explore at:
    pptx, csv, pdf (available download formats)
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Coding Interview Platform Market Outlook



    The global coding interview platform market size was valued at approximately USD 500 million in 2023 and is expected to reach USD 1.8 billion by 2032, growing at a CAGR of 15% from 2024 to 2032. The significant growth factors driving this market include the increasing demand for skilled software developers and the rise of remote hiring practices, which have created the need for robust, scalable, and efficient coding interview platforms.



    One of the prominent growth factors in the coding interview platform market is the ever-increasing demand for software developers across various industries. As technology continues to evolve and permeate every aspect of business operations, the need for proficient coders has surged. Companies are investing heavily in recruiting top-tier coding talent to stay competitive in the digital age. This demand creates a substantial market for platforms that offer coding interview services, enabling employers to assess and hire the best talent efficiently.



    Another crucial driver is the transformation of the hiring process due to the COVID-19 pandemic, which has accelerated the adoption of remote work and virtual hiring practices. Organizations are now more inclined to use coding interview platforms to evaluate candidates remotely. These platforms provide an effective way to conduct technical assessments, reducing the need for in-person interviews and allowing companies to expand their talent pool beyond geographical constraints. This shift towards remote hiring is expected to have a lasting impact on the demand for coding interview platforms.



    Additionally, the integration of advanced technologies such as artificial intelligence and machine learning in coding interview platforms is enhancing their capabilities and making them more attractive to employers. These technologies enable the automation of candidate evaluation processes, provide deeper insights into coding skills, and offer personalized feedback to candidates. As a result, coding interview platforms are becoming more sophisticated and efficient, further driving their adoption across various industries.



    From a regional perspective, North America holds the largest share in the coding interview platform market. This dominance can be attributed to the presence of a significant number of tech giants and startups in the region, which are constantly on the lookout for skilled coders. Additionally, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The burgeoning tech industry in countries like India and China, coupled with a growing number of coding bootcamps and educational institutions, is contributing to the increasing demand for coding interview platforms in the region.



    Product Type Analysis



    The coding interview platform market can be segmented by product type into web-based and mobile-based platforms. Web-based coding interview platforms dominate the market, primarily due to their ease of accessibility and comprehensive features. These platforms are widely used by enterprises and individuals alike, thanks to their ability to offer real-time coding assessments, video interviews, and collaborative coding environments. With continuous advancements in web technologies, these platforms are becoming increasingly robust, scalable, and user-friendly, making them a preferred choice for many.



    Mobile-based coding interview platforms, on the other hand, are gaining traction, especially among younger, tech-savvy candidates and startups that emphasize flexibility and mobility. These platforms allow users to participate in coding interviews from their smartphones or tablets, offering a convenient alternative to traditional web-based platforms. The rise of mobile app development and the proliferation of smartphones are key factors driving the growth of mobile-based coding interview platforms. While not as comprehensive as their web-based counterparts, mobile-based platforms often offer streamlined, user-friendly interfaces tailored to on-the-go use.



    Both web-based and mobile-based platforms are continually evolving to meet the diverse needs of their users. For instance, web-based platforms are integrating advanced analytics and reporting features, allowing recruiters to make data-driven hiring decisions. Similarly, mobile-based platforms are incorporating features such as offline coding capabilities and push notifications to enhance user engagement and experience. The competition between these two product types is fostering innovation, resulting in more sophisticated

  7. Global AI Code Tools Market Report 2025 Edition, Market Size, Share, CAGR,...

    • cognitivemarketresearch.com
    pdf, excel, csv, ppt
    Cite
    Cognitive Market Research, Global AI Code Tools Market Report 2025 Edition, Market Size, Share, CAGR, Forecast, Revenue [Dataset]. https://www.cognitivemarketresearch.com/ai-code-tools-market-report
    Explore at:
    pdf, excel, csv, ppt (available download formats)
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global AI Code Tools Market size is USD 4.0 billion in 2024 and will expand at a compound annual growth rate (CAGR) of 22.6% from 2024 to 2031.

    Market Dynamics of AI Code Tools Market

    Key Drivers for AI Code Tools Market

    Need to Help Developers with Tough Coding Tasks - AI code tools are invaluable to software developers when dealing with complex coding assignments, and this has emerged as a powerful motivator for their proliferation in the software development scene. One key component of this support is the capacity of AI coding tools to simplify code transition, which is especially useful when dealing with legacy source code or different programming languages. A study presented at the 2021 International Conference on Intelligent User Interfaces described how generative AI provided developers with a skeletal framework for translating source code into Python. A 2022 study published in the Proceedings of the Association for Computing Machinery on Programming Languages (PACMPL) found that various tools, such as GitHub Copilot, sped up coding by providing end-of-line suggestions for function calls and argument completions.

    Increasing Adoption of Low-Code/No-Code Platforms

    Key Restraints for AI Code Tools Market

    - Complex and specialized applications
    - Overreliance on AI code tools can hamper problem-solving abilities

    Introduction of the AI Code Tools Market

    Machine learning and artificial intelligence-powered code tools are revolutionizing software development by boosting productivity and streamlining workflows. These tools aim to automate, optimize, and streamline many parts of software engineering, hence enhancing developer efficiency and accessibility. AI coding tools offer a variety of capabilities and functions. They can give developers intelligent code suggestions, allowing them to write faster and with fewer mistakes. They examine the code context and suggest suitable code snippets, function names, and variable names. In addition, growing investment in AI code tools companies is moving the AI code tools sector forward rapidly. This funding enables startups to innovate, create cutting-edge technologies, and improve their existing capabilities. With appropriate funding, these companies can expand their research, speed up product development, and provide more advanced solutions to developers. However, these technologies usually demand access to sensitive codebases and private information, which raises data and intellectual property security risks. AI code tools must adhere to strong data privacy guidelines and be safeguarded against unauthorized access. To secure their intellectual assets, developers and organizations demand strong encryption mechanisms and access controls. Therefore, data privacy and security must be considered while adopting AI code tools.

  8. Data from: CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java...

    • zenodo.org
    application/gzip, bin
    Updated Apr 28, 2025
    Cite
    Kaihang Jiang; Jin Bihui; Nie Pengyu (2025). CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories [Dataset]. http://doi.org/10.5281/zenodo.15293313
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaihang Jiang; Jin Bihui; Nie Pengyu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
    In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.

  9. Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-1C (TOA)

    • developers.google.com
    Updated Feb 15, 2024
    + more versions
    Cite
    European Union/ESA/Copernicus (2024). Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-1C (TOA) [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_HARMONIZED
    Explore at:
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    European Space Agency (http://www.esa.int/)
    Time period covered
    Jun 27, 2015 - Jul 31, 2025
    Area covered
    Description

    After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes.

    Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas. The Sentinel-2 data contain 13 UINT16 spectral bands representing TOA reflectance scaled by 10000. See the Sentinel-2 User Handbook for details.

    QA60 is a bitmask band that contained rasterized cloud mask polygons until Feb 2022, when these polygons stopped being produced. Starting in February 2024, legacy-consistent QA60 bands are constructed from the MSK_CLASSI cloud classification bands. For more details, see the full explanation of how cloud masks are computed.

    Each Sentinel-2 product (zip archive) may contain multiple granules, and each granule becomes a separate Earth Engine asset. EE asset ids for Sentinel-2 assets have the following format: COPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the first numeric part represents the sensing date and time, the second numeric part represents the product generation date and time, and the final 6-character string is a unique granule identifier indicating its UTM grid reference (see MGRS).

    The Level-2 data produced by ESA can be found in the collection COPERNICUS/S2_SR. For datasets to assist with cloud and/or cloud shadow detection, see COPERNICUS/S2_CLOUD_PROBABILITY and GOOGLE/CLOUD_SCORE_PLUS/V1/S2_HARMONIZED. For more details on Sentinel-2 radiometric resolution, see this page.
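
    A hedged usage sketch with the Earth Engine Python API follows; the QA60 bit positions (10 for opaque clouds, 11 for cirrus) follow the commonly documented convention and should be checked against the band description, and the example location is arbitrary.

        import ee

        ee.Initialize()

        def mask_s2_clouds(image):
            # QA60: bit 10 = opaque clouds, bit 11 = cirrus (assumed per the usual convention).
            qa = image.select("QA60")
            clear = qa.bitwiseAnd(1 << 10).eq(0).And(qa.bitwiseAnd(1 << 11).eq(0))
            return image.updateMask(clear).divide(10000)  # DN -> TOA reflectance (scale stated above)

        s2 = (ee.ImageCollection("COPERNICUS/S2_HARMONIZED")
                .filterDate("2024-06-01", "2024-07-01")
                .filterBounds(ee.Geometry.Point(7.45, 46.95))  # example location
                .map(mask_s2_clouds))
        print(s2.size().getInfo())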

  10. LAIL

    • figshare.com
    zip
    Updated Jul 30, 2024
    Cite
    Jia Li (2024). LAIL [Dataset]. http://doi.org/10.6084/m9.figshare.22014596.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAIL is a Large language model-Aware selection approach for In-Context-Learning-based code generation. LAIL uses LLMs themselves to select examples: it asks the LLM to label a candidate example as a positive or a negative example for a given requirement.

    Requirements
    - openai
    - tqdm
    - java

    We also provide a script (/Evaluation/evaluation_setup.sh) to help set up the programming language dependencies used in evaluation:

        bash evaluation_setup.sh

    Dataset

    The datasets contain DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories, and it aligns with real-world repositories in multiple dimensions. We therefore take DevEval as the example to demonstrate how to process the dataset (see ../Dataset/DevEval).

    train.jsonl and test.jsonl:
    (1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain.
    (2) We randomly split the tasks of the two domains into a training set and a test set, which yields 101 examples in the training set and 49 examples in the test set.
    (3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all functions of the repository.
    (4) We treat the functions contained in the repository as the candidate pool. LAIL and the baselines then retrieve a few functions from the candidate pool as demonstration examples.

    The source data and test_source data folders consist of the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts used to estimate candidate examples. The generation_prompt folder contains the constructed prompts where the demonstration examples are selected by LAIL and the different baselines. For example:
    (1) The ICL_LAIL folder provides the selected examples' ids in LAIL_id. Developers can directly use these provided prompts through codellama_completion.py to generate programs.
    (2) After generating programs, developers need to process them with process_generation.py.
    (3) Finally, developers evaluate the generated programs with the source code in the Evaluation folder.

    LAIL

    Estimate candidate examples by LLMs themselves
    We leverage LLMs themselves to estimate candidate examples. The code is stored in the LAIL/estimate_examples package. Take DevEval as an example:
    (1) The /Dataset/DevEval/estimate_prompt folder contains the constructed prompts to estimate candidate examples.
    (2) Developers run the following command to estimate candidate examples with CodeLlama-7B:

        bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt

    (3) According to the probability feedback of the LLMs, we acquire the positive and negative examples.

    Train a neural retriever
    (1) We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder:

        export CUDA_VISIBLE_DEVICES=0
        nohup python run.py \
            --output_dir=/saved_models \
            --model_type=roberta \
            --config_name=microsoft/graphcodebert-base \
            --model_name_or_path=microsoft/graphcodebert-base \
            --tokenizer_name=microsoft/graphcodebert-base \
            --do_train \
            --train_data_file=/id.jsonl \
            --epoch 100 \
            --block_size 128 \
            --train_batch_size 16 \
            --learning_rate 1e-4 \
            --max_grad_norm 1.0 \
            --seed 123456 > mbpp.txt 2>&1 &

    Select a few demonstration examples using the trained retriever
    (2) Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the /LAIL/LAIL/retriever/train folder:

        bash run_inference.sh ../Dataset/DevEval

    Code Generation
    (1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into LLMs and acquire the desired programs. For example, developers use CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py) to generate programs:

        export CUDA_VISIBLE_DEVICES=0
        torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False

    (2) After generating programs, developers need to process them with ../LAIL/ICL_LAIL/process_generation.py:

        python process_generation.py

    Baselines
    This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation.
    (1) The source code is in the baselines folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running:

        python baselines.py

    (2) Then, developers use /baselines/make_prompt.py to construct a prompt context from the selected candidate examples:

        python make_prompt.py ICLCoder ICLCoder -1

    Evaluation
    In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in LAIL/Evaluation. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the provided pipeline in /LAIL/Evaluation/ to evaluate the different approaches.

    Citation
    If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.

  11. Data from: Google Earth Engine (GEE)

    • data.amerigeoss.org
    esri rest, html
    Updated Nov 28, 2018
    + more versions
    Cite
    AmeriGEO ArcGIS (2018). Google Earth Engine (GEE) [Dataset]. https://data.amerigeoss.org/dataset/google-earth-engine-gee
    Explore at:
    esri rest, html (available download formats)
    Dataset updated
    Nov 28, 2018
    Dataset provided by
    AmeriGEO ArcGIS
    Description

    Meet Earth Engine

    Google Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth's surface.

    Satellite imagery + your algorithms + real world applications

    GLOBAL-SCALE INSIGHT

    Explore our interactive timelapse viewer to travel back in time and see how the world has changed over the past twenty-nine years. Timelapse is one example of how Earth Engine can help gain insight into petabyte-scale datasets.

    READY-TO-USE DATASETS

    The public data archive includes more than thirty years of historical imagery and scientific datasets, updated and expanded daily. It contains over twenty petabytes of geospatial data instantly available for analysis.

    SIMPLE, YET POWERFUL API

    The Earth Engine API is available in Python and JavaScript, making it easy to harness the power of Google’s cloud for your own geospatial analysis.

    "Google Earth Engine has made it possible for the first time in history to rapidly and accurately process vast amounts of satellite imagery, identifying where and when tree cover change has occurred at high resolution. Global Forest Watch would not exist without it. For those who care about the future of the planet Google Earth Engine is a great blessing!" - Dr. Andrew Steer, President and CEO of the World Resources Institute

    CONVENIENT TOOLS

    Use our web-based code editor for fast, interactive algorithm development with instant access to petabytes of data.

    SCIENTIFIC AND HUMANITARIAN IMPACT

    Scientists and non-profits use Earth Engine for remote sensing research, predicting disease outbreaks, natural resource management, and more.

  12. Data from: Embracing the Future: Novice Software Engineers’ Perspective on...

    • figshare.com
    zip
    Updated Mar 3, 2024
    Cite
    emre ilgin; ESRA KIDIMAN; Murat YILMAZ; Filiz Mumcu (2024). Embracing the Future: Novice Software Engineers’ Perspective on the Rise of Hybrid Work Models in a Post-Pandemic World [Dataset]. http://doi.org/10.6084/m9.figshare.25331593.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 3, 2024
    Dataset provided by
    figshare
    Authors
    emre ilgin; ESRA KIDIMAN; Murat YILMAZ; Filiz Mumcu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    This dataset captures the perspectives of novice software engineers (NSEs) regarding hybrid work, examining their views on hybrid work conditions and their experiences with hybrid tools.

  13. Real-world Wireless Communication Dataset

    • kaggle.com
    Updated Apr 28, 2024
    + more versions
    Cite
    SiddSS (2024). Real-world Wireless Communication Dataset [Dataset]. https://www.kaggle.com/datasets/siddss/real-world-wireless-communication-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SiddSS
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset presents a collection of real-world RF signals encompassing three prominent wireless communication technologies: Wi-Fi (IEEE 802.11ax), LTE, and 5G. The data aims to facilitate advanced research in spectrum analysis, interference identification, and wireless communication optimization. The signals were meticulously captured under varying conditions to ensure a broad representation of real-world scenarios, including different modulation schemes, channel conditions, and data rates. This diverse collection serves as a benchmark for developers, researchers, and industry professionals striving to understand, compare, and innovate within the domains of Wi-Fi, LTE, and 5G. Potential applications range from algorithm development for signal processing, interference mitigation, signal classification, and so on.

    **Instructions:**

    Data is stored in numpy.int16 format. The Python code to read the data is included in the .rar file.
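
    As a starting point, a short sketch for reading such a file with NumPy is shown below; the file name and the assumption of interleaved I/Q samples are illustrative, so defer to the Python code shipped in the .rar archive.

        import numpy as np

        # Read raw int16 samples; "capture.bin" is a placeholder for an extracted file name.
        raw = np.fromfile("capture.bin", dtype=np.int16)
        # If the samples are interleaved I/Q pairs (an assumption), recombine them as complex values.
        iq = raw[0::2].astype(np.float32) + 1j * raw[1::2].astype(np.float32)
        print(iq.shape)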

  14. A Study of Real-world Data Races in Golang (Artifact)

    • data.niaid.nih.gov
    Updated Mar 5, 2022
    Cite
    Ramanathan, Murali Krishna (2022). A Study of Real-world Data Races in Golang (Artifact) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6329163
    Explore at:
    Dataset updated
    Mar 5, 2022
    Dataset provided by
    Ramanathan, Murali Krishna
    Chabbi, Milind
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The concurrent programming literature is rich with tools and techniques for data race detection. Less, however, has been known about real-world, industry-scale deployment, experience, and insights about data races. Golang (Go for short) is a modern programming language that makes concurrency a first-class citizen. Go offers both message passing and shared memory for communicating among concurrent threads. Go is gaining popularity in modern microservice-based systems. Data races in Go stand in the face of its emerging popularity.

    In this paper, using our industrial codebase as an example, we demonstrate that Go developers embrace concurrency and show how the abundance of concurrency alongside language idioms and nuances make Go programs highly susceptible to data races. Google's Go distribution ships with a built-in dynamic data race detector based on ThreadSanitizer. Dynamic race detectors pose scalability and flakiness challenges; we discuss various software engineering trade-offs to scale this detector to work effectively at scale.

    We have deployed this detector in our 50-million lines of Go codebase hosting 2100 distinct microservices, found over 2000 data races, fixed over 1000 data races, spanning 790 distinct code patches submitted by 210 unique developers over a six-month period. Based on a detailed investigation of these data race patterns in Go, we make seven high-level observations relating to the complex interplay between the Go language paradigm and data races.

  15. BETA-FOR_SP3_EnvironmentalAttributes_DLM/ESAWC/SRTM_2023

    • zenodo.org
    Updated Feb 18, 2025
    Cite
    Patrick Kacic; Patrick Kacic (2025). BETA-FOR_SP3_EnvironmentalAttributes_DLM/ESAWC/SRTM_2023 [Dataset]. http://doi.org/10.5281/zenodo.14850688
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Kacic; Patrick Kacic
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Oct 24, 2023
    Description
    This dataset provides additional information on environmental attributes (minimum distance to land cover classes, topographic information) based on the dataset "BETA-FOR_SPZ_Patches_2022/2023" (https://zenodo.org/records/14748236) (centroid coordinates: decimalLongitude, decimalLatitude).
    The information on environmental attributes was derived from the following three geospatial datasets:
    - DLM250 = Digital landscape model (Vector data; see the note on its spatial coverage below)
    - ESA Worldcover = Global product on land cover (Raster data, https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v100?hl=en)
    - SRTM = Global Digital Elevation Model (Raster data, https://developers.google.com/earth-engine/datasets/catalog/USGS_SRTMGL1_003?hl=en)
    The following attributes were added to the "BETA-FOR_SPZ_Patches_2022/2023" table and exported as .csv file (tabular data):
    DLM250:
    - min_dist_sie01_p = minimum distance to urban areas [m]
    - min_dist_ver01_l = minimum distance to technical infrastructure (roads) [m]
    - min_dist_veg01_f = minimum distance to agricultural areas [m]
    - min_dist_gew01_l = minimum distance to waterbodies [m]
    Please consider that the DLM250 is spatially discontinuous vector data where e.g. agricultural areas are incompletely assessed.
    ESA WorldCover (ESAWC):
    - min_dist_esawc_30 = minimum distance to grasslands (land cover class value = 30) [m]
    - min_dist_esawc_40 = minimum distance to cropland (land cover class value = 40) [m]
    SRTM:
    - SRTM_elevation = elevation [m]
    - SRTM_slope = slope [°]
    - SRTM_aspect = aspect; 90° = E, 180° = S, 270° = W, 360°/0° = N [°]
    The original vector and raster data can be made available upon request, e.g. to inspect benefits and limitations of DLM250 and ESA WorldCover.
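
    For context, a sketch of how topographic attributes like SRTM_slope and SRTM_aspect can be derived from the linked SRTM asset with the Earth Engine Python API is shown below; this is illustrative and not the authors' processing code, and the sample coordinates are arbitrary.

        import ee

        ee.Initialize()
        dem = ee.Image("USGS/SRTMGL1_003")       # SRTM asset linked above
        slope = ee.Terrain.slope(dem)            # slope [degrees]
        aspect = ee.Terrain.aspect(dem)          # aspect [degrees], 360/0 = N
        point = ee.Geometry.Point(10.45, 51.16)  # example coordinates
        vals = dem.addBands(slope).addBands(aspect).reduceRegion(ee.Reducer.first(), point, 30)
        print(vals.getInfo())
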
  16. Hello, salut!

    • kaggle.com
    Updated Mar 10, 2019
    Cite
    Stefan Bohacek (2019). Hello, salut! [Dataset]. https://www.kaggle.com/datasets/fourtonfish/hello-salut/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 10, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stefan Bohacek
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Context

    I created the original Hello, salut! API as a free service that translates the word "hello" based on a provided IP address or browser language setting. This is the dataset that powers my API.

    Example uses:

    [Image: "Hello around the world" animation, https://cdn.glitch.com/a2518d3c-4005-4f7c-997e-35c746b866e0%2Fhello-world.gif?1548593398618]

    Content

    This dataset contains the following columns (a short loading sketch follows the list):

    • code: The ISO code of a country.
    • country: The name of a country.
    • language: The ISO language code. In case of countries with more than one official language, I tried to determine the most dominant one.
    • hello: The translation of the word "hello". This is saved as an HTML-encoded string due to the original use being as a web API.

    Example HTML decoding in Python:

    import html
    print(html.unescape('&#x4F60;&#x597D;'))
    # prints 你好
    
    • lat, long: Latitude and longitude of the center of the country. (Source.)
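
    The loading sketch referenced above, with hello-salut.csv as a placeholder file name:

        import html
        import pandas as pd

        # Load the dataset (placeholder file name) and decode the HTML-encoded greetings.
        df = pd.read_csv("hello-salut.csv")
        df["hello_decoded"] = df["hello"].apply(html.unescape)
        print(df[["country", "language", "hello_decoded"]].head())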

    Acknowledgements

    This is currently a one-person effort, but I would love for others to join in!

    Inspiration

    I made this as a fun way to learn about PHP and SQL :-)

  17. MCD12Q1.061 MODIS Land Cover Type Yearly Global 500m

    • developers.google.com
    Updated Jan 1, 2023
    + more versions
    Cite
    NASA LP DAAC at the USGS EROS Center (2023). MCD12Q1.061 MODIS Land Cover Type Yearly Global 500m [Dataset]. http://doi.org/10.5067/MODIS/MCD12Q1.061
    Explore at:
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    NASA LP DAAC at the USGS EROS Center
    Time period covered
    Jan 1, 2001 - Jan 1, 2023
    Area covered
    Earth
    Description

    The Terra and Aqua combined Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type (MCD12Q1) Version 6.1 data product provides global land cover types at yearly intervals. The MCD12Q1 Version 6.1 data product is derived using supervised classifications of MODIS Terra and Aqua reflectance data. Land cover types are derived from the International Geosphere-Biosphere Programme (IGBP), University of Maryland (UMD), Leaf Area Index (LAI), BIOME-Biogeochemical Cycles (BGC), and Plant Functional Types (PFT) classification schemes. The supervised classifications then underwent additional post-processing that incorporates prior knowledge and ancillary information to further refine specific classes. Additional land cover property assessment layers are provided by the Food and Agriculture Organization (FAO) Land Cover Classification System (LCCS) for land cover, land use, and surface hydrology. Layers for Land Cover Type 1-5, Land Cover Property 1-3, Land Cover Property Assessment 1-3, Land Cover Quality Control (QC), and a Land Water Mask are also provided. Documentation: User's Guide; Algorithm Theoretical Basis Document (ATBD); General Documentation.
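
    A hedged sketch of sampling a land cover class with the Earth Engine Python API follows; the 'MODIS/061/MCD12Q1' asset id and the 'LC_Type1' band name are assumptions to verify against the catalog entry.

        import ee

        ee.Initialize()
        lc = (ee.ImageCollection("MODIS/061/MCD12Q1")  # assumed asset id
                .filterDate("2022-01-01", "2023-01-01")
                .first()
                .select("LC_Type1"))                   # assumed IGBP classification band
        point = ee.Geometry.Point(-122.45, 37.75)      # example location
        print(lc.reduceRegion(ee.Reducer.first(), point, 500).getInfo())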

  18. Global Offline Programmer Market Research and Development Focus 2025-2032

    • statsndata.org
    excel, pdf
    Updated Jul 2025
    Cite
    Stats N Data (2025). Global Offline Programmer Market Research and Development Focus 2025-2032 [Dataset]. https://www.statsndata.org/report/offline-programmer-market-369122
    Explore at:
    pdf, excel (available download formats)
    Dataset updated
    Jul 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Offline Programmer market is increasingly gaining traction as industries look for efficient and cost-effective solutions to streamline their manufacturing processes. Offline programming refers to the use of advanced software to create and simulate programming instructions for robots and automated systems without

  19. MOD11A1.061 Terra Land Surface Temperature and Emissivity Daily Global 1km

    • developers.google.com
    Updated May 1, 2018
    Cite
    NASA LP DAAC at the USGS EROS Center (2018). MOD11A1.061 Terra Land Surface Temperature and Emissivity Daily Global 1km [Dataset]. http://doi.org/10.5067/MODIS/MOD11A1.061
    Explore at:
    Dataset updated
    May 1, 2018
    Dataset provided by
    NASA LP DAAC at the USGS EROS Center
    Time period covered
    Feb 24, 2000 - Jul 29, 2025
    Area covered
    Earth
    Description

    The MOD11A1 V6.1 product provides daily land surface temperature (LST) and emissivity values in a 1200 x 1200 kilometer grid. The temperature value is derived from the MOD11_L2 swath product. Above 30 degrees latitude, some pixels may have multiple observations where the criteria for clear-sky are met. When this occurs, the pixel value is the average of all qualifying observations. Provided along with both the day-time and night-time surface temperature bands and their quality indicator layers are MODIS bands 31 and 32 and six observation layers. Documentation: User's Guide; Algorithm Theoretical Basis Document (ATBD); General Documentation.
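
    A hedged sketch of retrieving daytime LST with the Earth Engine Python API follows; the asset id, the 'LST_Day_1km' band name, and the 0.02 scale factor to Kelvin are assumptions to verify against the product documentation.

        import ee

        ee.Initialize()
        lst = (ee.ImageCollection("MODIS/061/MOD11A1")  # assumed asset id
                 .filterDate("2024-07-01", "2024-08-01")
                 .select("LST_Day_1km")                 # assumed daytime LST band
                 .mean()
                 .multiply(0.02))                       # assumed scale factor -> Kelvin
        point = ee.Geometry.Point(2.35, 48.86)          # example location
        print(lst.reduceRegion(ee.Reducer.mean(), point, 1000).getInfo())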

  20. Global IC Programmer Market Technological Advancements 2025-2032

    • statsndata.org
    excel, pdf
    Updated Jun 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stats N Data (2025). Global IC Programmer Market Technological Advancements 2025-2032 [Dataset]. https://www.statsndata.org/report/ic-programmer-market-69416
    Explore at:
    pdf, excel (available download formats)
    Dataset updated
    Jun 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Integrated Circuit (IC) Programmer market is a vital component of the electronics manufacturing and design ecosystem, serving as the bridge between innovative semiconductor designs and their practical implementation in a myriad of applications. IC Programmers are primarily used to load software onto chips, enabl
