100+ datasets found
  1. Data from: On the Transferability of Pre-trained Language Models for...

     Updated Mar 23, 2022
  2. Most Popular Programming Languages Statistics

     Updated Jan 5, 2023
  3. APPS Dataset

     Updated Mar 30, 2023
  4. CodeContests Dataset

     Updated Dec 25, 2023
  5. Python Programming Puzzles (P3) Dataset

     Updated Jun 9, 2021
  6. Programming Languages

     Updated Oct 5, 2023
  7. Most widely utilized programming languages among developers worldwide 2023

  8. +5 Million Python & Bash Programming Submissions for 5 Courses & Grades for...

     Updated May 31, 2023
  9. Data from: Database Web Programming (Complete)

     Updated 2020
  10. Programming Languages

      Updated Jun 1, 2023
  11. Data from: Towards Usable Security Analysis Tools for Trigger-Action...

      Updated Jul 31, 2023
  12. Share of children registered for programming classes online in China by age...

      Updated Dec 16, 2022
  13. Programming languages used for software development worldwide 2022

      Updated Feb 20, 2023
  14. Global Programming Software Market Size By Product Type (Cloud Based,...

      Updated May 19, 2023
  15. Data from: On-line teaching of programming languages

      Updated Jan 10, 2022
  16. code_contests

      Updated Sep 17, 2022
  17. Defect Prediction - Programming Construct Usage - dataset

      Updated May 23, 2021
  18. Systems programming

      Updated Apr 13, 2024
  19. Data from: Programming languages : principles and practice

      Updated Jul 28, 2023
Chen, Fuxiang (2022). On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages [Dataset].

Data from: On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages

Dataset updated: Mar 23, 2022
Dataset provided by: Federated Research Data Repository / dépôt fédéré de données de recherche; Chen, Fuxiang
License: CC0 1.0 Universal Public Domain Dedication (license information was derived automatically)


Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently displayed promising results on Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, our focus is on studying the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus achieves higher performance than fine-tuning on a corpus written in just one programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with another; e.g., Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks: Code Summarization and Code Search; 2) the strategy (for selecting programming languages) that works well when fine-tuning multilingual PLMs for Ruby; and 3) the performance of the fine-tuned PLMs on Ruby given different code lengths. Here, we bin the Ruby code based on its number of tokens; understanding performance at different code lengths will enable developers to make more informed decisions about using PLMs on their code.
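The length-based analysis described above bins Ruby code by token count. A minimal sketch of such binning, using plain whitespace tokenization as a stand-in for the PLM's actual subword tokenizer (the bin edges here are illustrative, not the study's):

```python
from collections import defaultdict

def bin_label(n_tokens, edges=(50, 100, 200)):
    """Return a length-bin label such as '0-49', '50-99', '100-199', or '200+'."""
    prev = 0
    for edge in edges:
        if n_tokens < edge:
            return f"{prev}-{edge - 1}"
        prev = edge
    return f"{edges[-1]}+"

def bin_by_token_count(snippets, edges=(50, 100, 200)):
    """Group code snippets into length bins by whitespace token count."""
    bins = defaultdict(list)
    for code in snippets:
        bins[bin_label(len(code.split()), edges)].append(code)
    return dict(bins)

ruby_snippets = [
    "def add(a, b)\n  a + b\nend",   # short snippet, lands in the smallest bin
    " ".join(["x ="] * 60),          # 120 whitespace tokens, lands in '100-199'
]
grouped = bin_by_token_count(ruby_snippets)
```

Per-bin evaluation then simply scores each model on the snippets in each bin, revealing whether performance degrades on longer code.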

This dataset, containing the PLMs and their fine-tuned models (over a hundred trained and fine-tuned models in total), was generated by researchers at the University of British Columbia, Singapore Management University, and JetBrains.
