100+ datasets found
  1. Data from: On the Transferability of Pre-trained Language Models for...

    • frdr-dfdr.ca
    Updated Mar 23, 2022
  2. Most Popular Programming Languages Statistics

    • enterpriseappstoday.com
    Updated Jan 5, 2023
  3. APPS Dataset

    • paperswithcode.com
    Updated Mar 30, 2023
  4. CodeContests Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 25, 2023
  5. Python Programming Puzzles (P3) Dataset

    • paperswithcode.com
    Updated Jun 9, 2021
  6. Programming Languages

    • kaggle.com
    zip
    Updated Oct 5, 2023
  7. Most widely utilized programming languages among developers worldwide 2023

    • statista.com
    • globalrsi.net
  8. +5 Million Python & Bash Programming Submissions for 5 Courses & Grades for...

    • figshare.com
    txt
    Updated May 31, 2023
  9. Data from: Database Web Programming (Complete)

    • spectrum.library.concordia.ca
    zip
    Updated 2020
  10. Programming Languages

    • figshare.com
    zip
    Updated Jun 1, 2023
  11. Data from: Towards Usable Security Analysis Tools for Trigger-Action...

    • kilthub.cmu.edu
    txt
    Updated Jul 31, 2023
  12. Share of children registered for programming classes online in China by age...

    • statista.com
    Updated Dec 16, 2022
  13. Programming languages used for software development worldwide 2022

    • statista.com
    Updated Feb 20, 2023
  14. Global Programming Software Market Size By Product Type (Cloud Based,...

    • verifiedmarketresearch.com
    Updated May 19, 2023
  15. Data from: On-line teaching of programming languages

    • workwithdata.com
    Updated Jan 10, 2022
  16. programming-languages-keywords

    • huggingface.co
    Updated Nov 13, 2023
  17. code_contests

    • huggingface.co
    Updated Sep 17, 2022
  18. Defect Prediction - Programming Construct Usage - dataset

    • ieee-dataport.org
    Updated May 23, 2021
  19. Systems programming

    • workwithdata.com
    Updated Apr 13, 2024
  20. Data from: Programming languages : principles and practice

    • workwithdata.com
    Updated Jul 28, 2023
Cite
Chen, Fuxiang (2022). On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages [Dataset]. http://doi.org/10.20383/102.0563

Data from: On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages

Dataset updated
Mar 23, 2022
Dataset provided by
Federated Research Data Repository / dépôt fédéré de données de recherche
Authors
Chen, Fuxiang
License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically

Description

Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently displayed promising results in Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, we study the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus achieves higher performance than fine-tuning on a corpus of code written in just one programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with code in another; e.g., Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks: Code Summarization and Code Search, 2) the strategy (to select programming languages) that works well on fine-tuning multilingual PLMs for Ruby, and 3) the performance of the fine-tuned PLMs on Ruby given different code lengths. For the third question, we bin the Ruby code based on its number of tokens; understanding the performance on different code lengths will enable developers to make more informed decisions on the use of PLMs based on their code.

This dataset, which contains the PLMs and their fine-tuned models (over a hundred trained and fine-tuned models in total), was generated by researchers at the University of British Columbia, Singapore Management University, and JetBrains.
