100+ datasets found
  1. code-text-java

    • huggingface.co
    Updated Jul 6, 2023
    Cite
    Semeru Lab (2023). code-text-java [Dataset]. https://huggingface.co/datasets/semeru/code-text-java
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2023
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru

      CodeXGLUE -- Code-To-Text

      Task Definition

    The task is to generate natural language comments for a piece of code; the output is evaluated by smoothed BLEU-4 score.

      Dataset
    

    The dataset comes from CodeSearchNet and is filtered as follows:

    Remove… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-java.
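
The task definition above names smoothed BLEU-4 as the metric without giving the formula. The following is a minimal sketch of one common variant: sentence-level BLEU-4 with add-one smoothing on the 2- to 4-gram precisions. The function names, the whitespace tokenization, and the choice of smoothing are assumptions made for illustration; the evaluation script CodeXGLUE actually uses may differ in these details.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing (assumed variant)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            # Add-one smoothing keeps a missing higher-order n-gram from zeroing the score.
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precision += math.log(matches / total) / 4
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)
```

With this sketch, a generated comment identical to the reference scores 1.0, and a comment sharing no words with the reference scores 0.0.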

  2. Vulnerability Java Dataset Dataset

    • paperswithcode.com
    Updated Jul 4, 2024
    Cite
    Alexey Shestov; Rodion Levichev; Ravil Mussabayev; Evgeny Maslov; Anton Cheshkov; Pavel Zadorozhny (2024). Vulnerability Java Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/vulnerability-java-dataset
    Dataset updated
    Jul 4, 2024
    Authors
    Alexey Shestov; Rodion Levichev; Ravil Mussabayev; Evgeny Maslov; Anton Cheshkov; Pavel Zadorozhny
    Description

    The dataset consists of two versions: $X_1$ with $P_3$ and $X_1$ without $P_3$, where $P_3$ represents a set of random unchanged functions from vulnerability fixing commits. This dataset is designed for finetuning large language models to detect vulnerabilities in code. It can be used for training and evaluating models in automated vulnerability detection tasks.

  3. instructional_code-search-net-java

    • huggingface.co
    Updated May 24, 2023
    Cite
    Fernando Tarin Morales (2023). instructional_code-search-net-java [Dataset]. https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java
    Dataset updated
    May 24, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "instructional_code-search-net-java"

      Dataset Summary
    

    This is an instructional dataset for Java. The dataset contains two different kinds of tasks:

    • Given a piece of code, generate a description of what it does.
    • Given a description, generate a piece of code that fulfils the description.
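
As a rough illustration of the two task types, one (code, description) pair can be expanded into two instruction examples, one per direction. This is only a sketch: the function name, the `instruction`/`response` keys, and the prompt wording are hypothetical and are not taken from the dataset card.

```python
def build_instruction_examples(code, description):
    """Expand one (code, description) pair into both task directions.

    The dictionary keys and prompt wording are hypothetical, chosen for illustration.
    """
    return [
        {
            # Task 1: given code, generate a description.
            "instruction": "Describe what the following Java code does:\n" + code,
            "response": description,
        },
        {
            # Task 2: given a description, generate code.
            "instruction": "Write Java code that does the following:\n" + description,
            "response": code,
        },
    ]


examples = build_instruction_examples(
    "int add(int a, int b) { return a + b; }",
    "Returns the sum of two integers.",
)
```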

      Languages
    

    The dataset is in English.

      Data Splits
    

    There are no splits.

      Dataset Creation
    

    May of 2023

      Curation Rationale
    

    This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.

  4. Data from: DataTD: A Dataset of Java Projects Including Test Doubles

    • zenodo.org
    zip
    Updated Feb 3, 2025
    Cite
    Mengzhen Li; Mattia Fazzini (2025). DataTD: A Dataset of Java Projects Including Test Doubles [Dataset]. http://doi.org/10.5281/zenodo.14796282
    Available download formats: zip
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mengzhen Li; Mattia Fazzini
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    This dataset contains 1,070 open-source Java projects that include test doubles. The projects were mined from GitHub. The dataset was built by selecting all projects whose primary language is Java and that had at least five stars as of October 29, 2023. This list of projects is available in java_repositories_with_five_stars.txt. The 1,070 projects in this dataset use Maven as their build system, contain JUnit tests, and utilize Mockito to create test doubles. The projects are available in the project.zip archive file. Additionally, the dataset includes metadata about the projects, stored in the projects.json file. This metadata describes the characteristics of each project, along with test double definitions, stubbings, and verifications. Finally, we also provide the source code used to build DataTD, enabling future research on expanding and utilizing the dataset.
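
The selection criteria in the description above amount to a two-stage filter: first Java projects with at least five stars, then the subset that uses Maven, contains JUnit tests, and uses Mockito. The sketch below expresses that logic under assumed names; the `RepoInfo` record and its fields are hypothetical and do not reflect the schema of projects.json or the authors' tooling.

```python
from dataclasses import dataclass


@dataclass
class RepoInfo:
    """Hypothetical per-repository record; field names are illustrative only."""
    name: str
    primary_language: str
    stars: int
    uses_maven: bool
    has_junit_tests: bool
    uses_mockito: bool


def has_five_stars_java(repo):
    """Stage 1: primary language is Java and the project has at least five stars."""
    return repo.primary_language == "Java" and repo.stars >= 5


def selected_for_dataset(repo):
    """Stage 2: Maven build, JUnit tests, and Mockito-based test doubles."""
    return (
        has_five_stars_java(repo)
        and repo.uses_maven
        and repo.has_junit_tests
        and repo.uses_mockito
    )
```

Keeping the two stages as separate predicates mirrors the description, where the five-star Java list is published on its own before the narrower selection is applied.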

  5. DeepCom-Java Dataset

    • paperswithcode.com
    Updated Jan 17, 2022
    Cite
    (2022). DeepCom-Java Dataset [Dataset]. https://paperswithcode.com/dataset/deepcom-java
    Dataset updated
    Jan 17, 2022
    Description

    The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.

  6. ManySStuBs4J Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 10, 2021
    Cite
    Rafael-Michael Karampatsis; Charles Sutton (2021). ManySStuBs4J Dataset [Dataset]. https://paperswithcode.com/dataset/manysstubs4j
    Dataset updated
    Jun 10, 2021
    Authors
    Rafael-Michael Karampatsis; Charles Sutton
    Description

    The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques. We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug fix changes. These are single statement fixes, classified where possible into one of 16 syntactic templates which we call SStuBs. The dataset contains simple statement bugs mined from open-source Java projects hosted on GitHub. There are two variants of the dataset: one mined from the 100 Java Maven Projects and one mined from the top 1000 Java Projects.

    The dataset contains 153,652 single statement bugfix changes mined from 1,000 popular open-source Java projects, annotated by whether they match any of a set of 16 bug templates, inspired by state-of-the-art program repair techniques.

  7. code-code-translation-java-csharp

    • huggingface.co
    Updated Jul 11, 2017
    Cite
    Semeru Lab (2017). code-code-translation-java-csharp [Dataset]. https://huggingface.co/datasets/semeru/code-code-translation-java-csharp
    Dataset updated
    Jul 11, 2017
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru

      CodeXGLUE -- Code2Code Translation

      Task Definition

    Code translation aims to migrate legacy software from one programming language in a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.

  8. Data from: Dataset of Functionally Equivalent Java Methods

    • zenodo.org
    zip
    Updated Jan 28, 2022
    Cite
    Yoshiki Higo (2022). Dataset of Functionally Equivalent Java Methods [Dataset]. http://doi.org/10.5281/zenodo.5905349
    Available download formats: zip
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yoshiki Higo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of functionally equivalent Java methods.

    This dataset is published as supplemental data for the following submission:

    Yoshiki Higo, Shinsuke Matsumoto, Shinji Kusumoto, and Kazuya Yasuda, "Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques", submitted to MSR 2022.

    This dataset includes 276 groups of functionally equivalent Java methods, which have been manually verified by the authors.

    The 276 groups include 728 Java methods in total.

  9. Dataset of books called Java : a framework for programming and problem...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Java : a framework for programming and problem solving [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Java+%3A+a+framework+for+programming+and+problem+solving
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the book is Java : a framework for programming and problem solving. It features 7 columns including author, publication date, language, and book publisher.

  10. Java, New York Median Income by Age Groups Dataset: A Comprehensive...

    • neilsberg.com
    csv, json
    Updated Feb 25, 2025
    Cite
    Neilsberg Research (2025). Java, New York Median Income by Age Groups Dataset: A Comprehensive Breakdown of Java town Annual Median Income Across 4 Key Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e93cbd70-f353-11ef-8577-3860777c1fe6/
    Available download formats: csv, json
    Dataset updated
    Feb 25, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Java
    Variables measured
    Income for householder under 25 years, Income for householder 65 years and over, Income for householder between 25 and 44 years, Income for householder between 45 and 64 years
    Measurement technique
    The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It delineates income distributions across four age groups (Under 25 years, 25 to 44 years, 45 to 64 years, and 65 years and over) following an initial analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index retroactive series via current methods (R-CPI-U-RS). For additional information about these estimations, please contact us via email at research@neilsberg.com
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset presents the distribution of median household income among distinct age brackets of householders in Java town. Based on the latest 2019-2023 5-Year Estimates from the American Community Survey, it displays how income varies among householders of different ages in Java town. It showcases how household incomes typically rise as the head of the household gets older. The dataset can be utilized to gain insights into age-based household income trends and explore the variations in incomes across households.

    Key observations: Insights from 2023

    In terms of income distribution across age cohorts, in Java town, the median household income stands at $105,057 for householders within the 25 to 44 years age group, followed by $103,750 for the 45 to 64 years age group. Notably, householders within the 65 years and over age group had the lowest median household income at $59,125.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023-inflation-adjusted dollars.

    Age groups classifications include:

    • Under 25 years
    • 25 to 44 years
    • 45 to 64 years
    • 65 years and over

    Variables / Data Columns

    • Age Of The Head Of Household: This column presents the age of the head of household
    • Median Household Income: Median household income, in 2023 inflation-adjusted dollars for the specific age group

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Java town median household income by age.

  11. Dataset of Functionally Equivalent Java Methods

    • explore.openaire.eu
    Updated Jan 24, 2022
    Cite
    Yoshiki Higo (2022). Dataset of Functionally Equivalent Java Methods [Dataset]. http://doi.org/10.5281/zenodo.5896268
    Dataset updated
    Jan 24, 2022
    Authors
    Yoshiki Higo
    Description

    This is a dataset of functionally equivalent Java methods. This dataset is published as supplemental data for the following submission: MSR2022: Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques (in the double-blind manner). This dataset includes 276 groups of functionally equivalent Java methods, which have been manually verified by the authors. The 276 groups include 728 Java methods in total.

  12. GitHub Java Corpus

    • dtechtive.com
    • find.data.gov.scot
    gz, txt
    Updated Jan 10, 2017
    Cite
    University of Edinburgh: School of Informatics (2017). GitHub Java Corpus [Dataset]. http://doi.org/10.7488/ds/1690
    Available download formats: gz (0.6836 MB), gz (1836.032 MB), txt (0.0028 MB), txt (0.0166 MB)
    Dataset updated
    Jan 10, 2017
    Dataset provided by
    University of Edinburgh: School of Informatics
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GitHub Java Corpus is a snapshot of all open-source Java code on GitHub in October 2012 that is contained in open-source projects that at the time had at least one fork. It contains code from 14,785 projects amounting to about 352 million lines of code. The dataset has been used to study coding practice in Java at a large scale.

  13. Java Ocean Atlas - Reid/Mantyla Section Data, a Library of More than 2000...

    • catalog.data.gov
    Updated Jun 1, 2025
    Cite
    (Point of Contact) (2025). Java Ocean Atlas - Reid/Mantyla Section Data, a Library of More than 2000 Oceanographic Sections Developed from the Cruises Used in the Reid/Mantyla Pre-WOCE Data Set (NCEI Accession 0001456) [Dataset]. https://catalog.data.gov/dataset/java-ocean-atlas-reid-mantyla-section-data-a-library-of-more-than-2000-oceanographic-sections-d
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    (Point of Contact)
    Description

    This dataset includes data from approximately 12,000 stations that J. L. Reid and A. W. Mantyla have used in various world ocean studies. These data have been accumulated for the purpose of global ocean studies and are not intended for fine scale analyses. Each station represents the best station available for that locality at the time of the selection. The set was compiled over many years and from many sources and has been brought up to date as new data have become available. Most of the data were obtained from the National Oceanographic Data Center (NODC). The others came directly from various P.I.s in various formats and may lack some NODC parameters such as ship, country, and institution codes and NODC accession number. It should be noted that these are edited data files and an accurate account of deletions and corrections is, unfortunately, not available. In some cases these data may not agree exactly with versions published later or data supplied later by the NODC or an originator. Only stations that reach close to the bottom were chosen. This means, unfortunately, that the set is rather sparse near the equator. It is believed that the temperature and salinity measurements are acceptable. However, some of the oxygen and nutrient data are quite poor. They have not been eliminated from the data set, but are simply ignored in hand-contouring. They would have been eliminated if the troubles they cause in computer-contouring or instant atlases had been understood, but this set was begun before such methods were generally available. A few known systematic errors such as IGY oxygens or early Discovery oxygens and silicates have been adjusted, based upon deep comparisons with more modern data. In a few localities, stations have been reoccupied many times and a mean composite profile is given; at other localities, only the most recent, or the best sampled profile is saved and all others deleted.

    Because of the large-scale scope intended for this data array, some closely-spaced stations have been omitted. When needed, those stations can be retrieved from tapes of the entire cruise.

  14. code-search-net-java

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    Fernando Tarin Morales (2025). code-search-net-java [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-java
    Dataset updated
    Apr 29, 2025
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-java"

      Dataset Summary
    

    This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions that include comments found at GitHub. The summary is a short description of what the function does.

      Languages
    

    The dataset's comments are in English and the functions are coded in Java.

      Data Splits
    

    Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.

  15. Atoms of Confusion Dataset in Java Programs

    • zenodo.org
    zip
    Updated Sep 10, 2022
    Cite
    Wendell Mendes; Oton Pinheiro; Windson Viana; Lincoln Rocha; Emanuele Santos (2022). Atoms of Confusion Dataset in Java Programs [Dataset]. http://doi.org/10.5281/zenodo.7065842
    Available download formats: zip
    Dataset updated
    Sep 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wendell Mendes; Oton Pinheiro; Windson Viana; Lincoln Rocha; Emanuele Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Double-checked gold standard dataset of Atoms of Confusion in Java. Data extracted from the main source code package of four open-source projects, excluding the test files. This dataset also includes a sample created from two other open-source projects.

    Project    Version   Repository
    FastUtil   8.5.6     https://github.com/vigna/fastutil
    Moshi      1.12.0    https://github.com/square/moshi
    Jimfs      1.2       https://github.com/google/jimfs
    uCrop      2.2.7     https://github.com/Yalantis/uCrop

    The sample was created by extracting Java files from the following projects:

    Project    Version   Repository
    Guava      31.0.1    https://github.com/google/guava
    Redisson   3.6.16    https://github.com/redisson/redisson

  16. Embedding Java Classes with code2vec - Java Datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2020
    Cite
    Anonymous (2020). Embedding Java Classes with code2vec - Java Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3575196
    Dataset updated
    Jul 1, 2020
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a collection of Java class classification datasets (i.e., classify a class into one of a set of categories), collected for the research work 'Embedding Java Classes with code2vec: Improvements from Variable Obfuscation'. These are shared for further research in static code analysis tasks (malware classification, author attribution, etc).

    Obfuscation & Pipeline Code: Download

    code2vec Models: Download

  17. Dataset of books called Java 9 modularity : patterns and practices for...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Java 9 modularity : patterns and practices for developing maintainable applications [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Java+9+modularity+%3A+patterns+and+practices+for+developing+maintainable+applications
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Java 9 modularity : patterns and practices for developing maintainable applications. It features 7 columns including author, publication date, language, and book publisher.

  18. Dataset of books called Reactive programming with Java 9 : develop...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Reactive programming with Java 9 : develop concurrent and asynchronous applications with Java 9 [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Reactive+programming+with+Java+9+%3A+develop+concurrent+and+asynchronous+applications+with+Java+9
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Reactive programming with Java 9 : develop concurrent and asynchronous applications with Java 9. It features 7 columns including author, publication date, language, and book publisher.

  19. CodeQA Dataset

    • paperswithcode.com
    Updated Dec 29, 2023
    Cite
    Chenxiao Liu; Xiaojun Wan (2023). CodeQA Dataset [Dataset]. https://paperswithcode.com/dataset/codeqa
    Dataset updated
    Dec 29, 2023
    Authors
    Chenxiao Liu; Xiaojun Wan
    Description

    CodeQA is a free-form question answering dataset for the purpose of source code comprehension: given a code snippet and a question, a textual answer is required to be generated. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs.

    Description from: CodeQA: A Question Answering Dataset for Source Code Comprehension

  20. Code comments in Java, Python, and Pharo Dataset

    • paperswithcode.com
    Updated Apr 26, 2023
    Cite
    (2023). Code comments in Java, Python, and Pharo Dataset [Dataset]. https://paperswithcode.com/dataset/code-comments-in-java-python-and-pharo
    Dataset updated
    Apr 26, 2023
    Description

    It contains a dataset of class comments extracted from various projects in three programming languages: Java, Pharo, and Python.

code-text-java

semeru/code-text-java

7 scholarly articles cite this dataset.