100+ datasets found
  1. code-text-java

    • huggingface.co
    Updated Jul 6, 2023
    Cite
    Semeru Lab (2023). code-text-java [Dataset]. https://huggingface.co/datasets/semeru/code-text-java
    Dataset updated
    Jul 6, 2023
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru

      CodeXGLUE -- Code-To-Text
    
    
    
    
    
      Task Definition
    

    The task is to generate natural-language comments for a piece of code; results are evaluated by smoothed BLEU-4 score.
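
    As a rough illustration of the metric named above, the following sketch computes a sentence-level BLEU-4 with add-one smoothing on the 2-, 3- and 4-gram precisions. The function names, the whitespace tokenization, and the choice of smoothing are assumptions made for this example; the benchmark's own evaluation script may differ in all three.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing for n = 2, 3, 4.

    Assumption: whitespace tokenization and add-one smoothing; this is a
    sketch, not the benchmark's official scorer.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            # Add-one smoothing keeps higher-order precisions non-zero.
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precision_sum += math.log(matches / total)
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision_sum / 4)


reference = "returns the sum of two numbers"
score = smoothed_bleu4(reference, "returns the sum of numbers")
```

    An exact match scores 1.0, a hypothesis with no shared words scores 0.0, and partial matches such as the one above fall in between.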

      Dataset
    

    The dataset we use comes from CodeSearchNet, and we filter it as follows:

    Remove… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-java.

  2. final-java-datasets-with-source-code

    • kaggle.com
    zip
    Updated Mar 6, 2025
    Cite
    Anower Zihad (2025). final-java-datasets-with-source-code [Dataset]. https://www.kaggle.com/datasets/rollingstonn/final-java-datasets-with-source-code
    Available download formats: zip (11414845 bytes)
    Dataset updated
    Mar 6, 2025
    Authors
    Anower Zihad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Anower Zihad

    Released under CC0: Public Domain

    Contents

  3. instructional_code-search-net-java

    • huggingface.co
    Updated May 24, 2023
    Cite
    Fernando Tarin Morales (2023). instructional_code-search-net-java [Dataset]. https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java
    Dataset updated
    May 24, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "instructional_code-search-net-java"

      Dataset Summary
    

    This is an instructional dataset for Java. The dataset contains two different kinds of tasks:

    • Given a piece of code, generate a description of what it does.
    • Given a description, generate a piece of code that fulfils the description.

      Languages
    

    The dataset is in English.

      Data Splits
    

    There are no splits.

      Dataset Creation
    

    May of 2023

      Curation Rationale
    

    This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.

  4. GPT Java Dataset - Detect LLM-Written Code

    • kaggle.com
    zip
    Updated Jan 30, 2025
    Cite
    Timothy Paek (2025). GPT Java Dataset - Detect LLM-Written Code [Dataset]. https://www.kaggle.com/datasets/timothypaek/gpt-java-dataset
    Available download formats: zip (1135067 bytes)
    Dataset updated
    Jan 30, 2025
    Authors
    Timothy Paek
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GPT Java Source Code Dataset

    A dataset of 976 Java source code files for code classification: original files from 11 authors' GitHub pages, plus versions of that code rewritten by ChatGPT 3.5 and BingGPT.

    About The Project

    With the release of OpenAI's ChatGPT, code written by GPT is becoming increasingly common in everyday usage. However, students often use generated code to cheat on exams and homework. Being able to detect code written by GPT could be useful for organizations and schools as a classification or anomaly detection task. I wasn't able to find a publicly available dataset of Java source code written by GPT to train on for research purposes, so I created my own.

    Here's the general idea:

    • 666 Java source code files from 11 different authors' GitHub pages were acquired via another public dataset.
    • 5 of the 11 authors' files were passed through either ChatGPT-3.5 or Bing GPT-4 in a rewriting task.
    • The prompt: "The messages I send you will be in Java code. I want you to rewrite all of it while maintaining functionality."
    • The entirety of the file was passed through ChatGPT (no cutoff) and BingGPT (4000 character limit) without additional prompting. The resulting code was then pasted into a new file.
    • The resulting files were either saved without additional formatting or were formatted by VS Code's format-on-save setting.

    Of course, there are limitations to this dataset as code classification by an LLM is novel. However, this could be a reasonable starting point for those who want to detect GPT. Feel free to use this dataset for research or training.

    Getting Started

    Dataset Structure

    Here's a breakdown of the files in this dataset:

    • 976 total files
    • 666 files of original authors
    • 108 rewritten files using Bing GPT-4 (61 formatted, 47 non-formatted)
    • 202 rewritten files using ChatGPT-3.5 (59 formatted, 143 non-formatted)
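
    The counts in this breakdown can be cross-checked with a few lines of arithmetic. The variable names below are only illustrative; the numbers are copied from the breakdown itself.

```python
# Counts copied from the file breakdown above.
original = 666
bing_formatted, bing_unformatted = 61, 47
chatgpt_formatted, chatgpt_unformatted = 59, 143

# Per-model totals and the overall file count.
bing_total = bing_formatted + bing_unformatted
chatgpt_total = chatgpt_formatted + chatgpt_unformatted
total_files = original + bing_total + chatgpt_total

# Share of the dataset that was rewritten by an LLM.
rewritten_share = (bing_total + chatgpt_total) / total_files
```

    The subtotals come to 108 and 202 and the overall total to 976, matching the stated figures; roughly 32% of the files are LLM rewrites, so a classifier trained on this data faces a moderately imbalanced split.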

    Citation

    If you use this dataset, please cite:

    @misc{P24_Java,
     author = {Paek, Timothy},
     title = {GPT Java Dataset: A Dataset for LLM-Generated Code Detection},
     year = {2024},
     howpublished = {GitHub Repository},
     url = {https://github.com/tipaek/GPT-Java-Dataset}
    }
    

    Contact

    Timothy Paek - Linked-In - tipaek@syr.edu

    Acknowledgments

    What I used in making this dataset:

  5. the-stack-java-clean

    • huggingface.co
    Updated Jul 9, 2023
    Cite
    Ammar Khairi (2023). the-stack-java-clean [Dataset]. https://huggingface.co/datasets/ammarnasr/the-stack-java-clean
    Dataset updated
    Jul 9, 2023
    Authors
    Ammar Khairi
    License

    https://choosealicense.com/licenses/openrail/

    Description

    Dataset 1: TheStack - Java - Cleaned

    Description: This dataset is drawn from TheStack Corpus, an open-source code dataset with over 3TB of GitHub data covering 48 programming languages. We selected a small portion of this dataset to optimize smaller language models for Java, a popular statically typed language.

    Target Language: Java

    Dataset Size:

    • Training: 900,000 files
    • Validation: 50,000 files
    • Test: 50,000 files

    Preprocessing:

    Selected Java as the target language due to its… See the full description on the dataset page: https://huggingface.co/datasets/ammarnasr/the-stack-java-clean.

  6. ManySStuBs4J Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, txt
    Updated Feb 7, 2020
    Cite
    Rafael-Michael Karampatsis; Charles Sutton (2020). ManySStuBs4J Dataset [Dataset]. http://doi.org/10.5281/zenodo.3653444
    Available download formats: txt, bin, csv
    Dataset updated
    Feb 7, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rafael-Michael Karampatsis; Charles Sutton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques.
    We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug fix changes.
    These are single statement fixes, classified where possible into one of 16 syntactic templates which we call SStuBs.
    The dataset contains simple statement bugs mined from open-source Java projects hosted on GitHub.
    There are two variants of the dataset: one mined from 100 Java Maven projects and one mined from the top 1000 Java projects.
    A project's popularity is determined by computing the sum of z-scores of its forks and watchers.
    We kept only bug commits that contain only single-statement changes and ignored stylistic differences such as spaces or empty lines, as well as differences in comments.
    Some single-statement changes can be caused by refactorings, like changing a variable name, rather than bug fixes.
    We attempted to detect and exclude refactorings such as variable, function, and class renamings, function argument renamings or changing the number of arguments in a function.
    The commits are classified as bug fixes or not by checking if the commit message contains any of a set of predetermined keywords such as bug, fix, fault etc.
    We evaluated the accuracy of this method on a random sample of 100 commits that contained SStuBs from the smaller version of the dataset and found it to achieve a satisfactory 94% accuracy.
    This method has also been used before to extract bug datasets (Ray et al., 2015; Tufano et al., 2018) where it achieved an accuracy of 96% and 97.6% respectively.
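
    The popularity measure described above (the sum of the z-scores of a project's forks and watchers) can be sketched as follows. The field names, the sample projects, and the use of the population standard deviation are assumptions made for illustration; the description does not say which variant of the standard deviation was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values as (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All values identical: every z-score is zero by convention here.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_by_popularity(projects):
    """Sort projects by z(forks) + z(watchers), most popular first."""
    fork_z = z_scores([p["forks"] for p in projects])
    watcher_z = z_scores([p["watchers"] for p in projects])
    scored = [
        (p["name"], fz + wz) for p, fz, wz in zip(projects, fork_z, watcher_z)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical projects, for illustration only.
projects = [
    {"name": "alpha", "forks": 1200, "watchers": 5000},
    {"name": "beta", "forks": 300, "watchers": 9000},
    {"name": "gamma", "forks": 50, "watchers": 400},
]
ranking = rank_by_popularity(projects)
```

    Standardizing each signal before summing keeps the larger-valued one (here, watchers) from dominating the ranking purely because of its scale.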

    The bugs are stored in a JSON file (each version of the dataset has its own instance of this file).
    Any bugs that fit one of 16 patterns are also annotated with the pattern(s) they fit in a separate JSON file (each version of the dataset has its own instance of this file).
    We refer to bugs that fit any of the 16 patterns as simple stupid bugs (SStuBs).

    For more information on extracting the dataset and a detailed documentation of the software visit our GitHub repo: https://github.com/mast-group/SStuBs-mining

  7. code-search-net-java

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    Fernando Tarin Morales (2025). code-search-net-java [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-java
    Dataset updated
    Apr 29, 2025
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-java"

      Dataset Summary
    

    This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments found on GitHub. The summary is a short description of what the function does.

      Languages
    

    The dataset's comments are in English, and the functions are written in Java.

      Data Splits
    

    Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.

  8. Embedding Java Classes with code2vec - Java Datasets

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 1, 2020
    Cite
    Anonymous (2020). Embedding Java Classes with code2vec - Java Datasets [Dataset]. http://doi.org/10.5281/zenodo.3575197
    Available download formats: zip
    Dataset updated
    Jul 1, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a collection of Java class classification datasets (i.e., classify a class into one of a set of categories), collected for the research work 'Embedding Java Classes with code2vec: Improvements from Variable Obfuscation'. These are shared for further research in static code analysis tasks (malware classification, author attribution, etc).

    Obfuscation & Pipeline Code: Download

    code2vec Models: Download

  9. AVATAR: Java-Python Program Translation Dataset

    • kaggle.com
    zip
    Updated Dec 3, 2022
    Cite
    makeitsimple (2022). AVATAR: Java-Python Program Translation Dataset [Dataset]. https://www.kaggle.com/datasets/hetulvpatel/avatar-javapython-program-translation-dataset
    Available download formats: zip (8699485 bytes)
    Dataset updated
    Dec 3, 2022
    Authors
    makeitsimple
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    What is AVATAR?

    Paper | Code

    • AVATAR stands for jAVA-pyThon progrAm tRanslation.
    • AVATAR is a corpus of 9,515 programming problems and their solutions written in Java and Python.

    Files Description

    • {{language}}_programms_{{split}}.tfrecord: Programs for unsupervised pretraining in Java and Python, divided into train, valid, and test splits.

      Keys: code (the source code) and language (the language name).

  10. Java, SD Population Breakdown by Gender Dataset: Male and Female Population...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    Cite
    Neilsberg Research (2025). Java, SD Population Breakdown by Gender Dataset: Male and Female Population Distribution // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/b23b79ff-f25d-11ef-8c1b-3860777c1fe6/
    Available download formats: csv, json
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Dakota, Java
    Variables measured
    Male Population, Female Population, Male Population as Percent of Total Population, Female Population as Percent of Total Population
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Java by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Java across both sexes and to determine which sex constitutes the majority.

    Key observations

    There is a considerable majority of female population, with 65.66% of total population being female. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Scope of gender :

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported from the Census Bureau.

    Variables / Data Columns

    • Gender: This column displays the Gender (Male / Female)
    • Population: The population of the gender in Java is shown in this column.
    • % of Total Population: This column displays the percentage distribution of each gender as a proportion of Java's total population. Please note that the sum of all percentages may not equal 100% due to rounding of values.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Java Population by Race & Ethnicity.

  11. Java, New York Population Breakdown by Gender and Age Dataset: Male and...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    Cite
    Neilsberg Research (2025). Java, New York Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e1e8e49e-f25d-11ef-8c1b-3860777c1fe6/
    Available download formats: json, csv
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Java, New York
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Java town by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Java town. The dataset can be utilized to understand the population distribution of Java town by gender and age. For example, using this dataset, we can identify the largest age group for both Men and Women in Java town. Additionally, it can be used to see how the gender ratio changes from birth to senior most age group and male to female ratio across each age group for Java town.

    Key observations

    Largest age group (population): Male # 40-44 years (139) | Female # 65-69 years (126). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender :

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Java town population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population in Java town.
    • Population (Female): This column shows the female population in Java town.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Java town for each age group.
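
    The Gender Ratio column is defined as males per 100 females, which can be derived from the two population columns. The sketch below uses made-up counts and hypothetical field names; it is not taken from the dataset itself.

```python
def gender_ratio(male, female):
    """Number of males per 100 females; None when there are no females."""
    if female == 0:
        return None  # the ratio is undefined for an empty denominator
    return 100.0 * male / female


# Hypothetical counts, for illustration only.
rows = [
    {"age_group": "Under 5 years", "male": 40, "female": 50},
    {"age_group": "5 to 9 years", "male": 63, "female": 60},
]
for row in rows:
    row["gender_ratio"] = gender_ratio(row["male"], row["female"])
```

    A value below 100 means women outnumber men in that age group, and a value above 100 means the reverse.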

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Java town Population by Gender.

  12. Dataset of books called Machine learning in Java : helpful techniques to...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Machine learning in Java : helpful techniques to design, build, and deploy powerful machine learning applications in Java [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Machine+learning+in+Java+%3A+helpful+techniques+to+design%2C+build%2C+and+deploy+powerful+machine+learning+applications+in+Java
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Machine learning in Java : helpful techniques to design, build, and deploy powerful machine learning applications in Java. It features 7 columns including author, publication date, language, and book publisher.

  13. Dataset of books series that contain Java language reference

    • workwithdata.com
    Updated Nov 25, 2024
    Cite
    Work With Data (2024). Dataset of books series that contain Java language reference [Dataset]. https://www.workwithdata.com/datasets/book-series?f=1&fcol0=j0-book&fop0=%3D&fval0=Java+language+reference&j=1&j0=books
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered where the book is Java language reference. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  14. Dataset of books called Java 2 in plain English

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Java 2 in plain English [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Java+2+in+plain+English
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Java 2 in plain English. It features 7 columns including author, publication date, language, and book publisher.

  15. Java Apple Leaf Dataset

    • universe.roboflow.com
    zip
    Updated Apr 25, 2025
    Cite
    East West University BD (2025). Java Apple Leaf Dataset [Dataset]. https://universe.roboflow.com/east-west-university-bd/java-apple-leaf-dataset-lxfjd/dataset/1
    Available download formats: zip
    Dataset updated
    Apr 25, 2025
    Dataset authored and provided by
    East West University BD
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Java Apple Leaf Dataset
    Description

    Java Apple Leaf Dataset

    ## Overview
    
    Java Apple Leaf Dataset is a dataset for classification tasks - it contains Java Apple Leaf Dataset annotations for 1,102 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  16. Data from: DataTD: A Dataset of Java Projects Including Test Doubles

    • zenodo.org
    zip
    Updated Dec 4, 2024
    Cite
    Mengzhen Li; Mattia Fazzini (2024). DataTD: A Dataset of Java Projects Including Test Doubles [Dataset]. http://doi.org/10.5281/zenodo.14271508
    Available download formats: zip
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mengzhen Li; Mattia Fazzini
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    This dataset contains 1,070 open-source Java projects including test doubles. The projects were mined from GitHub. The starting point for building the dataset included all projects whose main language is Java and that had at least five stars as of October 29, 2023. This set of projects is listed in java_repositories_with_five_stars.txt. The 1,070 projects comprising this dataset use Maven as their build system, contain JUnit tests, and use Mockito to create test doubles. The projects are available in the project.zip archive file. The dataset also contains metadata about the projects, which is available in the projects.json file. The metadata describes the characteristics of each project together with the test double definitions, stubbings, and verifications inside the project. Finally, we also make available the source code used to build DataTD for future research on using and extending the dataset.

  17. Exploring Open Source Java Technical Debt

    • kaggle.com
    zip
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Exploring Open Source Java Technical Debt [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-open-source-java-technical-debt
    Available download formats: zip (1639103449 bytes)
    Dataset updated
    Feb 11, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Exploring Open Source Java Technical Debt

    Analyzing 55 Million Files

    About this dataset

    This dataset contains a treasure-trove of information on over 55 million open source Java files, providing technical debt-related insights that can be used to inform a range of research and analytical activities. Every file captured in the dataset is assigned an MD5-hash to ensure unique identification, along with key metrics including its technical debt probability, fan-in/fan-out levels, total methods & variables, lines of code & comment lines, and the number of occurrences recorded.

    These data points can each provide important guidance on the magnitude and scope of technical debt in open source Java software development projects. Researchers can analyse correlations between technical debt probability and levels of fan-in/fan-out, as well as variables such as the number of methods and lines of code. Analysts can also identify files with a high impact on code quality by comparing their positions in both the technical debt probability rankings and the highest occurrence rankings.

    Utilizing this comprehensive dataset opens up opportunities for a wide range of investigations that seek a better understanding of the complex relationships between software development practices and code quality. It is a useful resource for anyone looking to gain insights into this subject matter.

    How to use the dataset

    The dataset contains several columns, including file_md5 (a unique identifier for each file), td_probability (the probability that the file contains technical debt), fanin (the number of incoming dependencies for the file), fanout (the number of outgoing dependencies for the file), total methods and variables, and total lines of code and comment lines. Researchers or analysts can run statistical analyses on these parameters to get an overall idea of how the values relate to code quality. They can also look for correlations between values such as the fan-in/fan-out ratio and sums or averages of the methods and variables used in a particular set of files. Finally, the occurrences column records how many times a particular MD5 hash appears in open source repositories, which could help identify files that are widely reused across projects.

    By examining these columns together you will be able to gain insight into trends related to technical debt in Open Source Java programs as well as identify key areas where there is potential danger/challenges associated with implementation within your own projects. With enough data manipulation you may even make predictions regarding future implementation based on past experiences!
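
    As a rough illustration of the correlation analysis described above, the following Python sketch computes a Pearson correlation between fanout and td_probability. The column names file_md5, td_probability, fanin and fanout come from the description; the sample rows and the helper names load_columns and pearson are made up for this example and do not come from the dataset.

```python
import csv
import io
import math

# Hypothetical sample rows for illustration only; the real data lives in
# TD_of_55M_files.csv and its values will differ.
SAMPLE_CSV = """file_md5,td_probability,fanin,fanout
a1,0.10,1,2
b2,0.40,3,5
c3,0.80,2,9
d4,0.95,0,12
"""


def load_columns(handle, x_col, y_col):
    """Read two numeric columns from a CSV file object with a header row."""
    xs, ys = [], []
    for row in csv.DictReader(handle):
        xs.append(float(row[x_col]))
        ys.append(float(row[y_col]))
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs, ys = load_columns(io.StringIO(SAMPLE_CSV), "fanout", "td_probability")
r = pearson(xs, ys)
print(f"Pearson r between fanout and td_probability: {r:.3f}")
```

    To try this on the real file, the in-memory sample could be swapped for open("TD_of_55M_files.csv", newline=""), assuming the file has a header row with these column names. With tens of millions of rows, a streaming or chunked computation would likely be preferable to holding both columns in lists.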

    Research Ideas

    • Correlating technical debt probability and lines of code or variables to determine how additional code complexity impacts the magnitude of technical debt.
    • Identifying files with a high probability of technical debt which have been used in multiple projects, so that those files may be improved to help future projects.
    • Analyzing the average fan-in and fan-out for different programming paradigms, such as MVC, to determine if any design patterns produce higher degrees of technical debt than other paradigms or architectures
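
    The second idea above, finding widely reused files with a high technical debt probability, could be sketched as a simple filter. The column name occurrences, the thresholds, the sample rows and the function name shortlist are all assumptions for illustration; the description mentions an occurrences column but its exact header is not shown.

```python
import csv
import io

# Hypothetical sample rows; "occurrences" is an assumed column name.
SAMPLE_CSV = """file_md5,td_probability,occurrences
a1,0.95,40
b2,0.90,1
c3,0.20,75
d4,0.85,12
"""


def shortlist(handle, min_probability, min_occurrences):
    """Return (file_md5, td_probability, occurrences) tuples that meet both
    thresholds, ordered from most to least frequently occurring."""
    selected = []
    for row in csv.DictReader(handle):
        probability = float(row["td_probability"])
        count = int(row["occurrences"])
        if probability >= min_probability and count >= min_occurrences:
            selected.append((row["file_md5"], probability, count))
    selected.sort(key=lambda item: item[2], reverse=True)
    return selected


hits = shortlist(io.StringIO(SAMPLE_CSV), min_probability=0.8, min_occurrences=10)
for file_md5, probability, count in hits:
    print(file_md5, probability, count)
```

    The thresholds 0.8 and 10 are arbitrary; sensible cut-offs would depend on how td_probability and the occurrence counts are distributed in the actual data.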

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: TD_of_55M_files.csv

    | Column name    | Description                                                                                                          |
    |:---------------|:---------------------------------------------------------------------------------------------------------------------|
    | file_md5       | A unique identifier for each file that can also be used to track them across repositories or other sources. (String) |
    | td_probability | The p...                                                                                                             |

  18. Details of dataset information.

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Details of dataset information. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.

  19. Dataset of book subjects that contain The Java tutorial : object-oriented...

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain The Java tutorial : object-oriented programming for the Internet [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=The+Java+tutorial+:+object-oriented+programming+for+the+Internet&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 3 rows and is filtered where the book is The Java tutorial : object-oriented programming for the Internet. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  20. GitHub Java Corpus

    • dtechtive.com
    • find.data.gov.scot
    gz, txt
    Updated Jan 10, 2017
    Cite
    University of Edinburgh: School of Informatics (2017). GitHub Java Corpus [Dataset]. http://doi.org/10.7488/ds/1690
    Explore at:
    Available download formats: gz (0.6836 MB), gz (1836.032 MB), txt (0.0028 MB), txt (0.0166 MB)
    Dataset updated
    Jan 10, 2017
    Dataset provided by
    University of Edinburgh: School of Informatics
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GitHub Java Corpus is a snapshot of all open-source Java code on GitHub in October 2012 that is contained in open-source projects that at the time had at least one fork. It contains code from 14,785 projects amounting to about 352 million lines of code. The dataset has been used to study coding practice in Java at a large scale.

Cite
Semeru Lab (2023). code-text-java [Dataset]. https://huggingface.co/datasets/semeru/code-text-java

code-text-java

semeru/code-text-java

Explore at:
6 scholarly articles cite this dataset (View in Google Scholar)
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 6, 2023
Dataset authored and provided by
Semeru Lab
License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

Dataset is imported from CodeXGLUE and pre-processed using their script.

  Where to find in Semeru:

The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru

  CodeXGLUE -- Code-To-Text

  Task Definition

The task is to generate natural language comments for code; the output is evaluated by smoothed BLEU-4 score.
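
To make the metric concrete, here is a minimal Python sketch of a sentence-level BLEU-4 with add-one smoothing on each n-gram precision. This is one common smoothing variant and is only an assumption here: the exact smoothing and tokenization used by the CodeXGLUE evaluation script may differ, and the function names ngrams and smoothed_bleu4 are invented for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing (an assumed variant).

    Both arguments are whitespace-tokenized strings; a single reference
    is supported.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference.
        matches = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the precision non-zero when nothing matches.
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / 4)


same = smoothed_bleu4("returns the sum of two numbers", "returns the sum of two numbers")
different = smoothed_bleu4("w x y z", "a b c d")
print(same, different)
```

Without smoothing, a generated comment that shares no 4-gram with the reference would score zero regardless of its lower-order overlap, which is why a smoothed variant is typically used for short outputs such as code comments.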

  Dataset

The dataset we use comes from CodeSearchNet and we filter the dataset as the following:

Remove… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-java.
