License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru
CodeXGLUE -- Code-To-Text
Task Definition
The task is to generate natural language comments for a piece of code; it is evaluated by smoothed BLEU-4 score.
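As a rough illustration of the metric, here is a minimal smoothed BLEU-4 sketch in plain Python (add-one smoothing on the n-gram precisions plus the standard brevity penalty; the actual CodeXGLUE evaluation script differs in its details):

```python
from collections import Counter
import math

def smoothed_bleu4(reference: str, candidate: str) -> float:
    """Minimal smoothed BLEU-4: add-one smoothing on n-gram precisions
    combined with the standard brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        # add-one smoothing keeps each precision non-zero for short outputs
        log_prec_sum += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short candidate comments
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / 4)
```

An identical reference and candidate score 1.0; a partial overlap lands strictly between 0 and 1.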
Dataset
The dataset we use comes from CodeSearchNet, filtered as follows:
Remove… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-java.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created by Anower Zihad
Released under CC0: Public Domain
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
Given a piece of code, generate a description of what it does. Given a description, generate a piece of code that fulfils it.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
A dataset for code classification composed of 976 total Java source code files: code from 11 authors' GitHub pages plus versions rewritten by ChatGPT 3.5 and BingGPT.
With the release of OpenAI's ChatGPT, code written by GPT is becoming increasingly common in everyday usage. However, students often use generated code to cheat on exams and homework. Being able to detect code written by GPT could be useful for organizations and schools, framed as either a classification or an anomaly detection task. I wasn't able to find a publicly available online dataset of Java source code written by GPT for research purposes, so I created my own.
Here's the general idea:
* 666 Java source code files from 11 different authors' GitHub pages were acquired via another public dataset.
* 5 of the 11 authors' files were passed through either ChatGPT-3.5 or Bing GPT-4 in a rewriting task.
* The prompt: "The messages I send you will be in Java code. I want you to rewrite all of it while maintaining functionality."
* The entirety of each file was passed through ChatGPT (no cutoff) and BingGPT (4,000-character limit) without additional prompting. The resulting code was then pasted into a new file.
* The resulting files were either saved without additional formatting or were formatted by VSCode's format-on-save setting.
Of course, there are limitations to this dataset, as classifying LLM-generated code is a novel task. However, it could be a reasonable starting point for those who want to detect GPT-written code. Feel free to use this dataset for research or training.
Here's a breakdown of the files in this dataset:
* 976 total files
* 666 files from the original authors
* 108 rewritten files using Bing GPT-4 (61 formatted, 47 non-formatted)
* 202 rewritten files using ChatGPT-3.5 (59 formatted, 143 non-formatted)
If you use this dataset, please cite:
@misc{P24_Java,
author = {Paek, Timothy},
title = {GPT Java Dataset: A Dataset for LLM-Generated Code Detection},
year = {2024},
howpublished = {GitHub Repository},
url = {https://github.com/tipaek/GPT-Java-Dataset}
}
Timothy Paek - LinkedIn - tipaek@syr.edu
What I used in making this dataset:
License: OpenRAIL (https://choosealicense.com/licenses/openrail/)
Dataset 1: TheStack - Java - Cleaned
Description: This dataset is drawn from TheStack Corpus, an open-source code dataset with over 3TB of GitHub data covering 48 programming languages. We selected a small portion of this dataset to optimize smaller language models for Java, a popular statically typed language.
Target Language: Java
Dataset Size:
- Training: 900,000 files
- Validation: 50,000 files
- Test: 50,000 files
Preprocessing:
Selected Java as the target language due to its… See the full description on the dataset page: https://huggingface.co/datasets/ammarnasr/the-stack-java-clean.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques.
We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug fix changes.
These are single statement fixes, classified where possible into one of 16 syntactic templates which we call SStuBs.
The dataset contains simple statement bugs mined from open-source Java projects hosted in GitHub.
There are two variants of the dataset. One mined from the 100 Java Maven Projects and one mined from the top 1000 Java Projects.
A project's popularity is determined by computing the sum of z-scores of its forks and watchers.
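The popularity score described above can be sketched as follows (the project names and counts below are made up for illustration):

```python
import statistics

def z_scores(values):
    """Standard z-score for each value: (x - mean) / stdev."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

def rank_by_popularity(projects):
    """projects: list of (name, forks, watchers) tuples.
    Popularity = z-score of forks + z-score of watchers."""
    fork_z = z_scores([p[1] for p in projects])
    watch_z = z_scores([p[2] for p in projects])
    scored = [(name, fz + wz) for (name, _, _), fz, wz in zip(projects, fork_z, watch_z)]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Hypothetical sample repositories
projects = [("proj-a", 1200, 5000), ("proj-b", 300, 900), ("proj-c", 50, 120)]
ranking = rank_by_popularity(projects)
```

Summing z-scores rather than raw counts keeps forks and watchers on comparable scales, so neither metric dominates the ranking.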
We kept only bug commits that contain only single statement changes and ignore stylistic differences such as whitespace or empty lines, as well as differences in comments.
Some single-statement changes can be caused by refactorings, such as renaming a variable, rather than bug fixes.
We attempted to detect and exclude refactorings such as variable, function, and class renamings, function argument renamings or changing the number of arguments in a function.
The commits are classified as bug fixes or not by checking if the commit message contains any of a set of predetermined keywords such as bug, fix, fault etc.
We evaluated the accuracy of this method on a random sample of 100 commits that contained SStuBs from the smaller version of the dataset and found it to achieve a satisfactory 94% accuracy.
This method has also been used before to extract bug datasets (Ray et al., 2015; Tufano et al., 2018) where it achieved an accuracy of 96% and 97.6% respectively.
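A minimal sketch of this keyword-based classification follows (the exact keyword list used by the authors is not given here, so the set below is an assumption):

```python
# Hypothetical keyword list; the corpus's exact set may differ.
BUG_KEYWORDS = {"bug", "fix", "fixes", "fixed", "fault", "error", "issue", "patch"}

def is_bug_fix(commit_message: str) -> bool:
    """Classify a commit as a bug fix if its message contains any keyword."""
    tokens = {t.strip(".,:;!?()[]").lower() for t in commit_message.split()}
    return not tokens.isdisjoint(BUG_KEYWORDS)
```

Such keyword heuristics are cheap and, per the accuracy figures above, surprisingly effective, though they inevitably miss fixes described without any of the chosen words.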
The bugs are stored in a JSON file (each version of the dataset has its own instance of this file).
Any bugs that fit one of the 16 patterns are also annotated by which pattern(s) they fit in a separate JSON file (again, one instance per dataset version).
We refer to bugs that fit any of the 16 patterns as simple stupid bugs (SStuBs).
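Reading the annotated JSON might look like the following sketch (the field names and records here are illustrative stand-ins, not taken from the actual files):

```python
import json
from collections import Counter

# A tiny inline sample mimicking the annotated JSON; the field names
# (projectName, bugType) and values are hypothetical.
sample = '''[
  {"projectName": "acme/widgets", "bugType": "CHANGE_IDENTIFIER"},
  {"projectName": "acme/widgets", "bugType": "WRONG_FUNCTION_NAME"},
  {"projectName": "acme/gizmos", "bugType": "CHANGE_IDENTIFIER"}
]'''

bugs = json.loads(sample)
# Count how often each SStuB pattern occurs in the sample
pattern_counts = Counter(b["bugType"] for b in bugs)
```

With the real files, `json.loads(sample)` would be replaced by loading the per-version JSON file shipped with the dataset.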
For more information on extracting the dataset and a detailed documentation of the software visit our GitHub repo: https://github.com/mast-group/SStuBs-mining
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for "code-search-net-java"
Dataset Summary
This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open-source functions with comments found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in Java.
Data Splits
Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is a collection of Java class classification datasets (i.e., classify a class into one of a set of categories), collected for the research work 'Embedding Java Classes with code2vec: Improvements from Variable Obfuscation'. These are shared for further research in static code analysis tasks (malware classification, author attribution, etc.).
Obfuscation & Pipeline Code: Download
code2vec Models: Download
License: CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
{{language}}_programms_{{split}}.tfrecord: programs for unsupervised pretraining for the Java and Python languages, divided into train, valid, and test splits.
Keys: code (the source code) and language (the language name).
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Context
The dataset tabulates the population of Java by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Java across both sexes and to determine which sex constitutes the majority.
Key observations
There is a considerable female majority, with 65.66% of the total population being female. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported from the Census Bureau.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Java Population by Race & Ethnicity. You can refer to it here.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Context
The dataset tabulates the population of Java town by gender across 18 age groups. It lists the male and female population in each age group, along with the gender ratio, for Java town. The dataset can be utilized to understand the population distribution of Java town by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Java town. Additionally, it can be used to see how the gender ratio changes from birth to the most senior age group, and the male-to-female ratio across each age group, for Java town.
Key observations
Largest age group (population): Male # 40-44 years (139) | Female # 65-69 years (126). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Java town Population by Gender. You can refer to it here.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Machine learning in Java : helpful techniques to design, build, and deploy powerful machine learning applications in Java. It features 7 columns including author, publication date, language, and book publisher.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the book series is Java language reference. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Java 2 in plain English. It features 7 columns including author, publication date, language, and book publisher.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Java Apple Leaf Dataset is a dataset for classification tasks; it contains annotations for 1,102 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: GPL 3.0 (https://www.gnu.org/licenses/gpl-3.0-standalone.html)
This dataset contains 1,070 open-source Java projects including test doubles. The projects were mined from GitHub. The starting point for building the dataset included all projects whose main language is Java and that had at least five stars as of October 29, 2023. This set of projects is listed in java_repositories_with_five_stars.txt. The 1,070 projects comprising this dataset use Maven as their build system, contain JUnit tests, and use Mockito to create test doubles. The projects are available in the project.zip archive file. The dataset also contains metadata about the projects, which is available in the projects.json file. The metadata describes the characteristics of each project together with the test double definitions, stubbings, and verifications inside the project. Finally, we also make available the source code used to build DataTD for future research on using and extending the dataset.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a treasure trove of information on over 55 million open-source Java files, providing technical-debt-related insights that can be used to inform a range of research and analytical activities. Every file captured in the dataset is assigned an MD5 hash to ensure unique identification, along with key metrics including its technical debt probability, fan-in/fan-out levels, total methods and variables, lines of code and comment lines, and the number of occurrences recorded.
These data points can each provide important guidance into the magnitude and scope of technical debt in open source Java software development projects. Researchers can analyse correlations between their technical debt probability and levels of fan-in/fan-out as well as variables such as methods created & number of lines written. Meanwhile analysts are enabled to identify files with high impacts on code quality through comparing their joint location in both technical debt probability rankings and highest occurrence rankings.
Utilizing this comprehensive dataset opens up opportunities for a wide range of investigations that seek to unlock greater understanding of the complex relationships between software development practices and code quality. It presents an invaluable resource for anyone looking to gain key insights into this subject matter, turning questions into answers via exploration.
How to use this dataset:
The dataset contains several columns with different pieces of information, including file_md5 (a unique identifier for each file), td_probability (the probability that the file contains technical debt), fanin (the number of incoming dependencies for the file), fanout (the number of outgoing dependencies for the file), total methods and variables, and total lines of code and comment lines. Researchers or analysts may perform statistical analysis on these parameters to get an overall idea of the impact these values have on code quality. They may also find correlations between certain values, such as the fan-in/fan-out ratio, and sums or averages of the methods and variables used in a particular set of files. Finally, the occurrences column records how many times a particular MD5 hash has been used in open-source repositories; this can help identify particularly well-received files that are widely used across multiple platforms.
By examining these columns together, you can gain insight into trends related to technical debt in open-source Java programs and identify areas of potential risk in your own projects. With enough data manipulation, you may even make predictions about future implementations based on past experience.
- Correlating technical debt probability and lines of code or variables to determine how additional code complexity impacts the magnitude of technical debt.
- Identifying files with a high probability of technical debt which have been used in multiple projects, so that those files may be improved to help future projects.
- Analyzing the average fan-in and fan-out for different programming paradigms, such as MVC, to determine if any design patterns produce higher degrees of technical debt than other paradigms or architectures.
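The first analysis idea above, correlating td_probability with code-size metrics, can be sketched with the standard library alone (the CSV sample below is made up; only the column names follow the description):

```python
import csv
import io
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A tiny in-memory stand-in for TD_of_55M_files.csv; the column names follow
# the description above, but the values are invented for illustration.
sample_csv = """file_md5,td_probability,fanin,fanout
a1,0.9,12,30
b2,0.2,3,4
c3,0.7,9,20
d4,0.1,1,2
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
td = [float(r["td_probability"]) for r in rows]
fanout = [float(r["fanout"]) for r in rows]
corr = pearson(td, fanout)
```

On the real 55-million-row file you would stream the CSV from disk rather than holding it in memory, but the correlation step is the same.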
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: TD_of_55M_files.csv

| Column name | Description |
|:------------|:------------|
| file_md5 | A unique identifier for each file that can also be used to track them across repositories or other sources. (String) |
| td_probability | The p... |
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
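The three feature-combination strategies compared above can be illustrated with a simple element-wise sketch (a hypothetical illustration of the operators, not the paper's actual implementation; the feature values are invented):

```python
def combine_linear(ast_vec, ir_vec, alpha=0.5):
    """Weighted element-wise sum of AST and IR feature vectors."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(ast_vec, ir_vec)]

def combine_product(ast_vec, ir_vec):
    """Element-wise multiplication of the two vectors."""
    return [a * b for a, b in zip(ast_vec, ir_vec)]

def combine_distance(ast_vec, ir_vec):
    """Element-wise absolute difference of the two vectors."""
    return [abs(a - b) for a, b in zip(ast_vec, ir_vec)]

# Hypothetical per-fragment feature vectors
ast_features = [0.2, 0.8, 0.5]
ir_features = [0.4, 0.6, 0.1]
fused = combine_linear(ast_features, ir_features)
```

The fused vector would then be fed to a classifier such as LightGBM; per the findings above, the linear combination preserves information from both representations better than multiplication or distance.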
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book is The Java tutorial : object-oriented programming for the Internet. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The GitHub Java Corpus is a snapshot of all open-source Java code on GitHub in October 2012 that is contained in open-source projects that at the time had at least one fork. It contains code from 14,785 projects amounting to about 352 million lines of code. The dataset has been used to study coding practice in Java at a large scale.