MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/java in Semeru
CodeXGLUE -- Code-To-Text
Task Definition
The task is to generate natural language comments for a piece of code; the output is evaluated by smoothed BLEU-4 score.
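To make the metric concrete, here is a minimal sketch of a sentence-level BLEU-4 score with add-one smoothing. It is an illustration only, not the CodeXGLUE evaluation script: whitespace tokenization, a single reference, and add-one smoothing applied to every n-gram order are all assumptions, and the function name `smoothed_bleu4` is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing (illustrative sketch).

    Assumptions: whitespace tokenization, one reference per hypothesis,
    and add-one smoothing on every n-gram order from 1 to 4.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision_sum += math.log((matches + 1) / (total + 1))

    # Brevity penalty in log space: zero when the hypothesis is at least
    # as long as the reference, negative otherwise.
    log_brevity = min(0.0, 1.0 - len(ref) / len(hyp))
    return math.exp(log_precision_sum / 4 + log_brevity)
```

A generated comment identical to the reference scores 1.0, while a shorter or partially matching comment scores lower because of the brevity penalty and the reduced n-gram precision.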
Dataset
The dataset we use comes from CodeSearchNet, and we filter it as follows:
Remove… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-java.
The dataset consists of two versions: $X_1$ with $P_3$ and $X_1$ without $P_3$, where $P_3$ represents a set of random unchanged functions from vulnerability fixing commits. This dataset is designed for finetuning large language models to detect vulnerabilities in code. It can be used for training and evaluating models in automated vulnerability detection tasks.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
Given a piece of code, generate a description of what it does. Given a description, generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May of 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
https://www.gnu.org/licenses/gpl-3.0-standalone.html
This dataset contains 1,070 open-source Java projects that include test doubles. The projects were mined from GitHub. The dataset was built by selecting all projects whose primary language is Java and that had at least five stars as of October 29, 2023. This list of projects is available in java_repositories_with_five_stars.txt. The 1,070 projects in this dataset use Maven as their build system, contain JUnit tests, and utilize Mockito to create test doubles. The projects are available in the project.zip archive file. Additionally, the dataset includes metadata about the projects, stored in the projects.json file. This metadata describes the characteristics of each project, along with test double definitions, stubbings, and verifications. Finally, we also provide the source code used to build DataTD, enabling future research on expanding and utilizing the dataset.
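The two-stage selection described above (Java projects with at least five stars, then Maven, JUnit, and Mockito) can be sketched as a pair of predicates. This is a hypothetical illustration: the `Project` record and its field names are assumptions and do not reflect the actual schema of projects.json or the DataTD source code.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # Hypothetical fields; not the real projects.json schema.
    name: str
    primary_language: str
    stars: int
    build_system: str
    uses_junit: bool
    uses_mockito: bool


def is_candidate(project: Project) -> bool:
    """First stage: primary language is Java and at least five stars."""
    return project.primary_language == "Java" and project.stars >= 5


def is_included(project: Project) -> bool:
    """Second stage: Maven build, JUnit tests, and Mockito test doubles."""
    return (
        is_candidate(project)
        and project.build_system == "Maven"
        and project.uses_junit
        and project.uses_mockito
    )


def select(projects: list[Project]) -> list[Project]:
    """Keep only the projects that pass both stages."""
    return [p for p in projects if is_included(p)]
```

Keeping the two stages separate mirrors the description: the first predicate corresponds to the five-star repository list, and the second to the final set of projects.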
The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.
The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques. We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug fix changes. These are single statement fixes, classified where possible into one of 16 syntactic templates which we call SStuBs. The dataset contains simple statement bugs mined from open-source Java projects hosted on GitHub. There are two variants of the dataset: one mined from the 100 Java Maven Projects and one mined from the top 1000 Java Projects.
The dataset contains 153,652 single statement bugfix changes mined from 1,000 popular open-source Java projects, annotated by whether they match any of a set of 16 bug templates, inspired by state-of-the-art program repair techniques.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru
CodeXGLUE -- Code2Code Translation
Task Definition
Code translation aims to migrate legacy software from one programming language in a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of functionally equivalent Java methods.
This dataset is published as supplemental data for the following submission.
Yoshiki Higo, Shinsuke Matsumoto, Shinji Kusumoto, and Kazuya Yasuda, "Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques", submitted to MSR 2022.
This dataset includes 276 groups of functionally equivalent Java methods, which have been manually verified by the authors.
The 276 groups include 728 Java methods in total.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is Java : a framework for programming and problem solving. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the distribution of median household income among distinct age brackets of householders in Java town. Based on the latest 2019-2023 5-Year Estimates from the American Community Survey, it displays how income varies among householders of different ages in Java town. It showcases how household incomes typically rise as the head of the household gets older. The dataset can be utilized to gain insights into age-based household income trends and explore the variations in incomes across households.
Key observations: Insights from 2023
In terms of income distribution across age cohorts in Java town, the median household income stands at $105,057 for householders within the 25 to 44 years age group, followed by $103,750 for the 45 to 64 years age group. Notably, householders within the 65 years and over age group had the lowest median household income at $59,125.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023-inflation-adjusted dollars.
Age groups classifications include:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Java town median household income by age.
This is a dataset of functionally equivalent Java methods. This dataset is published as supplemental data for the following submission. MSR2022: Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques (in a double-blind manner). This dataset includes 276 groups of functionally equivalent Java methods, which have been manually verified by the authors. The 276 groups include 728 Java methods in total.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GitHub Java Corpus is a snapshot of all open-source Java code on GitHub in October 2012 that is contained in open-source projects that at the time had at least one fork. It contains code from 14,785 projects amounting to about 352 million lines of code. The dataset has been used to study coding practice in Java at a large scale.
This dataset includes data from approximately 12,000 stations that J. L. Reid and A. W. Mantyla have used in various world ocean studies. These data have been accumulated for the purpose of global ocean studies and are not intended for fine scale analyses. Each station represents the best station available for that locality at the time of the selection. The set was compiled over many years and from many sources and has been brought up to date as new data have become available. Most of the data were obtained from the National Oceanographic Data Center (NODC). The others came directly from various P.I.s in various formats and may lack some NODC parameters such as ship, country, and institution codes and NODC accession number. It should be noted that these are edited data files and an accurate account of deletions and corrections is, unfortunately, not available. In some cases these data may not agree exactly with versions published later or data supplied later by the NODC or an originator. Only stations that reach close to the bottom were chosen. This means, unfortunately, that the set is rather sparse near the equator. It is believed that the temperature and salinity measurements are acceptable. However, some of the oxygen and nutrient data are quite poor. They have not been eliminated from the data set; we simply ignore them in hand-contouring. They would have been eliminated if the troubles they cause in computer-contouring or instant atlases had been understood, but this set was begun before such methods were generally available. A few known systematic errors such as IGY oxygens or early Discovery oxygens and silicates have been adjusted, based upon deep comparisons with more modern data. In a few localities, stations have been reoccupied many times and a mean composite profile is given; at other localities, only the most recent or the best sampled profile is saved and all others deleted.
Because of the large-scale scope intended for this data array, some closely-spaced stations have been omitted. When needed, those stations can be retrieved from tapes of the entire cruise.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-java"
Dataset Summary
This dataset is the Java portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments, found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are coded in Java.
Data Splits
Train, test, validation labels are included in the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-java.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Double-checked gold standard dataset of Atoms of Confusion in Java. Data extracted from the main source code package of four open-source projects, excluding the test files. This dataset also includes a sample created from two other open-source projects.
Project | Version | Repository |
---|---|---|
FastUtil | 8.5.6 | https://github.com/vigna/fastutil |
Moshi | 1.12.0 | https://github.com/square/moshi |
Jimfs | 1.2 | https://github.com/google/jimfs |
uCrop | 2.2.7 | https://github.com/Yalantis/uCrop |
The sample was created by extracting Java files from the following projects:
Project | Version | Repository |
---|---|---|
Guava | 31.0.1 | https://github.com/google/guava |
Redisson | 3.6.16 | https://github.com/redisson/redisson |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a collection of Java class classification datasets (i.e., classify a class into one of a set of categories), collected for the research work 'Embedding Java Classes with code2vec: Improvements from Variable Obfuscation'. These are shared for further research in static code analysis tasks (malware classification, author attribution, etc.).
Obfuscation & Pipeline Code: Download
code2vec Models: Download
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Java 9 modularity : patterns and practices for developing maintainable applications. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Reactive programming with Java 9 : develop concurrent and asynchronous applications with Java 9. It features 7 columns including author, publication date, language, and book publisher.
CodeQA is a free-form question answering dataset for the purpose of source code comprehension: given a code snippet and a question, a textual answer is required to be generated. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs.
Description from: CodeQA: A Question Answering Dataset for Source Code Comprehension
It contains a dataset of class comments extracted from various projects in three programming languages: Java, Pharo, and Python.