Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput technologies generate considerable amounts of data, which often require bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform-independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data that is produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expressions, or command line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks including microarray and massively parallel sequencing, i.e. identification of disease-predisposing variants in next-generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merging, reduction and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Flexibility in terms of input file handling provides long-term potential functionality in high-throughput analysis pipelines, as the program is not limited by currently existing applications and data formats. HTDP is available as Open Source software (https://github.com/pmadanecki/htdp).
https://choosealicense.com/licenses/other/
GitHub Code Dataset
Dataset Description
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.
How to use it
The GitHub Code dataset is a very large dataset, so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the… See the full description on the dataset page: https://huggingface.co/datasets/Aditya78b/codeparrot-java-all.
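A minimal sketch of that streaming workflow with the Hugging Face datasets library is shown below; the dataset identifier comes from the link above, while the split name and record fields are assumptions about the dataset schema rather than facts documented here:

```python
from datasets import load_dataset

# Stream the dataset instead of materializing ~1TB locally.
# The split name "train" is an assumption about this particular dataset.
ds = load_dataset("Aditya78b/codeparrot-java-all", split="train", streaming=True)

# Inspect the first few records; field names follow whatever schema the dataset ships with.
for i, sample in enumerate(ds):
    print(sorted(sample.keys()))
    if i == 2:
        break
```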
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to the MSR 2022 Data Showcase Track.

The datasets are available under the directory dataset. There are 4 datasets in this directory:

1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier, and a set of commit metrics that are explained in the paper are provided as features. The column buggy specifies whether or not the commit introduced any bug into the system.
2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced, to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data and applied to future data without any modification.
4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.

In addition to the dataset, we also provide the scripts with which we built the dataset. These scripts are written in Python 3.8; therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. An installation guide and more details can be found here.

The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset. More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:
1. GumTree (https://github.com/GumTreeDiff/gumtree): Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014, 313–324.
2. PyDriller (https://pydriller.readthedocs.io/en/latest/): Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.
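As a rough illustration of how these CSVs might be consumed (a hedged sketch, not part of the replication package: only the buggy label column is documented above, the remaining columns are handled generically, and the classifier choice is arbitrary):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the balanced training split described above.
train = pd.read_csv("dataset/apachejit_train.csv")

# "buggy" is the documented label; every other numeric column is treated as a
# commit-metric feature (an assumption about the file layout).
y = train["buggy"].astype(int)
X = train.drop(columns=["buggy"]).select_dtypes("number")

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("5-fold F1:", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```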
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the code for Relevance and Redundancy ranking: an efficient filter-based feature ranking framework for evaluating relevance based on multi-feature interactions and redundancy on mixed datasets. Source code is in .scala and .sbt format and metadata in .xml, all of which can be accessed and edited in standard, openly accessible text editing software. Diagrams are in the openly accessible .png format.

Supplementary_2.pdf: contains the results of experiments on multiple classifiers, along with parameter settings and a description of how KLD converges to mutual information based on its symmetry.
dataGenerator.zip: synthetic data generator inspired by the NIPS Workshop on variable and feature selection (2001), http://www.clopinet.com/isabelle/Projects/NIPS2001/
rar-mfs-master.zip: Relevance and Redundancy Framework containing an overview diagram, example datasets, source code and metadata. Details on installing and running are provided below.

Background. Feature ranking is beneficial for gaining knowledge and identifying the relevant features in a high-dimensional dataset. However, in several datasets, a few features by themselves might have small correlation with the target classes, but when combined with some other features they can be strongly correlated with the target. This means that multiple features exhibit interactions among themselves. It is necessary to rank the features based on these interactions for better analysis and classifier performance. However, evaluating these interactions on large datasets is computationally challenging. Furthermore, datasets often have features with redundant information. Using such redundant features hinders both the efficiency and the generalization capability of the classifier. The major challenge is to efficiently rank the features based on relevance and redundancy on mixed datasets. In the related publication, we propose a filter-based framework based on Relevance and Redundancy (RaR). RaR computes a single score that quantifies the feature relevance by considering interactions between features and redundancy. The top-ranked features of RaR are characterized by maximum relevance and non-redundancy. The evaluation on synthetic and real-world datasets demonstrates that our approach outperforms several state-of-the-art feature selection techniques.

# Relevance and Redundancy Framework (rar-mfs)

rar-mfs is an algorithm for feature selection and can be employed to select features from labelled data sets. The Relevance and Redundancy Framework (RaR), which is the theory behind the implementation, is a novel feature selection algorithm that
- works on large data sets (polynomial runtime),
- can handle differently typed features (e.g. nominal features and continuous features), and
- handles multivariate correlations.

## Installation
The tool is written in Scala and uses the Weka framework to load and handle data sets. You can either run it independently, providing the data as an .arff or .csv file, or you can include the algorithm as a (maven / ivy) dependency in your project. As an example data set we use heart-c.

### Project dependency
The project is published to Maven Central (link). To depend on the project use:
- maven:

      <dependency>
        <groupId>de.hpi.kddm</groupId>
        <artifactId>rar-mfs_2.11</artifactId>
        <version>1.0.2</version>
      </dependency>
- sbt:

      libraryDependencies += "de.hpi.kddm" %% "rar-mfs" % "1.0.2"
To run the algorithm use:

    import de.hpi.kddm.rar._
    // ...
    val dataSet = de.hpi.kddm.rar.Runner.loadCSVDataSet(new File("heart-c.csv"), isNormalized = false, "")
    val algorithm = new RaRSearch(
      HicsContrastPramsFA(numIterations = config.samples, maxRetries = 1, alphaFixed = config.alpha, maxInstances = 1000),
      RaRParamsFixed(k = 5, numberOfMonteCarlosFixed = 5000, parallelismFactor = 4))
    algorithm.selectFeatures(dataSet)
### Command line tool
- EITHER download the prebuilt binary, which requires only an installation of a recent Java version (>= 6):
  1. download the prebuilt jar from the releases tab (latest)
  2. run java -jar rar-mfs-1.0.2.jar --help
Using the prebuilt jar, here is an example usage:

    rar-mfs > java -jar rar-mfs-1.0.2.jar arff --samples 100 --subsetSize 5 --nonorm heart-c.arff
    Feature Ranking:
      1 - age (12)
      2 - sex (8)
      3 - cp (11)
      ...
- OR build the repository on your own:
  1. make sure sbt is installed
  2. clone repository
  3. run sbt run
Simple example using sbt directly after cloning the repository:

    rar-mfs > sbt "run arff --samples 100 --subsetSize 5 --nonorm heart-c.arff"
    Feature Ranking:
      1 - age (12)
      2 - sex (8)
      3 - cp (11)
      ...
### [Optional]
To speed up the algorithm, consider using a fast solver such as Gurobi (http://www.gurobi.com/). Install the solver and put the provided gurobi.jar into the Java classpath.

## Algorithm
### Idea
Abstract overview of the different steps of the proposed feature selection algorithm (see the "Algorithm Overview" diagram: https://github.com/tmbo/rar-mfs/blob/master/docu/images/algorithm_overview.png).

The Relevance and Redundancy ranking framework (RaR) is a method able to handle large-scale data sets and data sets with mixed features. Instead of directly selecting a subset, a feature ranking gives a more detailed overview of the relevance of the features. The method consists of a multistep approach where we
1. repeatedly sample subsets from the whole feature space and examine their relevance and redundancy: an exploration of the search space to gather more and more knowledge about the relevance and redundancy of features,
2. deduce scores for features based on the scores of the subsets, and
3. create the best possible ranking given the sampled insights.

### Parameters
| Parameter | Default value | Description |
| --------- | ------------- | ----------- |
| m - contrast iterations | 100 | Number of different slices to evaluate while comparing marginal and conditional probabilities |
| alpha - subspace slice size | 0.01 | Percentage of all instances to use as part of a slice which is used to compare distributions |
| n - sampling iterations | 1000 | Number of different subsets to select in the sampling phase |
| k - sample set size | 5 | Maximum size of the subsets to be selected in the sampling phase |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Java town population by age cohorts (Children: Under 18 years; Working population: 18-64 years; Senior population: 65 years or more). It lists the population in each age cohort group along with its percentage relative to the total population of Java town. The dataset can be utilized to understand the population distribution across children, working population and senior population for dependency ratio, housing requirements, ageing, migration patterns etc.
Key observations
The largest age group was 18 to 64 years, with a population of 1,285 (62.68% of the total population). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
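Those two figures are enough to derive a conventional age-dependency ratio (dependents per 100 working-age residents); the short sketch below does the arithmetic, with the caveat that the ratio definition is a standard demographic convention rather than a field published in this table:

```python
# Figures quoted above for Java town (ACS 2019-2023 5-Year Estimates).
working_age = 1285          # residents aged 18-64
working_age_share = 0.6268  # 62.68% of the total population

total_population = round(working_age / working_age_share)       # ~2050 residents
dependents = total_population - working_age                      # children + seniors
dependency_ratio = 100 * dependents / working_age                # dependents per 100 workers
print(total_population, dependents, round(dependency_ratio, 1))  # 2050 765 59.5
```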
Age cohorts:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Java town Population by Age. You can refer to the same here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GitHub Java Corpus is a snapshot of all open-source Java code on GitHub in October 2012 that is contained in open-source projects that at the time had at least one fork. It contains code from 14,785 projects amounting to about 352 million lines of code. The dataset has been used to study coding practice in Java at a large scale.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1 Code smell dataset

In order to create a high-quality code smell dataset, we merged five different datasets. These datasets are among the largest and most accurate and were used in our paper “Predicting Code Quality Attributes Based on Code Smells”. Various software projects were analyzed automatically and manually to collect these labels. Table 1 shows the dataset details.

Table 1. Merged datasets and their characteristics.

| Dataset | Samples | Projects | Code smells |
| ------- | ------- | -------- | ----------- |
| Palomba (2018) [1] | 40888 | 395 versions of 30 open-source projects | Large class, complex class, class data should be private, inappropriate intimacy, lazy class, middle man, refused bequest, spaghetti code, speculative generality, comments, long method, long parameter list, feature envy, message chains |
| Madeyski [2] | 32915 | 23 open-source and industrial projects | Blob, data class |
| Khomh [3] | - | 54 versions of 4 open-source projects | Anti-singleton, swiss army knife |
| Pecorelli [4] | 341 | 9 open-source projects | Blob |
| Palomba (2017) [5] | - | 6 open-source projects | Dispersed coupling, shotgun surgery |

Code smell datasets have been prepared at two levels: class and method. The class level has 15 different smells as labels and 81 software metrics as features; likewise, there are five smells and 31 metrics at the method level. This dataset contains samples of Java classes and methods. A sample can be identified by its longname, which contains the project name, package name, Java file name, class name, and method name. The quantity of each smell ranges from 40 to 11000. The total number of samples is 37517, while the number of non-smells is nearly 3 million. As a result, our dataset is the largest in the study. You can see the details in Table 2.

Table 2. The number of smells and non-smells at class and method levels.

| Level | Metrics | Smell | Samples | Total |
| ----- | ------- | ----- | ------- | ----- |
| Class | 81 | Complex class | 1265 | 23438 |
| | | Class data should be private | 1839 | |
| | | Inappropriate intimacy | 780 | |
| | | Large class | 990 | |
| | | Lazy class | 774 | |
| | | Middle man | 193 | |
| | | Refused bequest | 1985 | |
| | | Spaghetti code | 3203 | |
| | | Speculative generality | 2723 | |
| | | Blob | 988 | |
| | | Data class | 938 | |
| | | Anti-singleton | 2993 | |
| | | Swiss army knife | 4601 | |
| | | Dispersed coupling | 41 | |
| | | Shotgun surgery | 125 | |
| | | Non-smell | 40506 [3] + 8334 [5] + 296854 [1] + 43862 [2] + 55214 [4] | 444770 |
| Method | 31 | Comments | 107 | 14079 |
| | | Feature envy | 525 | |
| | | Long method | 11366 | |
| | | Long parameter list | 1983 | |
| | | Message chains | 98 | |
| | | Non-smell | 2469176 | 2469176 |

2 Quality dataset

This dataset contains over 1000 Java project instances where, for each instance, the relative frequency of 20 code smells has been extracted along with the value of eight software quality attributes. The code quality dataset contains 20 smells as features and 8 quality attributes as labels: coverageability, extendability, effectiveness, flexibility, functionality, reusability, testability, and understandability. The samples are Java projects identified by their name and version. Features are the ratio of smelly and non-smelly classes or methods in a software project. The quality attributes are normalized scores calculated by QMOOD metrics [6] and models extracted by [7], [8]. 1014 samples of small and large open-source and industrial projects are included in this dataset.

The data samples are used to train machine learning models predicting software quality attributes based on code smells.

References

[1] F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia, “A large-scale empirical study on the lifecycle of code smell co-occurrences,” Inf. Softw. Technol., vol. 99, pp. 1–10, Jul. 2018, doi: 10.1016/j.infsof.2018.02.004.
[2] L. Madeyski and T. Lewowski, “MLCQ: Industry-Relevant Code Smell Data Set,” in ACM International Conference Proceeding Series, Association for Computing Machinery, Apr. 2020, pp. 342–347, doi: 10.1145/3383219.3383264.
[3] F. Khomh, M. Di Penta, Y. G. Guéhéneuc, and G. Antoniol, “An exploratory study of the impact of antipatterns on class change- and fault-proneness,” Empir. Softw. Eng., vol. 17, no. 3, pp. 243–275, Jun. 2012, doi: 10.1007/s10664-011-9171-y.
[4] F. Pecorelli, F. Palomba, F. Khomh, and A. De Lucia, “Developer-Driven Code Smell Prioritization,” in Proceedings of the 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR 2020), pp. 220–231, 2020, doi: 10.1145/3379597.3387457.
[5] F. Palomba, M. Zanoni, F. A. Fontana, A. De Lucia, and R. Oliveto, “Smells like teen spirit: Improving bug prediction performance using the intensity of code smells,” in Proceedings of the 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Jan. 2017, pp. 244–255, doi: 10.1109/ICSME.2016.27.
[6] J. Bansiya and C. G. Davis, “A hierarchical model for object-oriented design quality assessment,” IEEE Transactions on Software Engineering, vol. 28, no. 1, pp. 4–17, Jan. 2002, doi: 10.1109/32.979986.
[7] M. Zakeri-Nasrabadi and S. Parsa, “Learning to predict test effectiveness,” International Journal of Intelligent Systems, 2021, doi: 10.1002/int.22722.
[8] M. Zakeri-Nasrabadi and S. Parsa, “Testability Prediction Dataset,” Mar. 2021, doi: 10.5281/zenodo.4650228.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CoRNStack Python Dataset
The CoRNStack Dataset, accepted to ICLR 2025, is a large-scale, high-quality training dataset specifically for code retrieval across multiple programming languages. This dataset comprises…
CoRNStack Dataset Curation
Starting with the deduplicated Stack v2, we create text-code pairs from function docstrings and their respective code. We filtered out… See the full description on the dataset page: https://huggingface.co/datasets/nomic-ai/cornstack-java-v1.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was made by downloading the top 500 most-starred Java projects from GitHub and then eliminating projects that also appear in the java-large-training and java-large-testing raw datasets*. The resulting dataset consists of 155 GitHub projects.
The repositories were downloaded and the code analyzed using code found in the following repository: https://github.com/serg-ml4se-2019/group5-deep-bugs/tree/master (more specifically, the code found in the bug_mining folder).
* java-large raw dataset can be found at https://github.com/tech-srl/code2seq/blob/master/README.md#datasets
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MigrationBench
1. 📖 Overview
🤗 MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.
The current and initial release includes Java 8 repositories with the Maven build system… See the full description on the dataset page: https://huggingface.co/datasets/AmazonScience/migration-bench-java-utg.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to different organisms present in environmental samples. Typically, analysis of microbial diversity in bioinformatics starts from pre-processing followed by clustering 16S rRNA reads into relatively fewer operational taxonomic units (OTUs). The OTUs are reliable indicators of microbial diversity and greatly accelerate the downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. The IMDG technology helps CLUSTOM-CLOUD to enhance both its capability of handling larger datasets and its computational scalability beyond its ancestor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. Under the laboratory environment, it required only ~3 hours to process a dataset of 200K reads regardless of the complexity of the human microbiome data. In turn, one million reads were processed in approximately 20, 14, and 11 hours when utilizing 20, 30, and 40 nodes on the Amazon EC2 cloud-computing environment. The running-time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is also a scalable distributed processing system. The comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm. CLUSTOM-CLOUD is written in Java and is freely available at http://clustomcloud.kopri.re.kr.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Java median household income by race. The dataset can be utilized to understand the racial distribution of Java income.
The dataset will have the following datasets when applicable
Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
Explore our comprehensive data analysis and visual representations for a deeper understanding of Java median household income by race. You can refer to the same here.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
| Language   | Number of Samples |
| ---------- | ----------------- |
| Java       | 153,119           |
| Ruby       | 233,710           |
| Go         | 137,998           |
| JavaScript | 373,598           |
| Python     | 472,469           |
| PHP        | 294,394           |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
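A hypothetical sketch of the training setup described above (not the authors' code): AST-derived and IR-derived feature vectors are combined linearly and fed to a LightGBM classifier; the feature dimensions, the equal 0.5/0.5 weights, and the random stand-in data are placeholders.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pairs, dim = 1000, 64
ast_features = rng.normal(size=(n_pairs, dim))   # stand-in for AST-based features
ir_features = rng.normal(size=(n_pairs, dim))    # stand-in for Soot IR-based features
labels = rng.integers(0, 2, size=n_pairs)        # 1 = clone pair, 0 = non-clone pair

# "Linear combination" is read here as a weighted sum of the two representations;
# the equal weights are an assumption, not a value from the paper.
X = 0.5 * ast_features + 0.5 * ir_features

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```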
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains all the data supporting the findings of the research article: Large-scale Characterization of Java Streams, including the list of all analyzed projects and their metadata. The repository contains two files as follows:
The data is provided as a database created with PostgreSQL and can be restored to an installation of PostgreSQL (version 12 or later) by following the procedure below:
A. Using pgAdmin (https://www.pgadmin.org/):
Enter to the "Restore Dialog" (https://www.pgadmin.org/docs/pgadmin4/development/restore_dialog.html).
Select as "Format" the "Custom or tar" option.
When selecting the "Filename", browse to the folder where this archive was decompressed and select the file "database.tar".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifact repository for the paper Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code, accepted at ICSE 2024, Lisbon, Portugal. Authors are Rangeet Pan*, Ali Reza Ibrahimzada*, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.
Install
This repository contains the source code for reproducing the results in our paper. Please start by cloning this repository:
git clone https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical
We recommend using a virtual environment for running the scripts. Please download conda 23.11.0 from this link. You can create a virtual environment using the following command:
conda create -n plempirical python=3.10.13
After creating the virtual environment, you can activate it using the following command:
conda activate plempirical
You can run the following command to make sure that you are using the correct version of Python:
python3 --version && pip3 --version
Dependencies
To install all software dependencies, please execute the following command:
pip3 install -r requirements.txt
As for hardware dependencies, we used 16 NVIDIA A100 GPUs with 80GB of memory each for running model inference. The models can be run on any combination of GPUs as long as the reader can properly distribute the model weights across the GPUs. We did not perform weight distribution since we had enough memory (80 GB) per GPU.
Moreover, for compiling and testing the generated translations, we used Python 3.10, g++ 11, GCC Clang 14.0, Java 11, Go 1.20, Rust 1.73, and .Net 7.0.14 for Python, C++, C, Java, Go, Rust, and C#, respectively. Overall, we recommend using a machine with Linux OS and at least 32GB of RAM for running the scripts.
For running scripts of alternative approaches, you need to make sure you have installed C2Rust, CxGO, and Java2C# on your machine. Please refer to their repositories for installation instructions. For Java2C#, you need to create a .csproj file like below:
    <Project Sdk="Microsoft.NET.Sdk">
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net7.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
      </PropertyGroup>
    </Project>
Dataset
We uploaded the dataset we used in our empirical study to Zenodo. The dataset is organized as follows:
CodeNet
AVATAR
Evalplus
Apache Commons-CLI
Click
Please download and unzip the dataset.zip file from Zenodo. After unzipping, you should see the following directory structure:
PLTranslationEmpirical
├── dataset
│   ├── codenet
│   ├── avatar
│   ├── evalplus
│   └── real-life-cli
├── ...
The structure of each dataset is as follows:
CodeNet & Avatar: Each directory in these datasets correspond to a source language where each include two directories Code and TestCases for code snippets and test cases, respectively. Each code snippet has an id in the filename, where the id is used as a prefix for test I/O files.
Evalplus: The source language code snippets follow a similar structure as CodeNet and Avatar. However, as a one-time effort, we manually created the test cases in the target Java language inside a Maven project, evalplus_java. To evaluate the translations from an LLM, we recommend moving the generated Java code snippets to the src/main/java directory of the Maven project and then running the command mvn clean test surefire-report:report -Dmaven.test.failure.ignore=true to compile, test, and generate reports for the translations.
Real-life Projects: The real-life-cli directory represents two real-life CLI projects from Java and Python. These datasets only contain code snippets as files and no test cases. As mentioned in the paper, the authors manually evaluated the translations for these datasets.
Scripts
We provide bash scripts for reproducing our results in this work. First, we discuss the translation script. To perform translation with a model and dataset, first you need to create a .env file in the repository and add the following:
OPENAI_API_KEY=
LLAMA2_AUTH_TOKEN=
STARCODER_AUTH_TOKEN=
For example, you can run the following command to translate all Python -> Java code snippets in the codenet dataset with GPT-4 while top-k sampling is k=50, top-p sampling is p=0.95, and temperature=0.7 on GPU gpu_id=0:

bash scripts/translate.sh GPT-4 codenet Python Java 50 0.95 0.7 0
To use CodeGeeX, the downloaded model weights need to be placed inside the repository as follows:

PLTranslationEmpirical
├── dataset
│   ├── codenet
│   ├── avatar
│   ├── evalplus
│   └── real-life-cli
├── CodeGeeX
│   └── codegeex
│       └── codegeex_13b.pt # this file is the model weight
├── ...
You can run the following command to translate all Python -> Java code snippets in codenet dataset with the CodeGeeX while top-k sampling is k=50, top-p sampling is p=0.95, and temperature=0.2 on GPU gpu_id=0:
bash scripts/translate.sh CodeGeeX codenet Python Java 50 0.95 0.2 0
The analogous command for StarCoder:

bash scripts/translate.sh StarCoder codenet Python Java 50 0.95 0.2 0
To run the alternative transpiler-based approaches (C2Rust, CxGO, Java2C#):

bash scripts/translate_transpiler.sh codenet C Rust c2rust fix_report
bash scripts/translate_transpiler.sh codenet C Go cxgo fix_reports
bash scripts/translate_transpiler.sh codenet Java C# java2c# fix_reports
bash scripts/translate_transpiler.sh avatar Java C# java2c# fix_reports
To compile and test the generated translations on each dataset:

bash scripts/test_avatar.sh Python Java GPT-4 fix_reports 1
bash scripts/test_codenet.sh Python Java GPT-4 fix_reports 1
bash scripts/test_evalplus.sh Python Java GPT-4 fix_reports 1
To run the repair step for compile, runtime, and incorrect-output errors:

bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 compile
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 runtime
bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 incorrect
To clean the generations of a model on a dataset:

bash scripts/clean_generations.sh StarCoder codenet
Please note that for the above commands, you can change the dataset and model name to execute the same thing for other datasets and models. Moreover, you can refer to /prompts for different vanilla and repair prompts used in our study.
Artifacts
Please download the artifacts.zip file from our Zenodo repository. We have organized the artifacts as follows:
RQ1 - Translations: This directory contains the translations from all LLMs and for all datasets. We have added an excel file to show a detailed breakdown of the translation results.
RQ2 - Manual Labeling: This directory contains an excel file which includes the manual labeling results for all translation bugs.
RQ3 - Alternative Approaches: This directory contains the translations from all alternative approaches (i.e., C2Rust, CxGO, Java2C#). We have added an excel file to show a detailed breakdown of the translation results.
RQ4 - Mitigating Translation Bugs: This directory contains the fix results of GPT-4, StarCoder, CodeGen, and Llama 2. We have added an excel file to show a detailed breakdown of the fix results.
Contact
We look forward to hearing your feedback. Please contact Rangeet Pan or Ali Reza Ibrahimzada for any questions or comments 🙏.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
End-to-end (E2E) testing is a software validation approach that simulates realistic user scenarios throughout the entire workflow of an application. In the context of web
applications, E2E testing involves two activities: Graphic User Interface (GUI) testing, which simulates user interactions with the web app’s GUI through web browsers, and performance testing, which evaluates system workload handling. Despite its recognized importance in delivering high-quality web applications, the availability of large-scale datasets featuring real-world E2E web tests remains limited, hindering research in the field.
To address this gap, we present E2EGit, a comprehensive dataset of non-trivial open-source web projects collected on GitHub that adopt E2E testing. By analyzing over 5,000 web repositories across popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, and PYTHON), we identified 472 repositories implementing 43,670 automated Web GUI tests with popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), and 84 repositories that featured 271 automated performance tests implemented leveraging the most popular open-source tools (JMETER, LOCUST). Among these, 13 repositories implemented both types of testing for a total of 786 Web GUI tests and 61 performance tests.
DATASET DESCRIPTION
The dataset is provided as an SQLite database, whose structure is illustrated in Figure 3 (in the paper), which consists of five tables, each serving a specific purpose.
The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics. The
non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks, with the corresponding fields (e.g., is web java) set to true, and the field web dependencies listing the detected web frameworks. For Web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file, providing the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, aggregating data from the previous table at the repository level. Each of the 472 repositories has a row summarizing
the number of test files using frameworks like SELENIUM or PLAYWRIGHT, test engines like JUNIT, and the total number of tests identified. For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Notably, some fields may be absent—for instance, if external files (e.g., CSVs defining workloads) were unavailable, or in the case of Locust tests, where parameters like duration and concurrent users are specified via the command line.
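For readers who want to slice the data themselves, a hedged Python sqlite3 sketch is shown below; the table names come from the description above, but the database file name and the column names used in the queries (framework, repo_id) are assumptions that should be checked against the released schema.

```python
import sqlite3

# Path to the downloaded SQLite database (file name assumed).
conn = sqlite3.connect("e2egit.sqlite")

# Count GUI test files per browser-automation framework (column name assumed).
for framework, n_files in conn.execute(
    "SELECT framework, COUNT(*) FROM gui_testing_test_details GROUP BY framework"
):
    print(framework, n_files)

# Repositories doing both GUI and performance testing could be found by joining
# gui_testing_repo_details with performance_testing_test_details on the repository
# identifier (here called repo_id, also an assumption).
rows = conn.execute(
    "SELECT DISTINCT g.repo_id FROM gui_testing_repo_details g "
    "JOIN performance_testing_test_details p ON p.repo_id = g.repo_id"
).fetchall()
print(len(rows), "repositories with both test types")
conn.close()
```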
To cite this article, refer to the following citation:
@inproceedings{di2025e2egit,
title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
pages={10--15},
year={2025},
organization={IEEE/ACM}
}
This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.
https://dataintelo.com/privacy-and-policy
The global market size for Java Development Services in 2023 is projected to be approximately USD 8.5 billion, growing to an estimated USD 18.7 billion by 2032, at a CAGR of 9.1% during the forecast period. The growth in this market is primarily driven by the increasing demand for robust, scalable, and secure enterprise applications across various industries globally. Factors such as digital transformation, the rise of e-commerce, and the growing prevalence of mobile and web applications contribute significantly to this market's expansion.
One of the key growth factors for the Java Development Services market is the widespread adoption of digital transformation initiatives across businesses of all sizes. Companies are increasingly investing in creating or upgrading their digital infrastructure to stay competitive in a rapidly evolving market landscape. Java, being a versatile and powerful programming language, is often the preferred choice for developing enterprise-grade applications that require high reliability, performance, and scalability. With organizations striving to enhance their digital capabilities, the demand for Java development services has seen a substantial rise.
Another crucial growth driver is the burgeoning e-commerce sector, which relies heavily on robust backend systems to handle large volumes of transactions and user data. Java's strong security features and its ability to support complex, high-traffic applications make it an ideal choice for e-commerce platforms. As online retail continues to grow at an unprecedented rate, the necessity for customized, scalable, and secure Java-based solutions is expected to propel the market further. Additionally, the trend of leveraging big data analytics and artificial intelligence within e-commerce frameworks further amplifies the demand for proficient Java development.
The proliferation of mobile and web applications has also significantly contributed to the increased demand for Java Development Services. With the surge in smartphone usage and internet penetration, businesses are aiming to offer seamless user experiences across multiple platforms. Java offers cross-platform capabilities, making it highly suitable for developing both mobile and web applications. Furthermore, the rise of cloud computing and the increasing preference for deploying applications on the cloud have accentuated the need for Java developers who can create and manage robust cloud-based applications.
From a regional perspective, the Asia Pacific region stands out as a significant growth area for Java Development Services, driven by the rapid digitalization in emerging economies such as China, India, and Southeast Asian countries. The presence of a large pool of skilled developers and cost-effective development services further boosts the region’s attractiveness. North America and Europe also represent substantial markets due to the advanced technological landscape and the high adoption rates of digital transformation initiatives. Meanwhile, Latin America and the Middle East & Africa are expected to witness moderate growth, propelled by increasing investments in technology infrastructure.
Custom Application Development is a critical segment within the Java Development Services market, addressing the unique needs of businesses that require tailored software solutions. This segment has been witnessing robust growth, driven by enterprises seeking bespoke applications that cater to specific business processes. The flexibility and scalability of Java make it a prime choice for developing customized applications that can evolve with the business needs. Organizations across various sectors, including BFSI, healthcare, and retail, often prefer custom Java applications to ensure seamless integration with their existing systems and to achieve a competitive edge through unique functionalities.
The demand for Custom Application Development is further augmented by the increasing complexity of business operations and the need for enhanced operational efficiency. Java’s robust ecosystem, which includes a vast array of frameworks, libraries, and tools, enables developers to deliver high-quality, custom solutions efficiently. This capability is particularly valuable in industries such as manufacturing and IT, where businesses require sophisticated applications to manage complex workflows and large datasets. The ability to create applications that are both highly functional and user-friendly is a significant driver for this segment.
This dataset contains 1 million Chinese programming questions with corresponding answers, detailed parses (explanations), and programming language labels. It includes a wide range of questions in C, C++, Python, Java, and JavaScript, making it ideal for training large language models (LLMs) on multilingual code understanding and generation. The questions cover fundamental to advanced topics, supporting AI applications such as code completion, bug fixing, and programming reasoning. This structured dataset enhances model performance in natural language programming tasks and helps reinforce code logic skills in AI systems. All data complies with international privacy regulations including GDPR, CCPA, and PIPL.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In finance, leverage is the ratio between assets borrowed from others and one's own assets. A matching situation is present in software: by using free open-source software (FOSS) libraries, a developer leverages other people's code to multiply the offered functionality with a much smaller own codebase. In finance as in software, leverage magnifies profits when returns from borrowing exceed the costs of integration, but it may also magnify losses, in particular in the presence of security vulnerabilities. We aim to understand the level of technical leverage in the FOSS ecosystem and whether it can be a potential source of security vulnerabilities. We also introduce two metrics, change distance and change direction, to capture the amount and the evolution of the dependency on third-party libraries. Our analysis published in [1] shows that small and medium libraries (less than 100 KLoC) have disproportionately more leverage on FOSS dependencies in comparison to large libraries. We show that leverage pays off, as leveraged libraries only add a 4% delay in the time interval between library releases while providing four times more code than their own. However, libraries with such leverage (i.e., 75% of the libraries in our sample) also have 1.6 times higher odds of being vulnerable in comparison to libraries with lower leverage.
This dataset is the original dataset used in the publication [1]. It includes 8494 distinct library versions from FOSS Maven-based Java libraries. An online demo for computing the proposed metrics for real-world software libraries is also available at the following URL: https://techleverage.eu/.
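The core notion above, technical leverage as the ratio between borrowed and own code, can be illustrated with a small sketch; the function name and the KLoC figures are purely illustrative, not values taken from the dataset.

```python
def technical_leverage(own_kloc: float, dependency_klocs: list[float]) -> float:
    """Ratio between code borrowed from third-party dependencies and a library's own code."""
    return sum(dependency_klocs) / own_kloc

# A small library (< 100 KLoC) pulling in three dependencies: it ships roughly
# four times more borrowed code than its own, in line with the narrative above.
print(technical_leverage(own_kloc=25.0, dependency_klocs=[40.0, 35.0, 30.0]))  # 4.2
```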
The original publication is [1]. An executive summary of the results is available as the publication [2]. This work has been funded by the European Union within the project AssureMOSS (https://www.assuremoss.eu).
[1] Massacci, F., & Pashchenko, I. (2021, May). Technical leverage in a software ecosystem: Development opportunities and security risks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 1386-1397). IEEE.
[2] Massacci, F., & Pashchenko, I. (2021). Technical Leverage: Dependencies Are a Mixed Blessing. IEEE Secur. Priv., 19(3), 58-62.