Introduction
This is the replication package for the paper "Automated Unit Test Generation via Chain of Thought Prompt and Reinforcement Learning".

Organization of the Replication Package
- checkpoints.zip: fine-tuned models, including TestCTRL, TestCT, TestCT-no-cot, TestCT-intention, TestCT-input, TestCT-ti, CodeBERT-line, CodeT5-line, CodeGPT-line, CodeBERT-branch, CodeT5-branch, and CodeGPT-branch.
- dataset.zip: datasets for fine-tuning and reinforcement learning, including the CoT dataset, the reward dataset (reward folder), and the dataset for PPO optimization (rl folder).
- evaluation.zip: scripts for evaluating the generated tests, including CodeBLEU, syntactic correctness rate, compilation passing rate, line coverage rate, and branch coverage rate (see the sketch after this list).
- finetune.zip: scripts and configs for fine-tuning large language models for test generation.
- generated_test_result.zip: the generated tests.
- pretrain.zip: pre-trained models, including CodeLlama, CodeBERT, and CodeT5.
- CoT_quality.zip: an example of evaluating CoT prompts.
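As a rough illustration of the syntactic-correctness metric named above, the sketch below parses generated Java test files and reports the fraction that parse cleanly. It is a minimal stand-in under stated assumptions: the tests are Java files extracted from generated_test_result.zip into a generated_test_result/ directory, and the javalang package is available. The scripts shipped in evaluation.zip may compute the metric differently.

```python
# Minimal sketch of a syntactic-correctness check over generated Java tests.
# Assumptions: the tests from generated_test_result.zip are extracted to
# generated_test_result/ as .java files, and the `javalang` package is installed;
# the scripts in evaluation.zip may compute the metric differently.
from pathlib import Path

import javalang

test_files = sorted(Path("generated_test_result").rglob("*.java"))
parsed_ok = 0

for path in test_files:
    source = path.read_text(encoding="utf-8", errors="ignore")
    try:
        javalang.parse.parse(source)  # raises a syntax/lexer error on invalid code
        parsed_ok += 1
    except Exception:
        pass  # count the file as syntactically incorrect

if test_files:
    print(f"Syntactic correctness rate: {parsed_ok / len(test_files):.2%} "
          f"({parsed_ok}/{len(test_files)})")
else:
    print("No generated test files found.")
```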
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Overview
Context: A large number of bug reports are submitted to software projects on a daily basis, and they are examined manually by practitioners. However, most of these bug reports are not related to errors in the codebase (invalid bug reports) and cause a great waste of time and effort. Previous research has used various machine learning and language model based techniques to tackle this problem through automatic classification of bug reports. There remains a gap, however, in classifying bug reports with large language models (LLMs) and in techniques for improving LLM performance on this binary classification task.

Objective: The aim of this study is to apply various machine learning and natural language processing methods to classify bug reports as valid or invalid, and then to supply the predictions of these models to an LLM judge, along with similar bug reports and their labels, so that it can make an informed prediction.

Method: We first retrieved 10,000 real-world Firefox bug reports via the Bugzilla API and divided them randomly into a training set and a test set. We trained three traditional classifiers (Naive Bayes, Random Forest, and SVM) on the training set and fine-tuned five BERT-based models for classification. In parallel, we prompted LLMs with in-context examples supplied by a Retrieval Augmented Generation (RAG) system built over the training set, and compared the results with zero-shot prompting. We then picked the best-performing LLM and used it as the judge LLM, providing it with the votes of the three diverse ML models and the top-5 semantically similar bug reports. We compared the performance of the judge LLM with the majority vote of the three chosen models (a minimal sketch of this baseline follows the overview).

Results: The classical ML pipelines (Naive Bayes, Random Forest, and Linear SVM) were trained on TF-IDF features and achieved strong binary $F_1$ scores of 0.881, 0.894, and 0.878, respectively. Our suite of five fine-tuned BERT-based classifiers further improved performance, with $F_1$ scores of 0.899 for BERT base, 0.909 for RoBERTa, 0.902 for CodeBERT, 0.902 for CodeBERT-Graph, and 0.899 for the 128-token BERT variant. In contrast, zero-shot LLM classification without retrieval saw $F_1$ scores ranging from 0.531 (GPT-4o-mini) to 0.737 (Llama-3.3-70B), highlighting the gap between out-of-the-box LLMs and specialized models. Introducing RAG-based few-shot prompting closed much of that gap, lifting LLM $F_1$ scores to 0.815 for GPT-o3-mini, 0.759 for GPT-o4-mini, and 0.797 for Llama, while GPT-4o-mini reached 0.729. Finally, our hybrid judge pipeline, combining the top-5 similar bug reports, votes from RF, SVM, and RoBERTa, and reasoning by GPT-o3-mini, yielded an $F_1$ of 0.871, striking a balance between raw accuracy and human-readable explanations.

Conclusions: Our evaluation confirms that specialized, fine-tuned classifiers, particularly RoBERTa and the CodeBERT variants, remain the most cost-effective and highest-accuracy solutions for binary bug-report triage. Using RAG with LLMs substantially boosts classification over zero-shot baselines, although the scores do not surpass our top fine-tuned models. Nevertheless, their natural-language rationales and actionable suggestions offer an explainability advantage that static classifiers cannot match. In practice, a hybrid ensemble where fast, accurate classifiers handle the bulk of cases and an LLM judge provides human-readable justification for edge cases appears to strike the best balance between performance, cost, and transparency.
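To make the classical baseline concrete, here is a minimal sketch of the TF-IDF pipelines and their hard majority vote, assuming scikit-learn and the two-column training.csv (text, label) described under Project Structure; the split, hyperparameters, and label encoding are assumptions and may differ from the actual notebooks.

```python
# Minimal sketch of the TF-IDF baseline and the hard majority vote (not the
# exact notebook code). Assumes training.csv has the text and label columns
# described under Project Structure; split and hyperparameters are defaults.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("training.csv")  # columns: text, label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Each classifier gets its own TF-IDF features; the ensemble takes a hard
# (majority) vote over their predicted labels.
ensemble = VotingClassifier(
    estimators=[
        ("nb", make_pipeline(TfidfVectorizer(), MultinomialNB())),
        ("rf", make_pipeline(TfidfVectorizer(), RandomForestClassifier())),
        ("svm", make_pipeline(TfidfVectorizer(), LinearSVC())),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

# Macro-F1 as a stand-in; the paper reports binary F1 for the valid class,
# which needs pos_label set to match the actual label encoding.
print("Majority-vote F1:", f1_score(y_test, ensemble.predict(X_test), average="macro"))
```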
Research Questions
The study addresses the following research questions:
- RQ1: How do reasoning LLMs perform compared to prior ML and LLM classifiers when supported with few-shot prompting?
- RQ2: What is the effect of integrating an LLM judge into a majority-voting ensemble versus using voting alone?
- RQ3: How does few-shot prompting with RAG impact LLM performance relative to zero-shot prompting?

Project Structure
The replication package consists of four folders. The data folder contains two files: training.csv and bug_reports.csv. training.csv contains only the two columns text and label, and it is the version we used when training the models. bug_reports.csv contains the columns and IDs we retrieved from Bugzilla. The code folder contains the .ipynb files we used for creating the dataset, training the ML models, fine-tuning the BERT-based models, and getting predictions from the LLMs in both zero-shot and few-shot settings. The notebook for the unified pipeline is also included. The preds folder contains the predictions, explanations, and original labels of the bug reports from the test set. The metrics folder contains a single file with the metrics as JSON objects for all eighteen model configurations we tested.

Instructions for Replication
This replication package has the following folder structure:
- code
- data
- preds
- metrics

The code folder keeps the code we used to create and preprocess the dataset. It also includes the code we used to train and evaluate the different models in the study.
- bert_models.ipynb contains the code for the five BERT-based models we fine-tuned, along with code to print detailed scores for ease of evaluation.
- create_dataset.ipynb contains the code we used to create the dataset via the Bugzilla API.
- gpt_4o_mini.ipynb, gpt_o3_mini.ipynb, gpt_o4_mini.ipynb, and llama_70b.ipynb contain the code used to test the models named in the file names with zero-shot and few-shot configurations (a sketch of the top-5 retrieval step behind the few-shot prompts follows this section).

The data folder contains the dataset we collected and used in the study.
- bug_reports.csv contains all the information we obtained, whether or not it was used in model training.
- training.csv is the version of the dataset we used for model training. It contains only the text and label columns, which simplifies the training process.

The preds folder contains all predictions and explanations from the language models and the hybrid judge pipeline.
The metrics folder contains metrics.json, which includes all the model metrics as reported in the paper. This file is also handy for comparing the results of different models.
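To illustrate the retrieval step behind the few-shot prompts referenced above, here is a minimal sketch that picks the five most similar labeled bug reports for a new report using TF-IDF cosine similarity over training.csv. TF-IDF is an assumption for this sketch; the actual notebooks may use a different retriever or embedding model.

```python
# Minimal sketch of retrieving the top-5 most similar labeled bug reports
# to use as in-context examples. TF-IDF cosine similarity is an assumption;
# the actual notebooks may use a different retriever or embedding model.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_df = pd.read_csv("training.csv")  # columns: text, label

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_df["text"])

def top_k_examples(new_report: str, k: int = 5) -> pd.DataFrame:
    """Return the k most similar training reports with their labels."""
    query_vec = vectorizer.transform([new_report])
    sims = cosine_similarity(query_vec, train_matrix).ravel()
    top_idx = sims.argsort()[::-1][:k]
    return train_df.iloc[top_idx][["text", "label"]]

# Example: format the retrieved examples into a few-shot prompt prefix.
examples = top_k_examples("Crash when opening a PDF attachment in the browser")
prompt_prefix = "\n\n".join(
    f"Bug report: {row.text}\nLabel: {row.label}" for row in examples.itertuples()
)
```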
For the analysis performed during the review process, we added another folder named review_analysis, which includes the scripts we used to conduct additional analyses of the differences between the Judge LLM and majority voting, and of the subsets where each approach is successful. The analysis also examines the aspects of bug reports that played an important role in their being classified as valid by the LLM, including evidence-based patterns (such as file paths, URLs, etc.) and potential biases such as text length correlation.
There are 3 Python scripts for performing the analyses, 4 detailed markdown reports documenting the findings, and 5 visualizations illustrating model comparisons, voting patterns, and bias detection results. The key findings reveal that while the Judge LLM provides detailed natural language explanations, it exhibits a systematic text length bias (r = -0.360) and underperforms compared to the ML majority vote (81.5% vs. 87.0% accuracy), though it offers superior explainability for human reviewers.
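As an illustration of the text-length bias check mentioned above, the sketch below computes a Pearson correlation between bug-report length and the judge's correctness. The file name (preds/judge_predictions.csv) and column names (text, label, prediction) are hypothetical placeholders for whatever the preds folder actually contains, and the review_analysis scripts may correlate length with a different quantity.

```python
# Minimal sketch of the text-length bias check. The file and column names below
# (judge_predictions.csv with text, label, prediction columns) are hypothetical;
# adapt them to the actual files in the preds folder.
import pandas as pd
from scipy.stats import pearsonr

preds = pd.read_csv("preds/judge_predictions.csv")

report_length = preds["text"].str.len()
is_correct = (preds["prediction"] == preds["label"]).astype(int)

r, p_value = pearsonr(report_length, is_correct)
print(f"Length vs. judge correctness: r = {r:.3f} (p = {p_value:.3g})")
```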
Previous studies that used data from Stack Overflow to develop predictive models often employed limited benchmarks of 3-5 models or adopted arbitrary selection methods. Despite being insightful, such approaches may not provide optimal results given their limited scope, suggesting the need to benchmark more models to avoid overlooking untested algorithms. Our study evaluates 21 algorithms across three tasks: predicting the number of questions a user is likely to answer, their code quality violations, and their dropout status. We employed normalisation, standardisation, and logarithmic and power transformations, paired with Bayesian hyperparameter optimisation and genetic algorithms. CodeBERT, a pre-trained language model for both natural and programming languages, was fine-tuned to classify user dropout given their posts (questions and answers) and code snippets. This replication package is provided for those interested in further examining our research methodology.
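As a rough illustration of the kind of preprocessing and Bayesian hyperparameter optimisation described here, the sketch below tunes a random forest regressor on power-transformed features with Optuna. The choice of Optuna, the estimator, the search space, and the placeholder features and target are assumptions for this sketch, not the package's actual scripts.

```python
# Minimal sketch of power transformation + Bayesian hyperparameter optimisation.
# Optuna, the random forest regressor, and the search space are assumptions;
# X and y are placeholders for the Stack Overflow features and a target such as
# the number of questions a user is likely to answer.
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=3.0, size=(500, 8))  # placeholder skewed features
y = X[:, 0] * 1.5 + rng.normal(size=500)            # placeholder target

def objective(trial: optuna.Trial) -> float:
    model = make_pipeline(
        PowerTransformer(),  # Yeo-Johnson transform to reduce skew
        RandomForestRegressor(
            n_estimators=trial.suggest_int("n_estimators", 100, 500),
            max_depth=trial.suggest_int("max_depth", 3, 20),
            min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
            random_state=0,
        ),
    )
    # Maximise cross-validated R^2 over the search space.
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("Best params:", study.best_params)
```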