Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Minimum Euclidean distance between real and synthetic data generated by SMOTE, WGAN-GP* and CA-GAN. No method generates exact copies of the real data.
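The check described in this caption can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used to produce the figure; the function name `min_distance_to_real` and the toy rows are assumptions made for the example, and one synthetic row is deliberately an exact copy to show how a copy would be detected.

```python
import math

def min_distance_to_real(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy data (hypothetical values, for illustration only).
real = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
synthetic = [(0.0, 1.0), (2.0, 0.5), (1.0, 1.0)]

distances = min_distance_to_real(synthetic, real)
overall_min = min(distances)

# A distance of exactly 0 means the synthetic row duplicates a real row.
exact_copies = [s for s, d in zip(synthetic, distances) if d == 0.0]
```

A strictly positive `overall_min` for a generator would correspond to the caption's statement that no exact copies of real rows were produced.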
The Internet of Things (IoT) gives a variety of heterogeneous devices network connectivity, through various network architectures, so that they can gather and exchange real-time information. However, the rise of IoT also creates security threats such as Distributed Denial of Service (DDoS) attacks. The recent Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions than conventional networking approaches. Even so, limited computing resources and heterogeneous network protocols remain major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study employs an improved feature selection (FS) technique that determines the most relevant features, with the aim of improving the detection rate and reducing the training time. First, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was applied. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k features common to the sets nominated by SFE and MI were extracted. Further, Principal Component Analysis (PCA) is employed to address multicollinearity in the dataset. Comprehensive experiments were conducted on two benchmark datasets, KDDCup99 and CIC IoT-2023. For classification, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results show that the proposed SMOTE-ENN+SFMI+PCA pipeline with the RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
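The idea of keeping only the features nominated by both selectors can be illustrated with a small sketch. This is not the authors' SFMI implementation: the mutual information estimate below works only for discrete values, the sequential-selection result is supplied as a hypothetical hand-written list (`sfe_nominated`), and the feature names and toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def top_k_common(ranking_a, ranking_b, k):
    """Features nominated by both rankings, in ranking_a order, capped at k."""
    nominated_b = set(ranking_b)
    return [f for f in ranking_a if f in nominated_b][:k]

# Toy data (hypothetical): three discrete features and a binary label.
y = [0, 0, 1, 1]
features = {
    "f1": [0, 0, 1, 1],  # identical to the label
    "f2": [0, 1, 0, 1],  # independent of the label
    "f3": [0, 0, 0, 1],  # partially informative
}

mi_scores = {name: mutual_information(col, y) for name, col in features.items()}
mi_ranking = sorted(features, key=lambda name: mi_scores[name], reverse=True)

# Assumed output of a sequential selector; in practice this would come from
# a wrapper method evaluated with a classifier.
sfe_nominated = ["f3", "f1"]

selected = top_k_common(mi_ranking[:2], sfe_nominated, k=2)
```

Intersecting the two nominations keeps a feature only when both a filter-style score and a wrapper-style search agree on it, which is one way to read "top k common features".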
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research on detecting SATD has focused on either identifying SATD (classifying instances as SATD or not) or categorizing SATD (labeling SATD instances as pertaining to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly when dealing with specific types of SATD, such as test and requirement debt. This is mostly because the datasets used are extremely imbalanced.
In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.
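The control flow of the two-step approach can be sketched as below. The two keyword-based functions are hypothetical stand-ins for the trained BiLSTM and BERT models, which are not reproduced here; the label strings follow the class names used in the dataset description.

```python
def two_step_label(text, identify, categorize):
    """Step 1: binary SATD identification; step 2: categorize only SATD instances."""
    if not identify(text):
        return "Not-SATD"
    return categorize(text)

# Hypothetical keyword-based stand-ins for the trained models.
def toy_identify(text):
    return any(k in text.lower() for k in ("todo", "fixme", "hack"))

def toy_categorize(text):
    return "TES" if "test" in text.lower() else "C/D"

labels = [
    two_step_label(t, toy_identify, toy_categorize)
    for t in [
        "TODO: add unit tests for the parser",
        "Computes the checksum of the payload",
        "FIXME: this hack breaks on empty input",
    ]
]
```

Separating the steps means the categorizer only ever sees instances already judged to be SATD, so it does not have to model the dominant Not-SATD class.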
Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:
In accordance with the original dataset, the dataset comprises four CSV files, one per artifact considered in this study. Each CSV file contains a text column and a class column; the class denotes the type of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), or requirement debt (REQ), or Not-SATD.
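A quick way to inspect the class balance of one of these files is sketched below. The header names `text` and `class` are assumptions based on the description above and may differ in the actual files; the in-memory sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

def class_distribution(csv_file, class_column="class"):
    """Count rows per label in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[class_column] for row in reader)

# In-memory sample with hypothetical rows; header names are assumed.
sample = io.StringIO(
    "text,class\n"
    "TODO: refactor this method,C/D\n"
    "missing javadoc here,DOC\n"
    "returns the user id,Not-SATD\n"
    "FIXME: clean up later,C/D\n"
)
counts = class_distribution(sample)
```

For a real file, the same function could be called on `open("data-augmentation-code_comments.csv", newline="", encoding="utf-8")`, assuming UTF-8 encoding.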
├── SATD Keywords
│ ├── Keywords based on Source of Artifacts
│ │ ├── Code comment.txt
│ │ ├── Commit message.txt
│ │ ├── Issue section.txt
│ │ └── Pull section.txt
│ ├── Keywords based on Types of SATD
│ │ ├── code-design debt.txt
│ │ ├── documentation debt.txt
│ │ ├── requirement debt.txt
│ │ └── test debt.txt
├── src
│ ├── bert.py
│ ├── bilstm.py
│ └── preprocessing.py
├── data-augmentation-code_comments.csv
├── data-augmentation-commit_messages.csv
├── data-augmentation-issues.csv
├── data-augmentation-pull_requests.csv
└── Supplementary Material.docx
Requirements:
Source code comment: ant, argouml, columba, emf, hibernate, jedit, jfreechart, jmeter, jruby, squirrel
Issue section: camel, chromium, gerrit, hadoop, hbase, impala, thrift
Pull section: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper
Commit message: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper
This dataset was augmented using the AugGPT technique. The original dataset can be downloaded from: https://github.com/yikun-li/satd-different-sources-data
Layer architecture stack of the applied GRU-based neural network method.