5 datasets found

Minimum Euclidean distance between real and synthetic data generated by...
plos.figshare.com
xls
Updated May 27, 2025
Cite
Raffaele Marchesi; Nicolo Micheletti; Nicholas I-Hsien Kuo; Sebastiano Barbieri; Giuseppe Jurman; Venet Osmani (2025). Minimum Euclidean distance between real and synthetic data generated by SMOTE, WGAN-GP* and CA-GAN. No method generates exact copies of the real data. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013080.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1013080.t002
Dataset updated
May 27, 2025
Dataset provided by
PLOS Computational Biology
Authors
Raffaele Marchesi; Nicolo Micheletti; Nicholas I-Hsien Kuo; Sebastiano Barbieri; Giuseppe Jurman; Venet Osmani
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Minimum Euclidean distance between real and synthetic data generated by SMOTE, WGAN-GP* and CA-GAN. No method generates exact copies of the real data.
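As a rough illustration of the quantity named in this table title, the sketch below computes the smallest Euclidean distance between any synthetic record and any real record. A result of zero would mean at least one synthetic record is an exact copy of a real one. The function name and the toy rows are assumptions made for illustration; they are not taken from the paper or its data.

```python
import math

def min_euclidean_distance(real_rows, synthetic_rows):
    """Smallest Euclidean distance between any synthetic row and any real row."""
    # Brute-force comparison of every (synthetic, real) pair.
    return min(math.dist(s, r) for s in synthetic_rows for r in real_rows)

# Toy, made-up feature vectors (not from the dataset).
real = [[0.0, 0.0], [1.0, 1.0]]
synthetic = [[0.0, 3.0], [4.0, 5.0]]

d = min_euclidean_distance(real, synthetic)
print(d)  # greater than 0, so no synthetic row copies a real row exactly
```

The brute-force loop is quadratic in the number of rows, which is fine for a sketch; a nearest-neighbor index would likely be preferable on large tables.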
Evaluation of SFMI without SMOTE-ENN (in %).
figshare.com
xls
Updated Oct 17, 2024
Cite
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of SFMI without SMOTE-ENN (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t007
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS ONE
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Internet of Things (IoT) enables a variety of heterogeneous devices to connect over various network architectures to gather and exchange real-time information. However, the rise of IoT also creates security threats such as Distributed Denial of Service (DDoS) attacks. The recent Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions than conventional networking approaches, but limited computing resources and heterogeneous network protocols remain major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study employs an improved feature selection (FS) technique that determines the most relevant features in order to improve the detection rate and reduce the training time. First, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was applied. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the features nominated by SFE and MI. Further, Principal Component Analysis (PCA) is employed to address multicollinearity in the dataset. Comprehensive experiments were conducted on two benchmark datasets, KDDCup99 and CIC IoT-2023. For classification, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results show that the proposed SMOTE-ENN+SFMI+PCA with the RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
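One plausible reading of "top k common features" is the intersection of the k best features under each criterion. The sketch below assumes that reading; the description does not spell out the exact rule, so this may differ from the authors' procedure. The function name, feature names, and scores are all hypothetical.

```python
def top_k_common_features(sfe_ranked, mi_scores, k):
    """Features that fall in the top k of both rankings, kept in SFE order.

    sfe_ranked: feature names ordered best-first by the sequential method.
    mi_scores:  mapping of feature name to mutual-information score.
    """
    # Top-k features by mutual information (higher score is better).
    top_mi = set(sorted(mi_scores, key=mi_scores.get, reverse=True)[:k])
    # Keep only the top-k sequential picks that MI also nominates.
    return [name for name in sfe_ranked[:k] if name in top_mi]

# Hypothetical rankings and scores, for illustration only.
sfe_ranked = ["duration", "src_bytes", "dst_bytes", "flag", "count"]
mi_scores = {"src_bytes": 0.9, "count": 0.7, "duration": 0.6,
             "flag": 0.2, "dst_bytes": 0.1}

selected = top_k_common_features(sfe_ranked, mi_scores, k=3)
print(selected)  # ['duration', 'src_bytes']
```

Under this reading the result can contain fewer than k features, because only features nominated by both criteria survive.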

Replication Package of Deep Learning and Data Augmentation for Detecting...

zenodo.org

zip

Updated Apr 24, 2024

Cite

Anonymous Anonymous; Anonymous Anonymous (2024). Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [Dataset]. http://doi.org/10.5281/zenodo.10521909

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.10521909

Dataset updated

Apr 24, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Anonymous Anonymous; Anonymous Anonymous

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 17, 2024

Description

Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research on detecting SATD has focused on either identifying SATD (classifying instances as SATD or not) or categorizing SATD (labeling SATD instances as pertaining to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly for specific types of SATD such as test and requirement debt. This is mostly because the datasets used are extremely imbalanced.

In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.

Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:

Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [Github]
eXtreme Gradient Boosting+Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
eXtreme Gradient Boosting+Easy Data Augmentation (XGBoost+EDA) [Github]
MT-Text-CNN [Github]

Structure of the Replication Package:

In accordance with the original dataset, the replication package comprises four CSV files, one for each artifact considered in this study. Each CSV file contains a text column and a class column; the class denotes the specific type of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), or requirement debt (REQ), or Not-SATD.
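To check the class balance of one of these files, a minimal sketch along the following lines may help. The column header `class` is an assumption (the description names a text column and a class but does not give exact headers), and the inline sample rows are made up so that the snippet runs without the package.

```python
import csv
import io
from collections import Counter

def class_counts(csv_file, class_column="class"):
    """Count how many rows carry each class label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[class_column] for row in reader)

# Made-up rows standing in for a file such as data-augmentation-issues.csv;
# the header names "text" and "class" are assumed, not confirmed.
sample = io.StringIO(
    "text,class\n"
    "TODO: clean up this hack,C/D\n"
    "need to add unit tests here,TES\n"
    "fix typo in log message,Not-SATD\n"
    "refactor later,C/D\n"
)

counts = class_counts(sample)
print(counts.most_common())
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`, and `class_column` adjusted to the actual header.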

├── SATD Keywords
│   ├── Keywords based on Source of Artifacts
│   │   ├── Code comment.txt
│   │   ├── Commit message.txt
│   │   ├── Issue section.txt
│   │   └── Pull section.txt
│   ├── Keywords based on Types of SATD
│   │   ├── code-design debt.txt
│   │   ├── documentation debt.txt
│   │   ├── requirement debt.txt
│   │   └── test debt.txt
├── src
│   ├── bert.py
│   ├── bilstm.py
│   └── preprocessing.py
├── data-augmentation-code_comments.csv
├── data-augmentation-commit_messages.csv
├── data-augmentation-issues.csv
├── data-augmentation-pull_requests.csv
└── Supplementary Material.docx

Requirements:

glove
nltk
transformers
torch
tensorflow
keras
langdetect
inflect
inflection

Project sources for each artifact are as follows:

Source code comment: ant argouml columba emf hibernate jedit jfreechart jmeter jruby squirrel

Issue section: camel chromium gerrit hadoop hbase impala thrift

Pull section: accumulo activemq activemq-artemis airflow ambari apisix apisix-dashboard arrow attic-apex-core attic-apex-malhar attic-stratos avro beam bigtop bookkeeper brooklyn-server calcite camel camel-k camel-quarkus camel-website carbondata cassandra cloudstack commons-lang couchdb cxf daffodil drill druid dubbo echarts fineract flink fluo geode geode-native gobblin griffin groovy guacamole-client hadoop hawq hbase helix hive hudi iceberg ignite incubator-brooklyn incubator-dolphinscheduler incubator-doris incubator-heron incubator-hop incubator-mxnet incubator-pagespeed-ngx incubator-pinot incubator-weex infrastructure-puppet jena jmeter kafka karaf kylin lucene-solr madlib myfaces-tobago netbeans netbeans-website nifi nifi-minifi-cpp nutch openwhisk openwhisk-wskdeploy orc ozone parquet-mr phoenix pulsar qpid-dispatch reef rocketmq samza servicecomb-java-chassis shardingsphere shardingsphere-elasticjob skywalking spark storm streams superset systemds tajo thrift tinkerpop tomee trafficcontrol trafficserver trafodion tvm usergrid zeppelin zookeeper

Commit message: accumulo activemq activemq-artemis airflow ambari apisix apisix-dashboard arrow attic-apex-core attic-apex-malhar attic-stratos avro beam bigtop bookkeeper brooklyn-server calcite camel camel-k camel-quarkus camel-website carbondata cassandra cloudstack commons-lang couchdb cxf daffodil drill druid dubbo echarts fineract flink fluo geode geode-native gobblin griffin groovy guacamole-client hadoop hawq hbase helix hive hudi iceberg ignite incubator-brooklyn incubator-dolphinscheduler incubator-doris incubator-heron incubator-hop incubator-mxnet incubator-pagespeed-ngx incubator-pinot incubator-weex infrastructure-puppet jena jmeter kafka karaf kylin lucene-solr madlib myfaces-tobago netbeans netbeans-website nifi nifi-minifi-cpp nutch openwhisk openwhisk-wskdeploy orc ozone parquet-mr phoenix pulsar qpid-dispatch reef rocketmq samza servicecomb-java-chassis shardingsphere shardingsphere-elasticjob skywalking spark storm streams superset systemds tajo thrift tinkerpop tomee trafficcontrol trafficserver trafodion tvm usergrid zeppelin zookeeper

This dataset has undergone a data augmentation process using the AugGPT technique. Meanwhile, the original dataset can be downloaded via the following link: https://github.com/yikun-li/satd-different-sources-data

Evaluation of BFE with SMOTE-ENN (in %).
plos.figshare.com
xls
Updated Oct 17, 2024
Cite
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of BFE with SMOTE-ENN (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t009
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t009
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS ONE
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Internet of Things (IoT) enables a variety of heterogeneous devices to connect over various network architectures to gather and exchange real-time information. However, the rise of IoT also creates security threats such as Distributed Denial of Service (DDoS) attacks. The recent Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions than conventional networking approaches, but limited computing resources and heterogeneous network protocols remain major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study employs an improved feature selection (FS) technique that determines the most relevant features in order to improve the detection rate and reduce the training time. First, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was applied. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the features nominated by SFE and MI. Further, Principal Component Analysis (PCA) is employed to address multicollinearity in the dataset. Comprehensive experiments were conducted on two benchmark datasets, KDDCup99 and CIC IoT-2023. For classification, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results show that the proposed SMOTE-ENN+SFMI+PCA with the RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
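For readers unfamiliar with the oversampling half of SMOTE-ENN, the sketch below shows the core SMOTE idea: a synthetic minority point is placed at a random position on the segment between a minority point and one of its nearest minority neighbors. This is a simplified illustration, not the authors' pipeline. The Edited Nearest Neighbor cleaning step is omitted, the function name and toy points are hypothetical, and at least two minority points are assumed.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a near neighbor."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # The k nearest other minority points to the chosen base point.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Toy minority-class points (made up for illustration).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

In practice a maintained implementation such as the imbalanced-learn package would normally be used instead of hand-written interpolation.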
The layers architecture stack of applied neural network based GRU method.
figshare.com
xls
Updated Aug 28, 2024
Cite
Iqra Akhtar; Shahid Atiq; Muhammad Umair Shahid; Ali Raza; Nagwan Abdel Samee; Maali Alabdulhafith (2024). The layers architecture stack of applied neural network based GRU method. [Dataset]. http://doi.org/10.1371/journal.pone.0309459.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309459.t003
Dataset updated
Aug 28, 2024
Dataset provided by
PLOS ONE
Authors
Iqra Akhtar; Shahid Atiq; Muhammad Umair Shahid; Ali Raza; Nagwan Abdel Samee; Maali Alabdulhafith
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The layers architecture stack of applied neural network based GRU method.