5 datasets found
  1. Minimum Euclidean distance between real and synthetic data generated by...

    • plos.figshare.com
    xls
    Updated May 27, 2025
    Cite
    Raffaele Marchesi; Nicolo Micheletti; Nicholas I-Hsien Kuo; Sebastiano Barbieri; Giuseppe Jurman; Venet Osmani (2025). Minimum Euclidean distance between real and synthetic data generated by SMOTE, WGAN-GP* and CA-GAN. No method generates exact copies of the real data. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013080.t002
    Available download formats: xls
    Dataset updated
    May 27, 2025
    Dataset provided by
    PLOS Computational Biology
    Authors
    Raffaele Marchesi; Nicolo Micheletti; Nicholas I-Hsien Kuo; Sebastiano Barbieri; Giuseppe Jurman; Venet Osmani
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Minimum Euclidean distance between real and synthetic data generated by SMOTE, WGAN-GP* and CA-GAN. No method generates exact copies of the real data.
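
    As a rough illustration of the metric named in this description, the sketch below computes, for each synthetic row, the Euclidean distance to its nearest real row, and checks whether any distance is exactly zero (an exact copy). The function name `min_distances` and the toy arrays are assumptions for illustration only; they are not taken from the dataset or from the authors' code.

```python
import numpy as np

def min_distances(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    # Pairwise differences via broadcasting: shape (n_synthetic, n_real, n_features).
    diff = synthetic[:, None, :] - real[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Nearest real neighbour of every synthetic row.
    return dist.min(axis=1)

# Toy data (hypothetical, not from the dataset).
real = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
synthetic = np.array([[0.5, 0.5], [2.0, 1.0]])

d = min_distances(real, synthetic)
overall_min = d.min()                        # smallest real-to-synthetic distance
has_exact_copy = bool(np.any(d == 0.0))      # True only if a synthetic row equals a real row
```

    A strictly positive `overall_min` means no synthetic row duplicates a real one, which is the property the description reports for all three methods.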

  2. Evaluation of SFMI without SMOTE-ENN (in %).

    • figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of SFMI without SMOTE-ENN (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t007
    Available download formats: xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Internet of Things (IoT) gives a variety of heterogeneous devices network connectivity, over various network architectures, to gather and exchange real-time information. At the same time, the rise of IoT creates security threats such as Distributed Denial of Service (DDoS) attacks. The recent Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions than conventional networking approaches. However, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study employs an improved feature selection (FS) technique that determines the most relevant features, which can improve the detection rate and reduce the training time. First, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was applied. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the features nominated by SFE and MI. Further, Principal Component Analysis (PCA) is employed to address multicollinearity in the dataset. Comprehensive experiments were conducted on two benchmark datasets, KDDCup99 and CIC IoT-2023. For classification, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results demonstrate that the proposed SMOTE-ENN+SFMI+PCA with the RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
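
    To make the oversampling step more concrete, here is a minimal NumPy sketch of the SMOTE half of SMOTE-ENN: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. This is an assumption-laden illustration, not the authors' implementation: the function `smote_sketch`, its parameters, and the toy data are hypothetical, and the ENN cleaning step (removing samples whose neighbours mostly belong to another class) is omitted.

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority samples by interpolating toward nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    # Pairwise Euclidean distances among the minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate a random fraction of the way from the base to the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class (hypothetical data): the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote_sketch(X_min, n_new=6)
```

    Because every synthetic point lies on a segment between two existing minority points, it stays inside the region the minority class already occupies.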

  3. Replication Package of Deep Learning and Data Augmentation for Detecting...

    • zenodo.org
    zip
    Updated Apr 24, 2024
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2024). Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [Dataset]. http://doi.org/10.5281/zenodo.10521909
    Available download formats: zip
    Dataset updated
    Apr 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 17, 2024
    Description

    Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research on SATD detection has focused on either identifying SATD (classifying instances as SATD or not) or categorizing SATD (labeling SATD instances as pertaining to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly for specific types of SATD such as test and requirement debt, mostly because the datasets used are extremely imbalanced.

    In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.
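
    The two-step approach described above can be sketched as a small pipeline: a binary identifier decides whether a text is SATD, and a categorizer is applied only to the texts that pass. In the sketch below the two classifiers are hypothetical keyword-based stand-ins for the BiLSTM and BERT models; `two_step_satd`, `toy_identifier`, and `toy_categorizer` are illustrative names, not part of the replication package.

```python
from typing import Callable, List

def two_step_satd(
    texts: List[str],
    is_satd: Callable[[str], bool],
    satd_type: Callable[[str], str],
) -> List[str]:
    """Step 1 identifies SATD; step 2 assigns a type only to texts flagged in step 1."""
    labels = []
    for text in texts:
        labels.append(satd_type(text) if is_satd(text) else "Not-SATD")
    return labels

# Hypothetical stand-in for the binary identification model (BiLSTM in the study).
def toy_identifier(text: str) -> bool:
    return any(word in text.lower() for word in ("todo", "fixme", "hack"))

# Hypothetical stand-in for the categorization model (BERT in the study).
def toy_categorizer(text: str) -> str:
    return "TES" if "test" in text.lower() else "C/D"

comments = [
    "TODO: add unit test for parser",
    "FIXME ugly hack in loop",
    "Returns the user name",
]
result = two_step_satd(comments, toy_identifier, toy_categorizer)
```

    Keeping the two steps as separate callables means either model can be swapped without touching the other.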

    Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:

    1. Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [Github]
    2. eXtreme Gradient Boosting+Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
    3. eXtreme Gradient Boosting+Easy Data Augmentation (XGBoost+EDA) [Github]
    4. MT-Text-CNN [Github]

    Structure of the Replication Package:

    Following the original dataset, this dataset comprises four CSV files, one for each artifact considered in the study. Each CSV file contains a text column and a class column; the class gives the type of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), or requirement debt (REQ), or marks the text as Not-SATD.
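
    As a quick way to inspect class balance in one of these files, the sketch below counts rows per class with the standard library. It assumes the column headers are literally `text` and `class`, which the description suggests but does not confirm; the in-memory sample rows are hypothetical and stand in for a real file such as data-augmentation-code_comments.csv.

```python
import csv
import io
from collections import Counter

def class_counts(handle) -> Counter:
    """Count rows per SATD class in a CSV with 'text' and 'class' columns (assumed headers)."""
    reader = csv.DictReader(handle)
    return Counter(row["class"] for row in reader)

# Hypothetical sample rows; replace with open("<file>.csv", newline="") for a real file.
sample = io.StringIO(
    "text,class\n"
    "TODO: refactor this method,C/D\n"
    "need more tests for edge cases,TES\n"
    "fix typo in variable name,Not-SATD\n"
    "temporary workaround until API is stable,C/D\n"
)
counts = class_counts(sample)
```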

    ├── SATD Keywords
    │   ├── Keywords based on Source of Artifacts
    │   │   ├── Code comment.txt
    │   │   ├── Commit message.txt
    │   │   ├── Issue section.txt
    │   │   └── Pull section.txt
    │   └── Keywords based on Types of SATD
    │       ├── code-design debt.txt
    │       ├── documentation debt.txt
    │       ├── requirement debt.txt
    │       └── test debt.txt
    ├── src
    │   ├── bert.py
    │   ├── bilstm.py
    │   └── preprocessing.py
    ├── data-augmentation-code_comments.csv
    ├── data-augmentation-commit_messages.csv
    ├── data-augmentation-issues.csv
    ├── data-augmentation-pull_requests.csv
    └── Supplementary Material.docx

    Requirements:

    nltk
    transformers
    torch
    tensorflow
    keras
    langdetect
    inflect
    inflection

    Project sources for each artifact are as follows:

    Source code comment (10 projects): ant, argouml, columba, emf, hibernate, jedit, jfreechart, jmeter, jruby, squirrel

    Issue section (7 projects): camel, chromium, gerrit, hadoop, hbase, impala, thrift

    Pull section (103 projects): accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper

    Commit message (103 projects): the same projects as the Pull section.

    This dataset was augmented using the AugGPT technique. The original dataset can be downloaded from: https://github.com/yikun-li/satd-different-sources-data

  4. Evaluation of BFE with SMOTE-ENN (in %).

    • plos.figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of BFE with SMOTE-ENN (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t009
    Available download formats: xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Internet of Things (IoT) gives a variety of heterogeneous devices network connectivity, over various network architectures, to gather and exchange real-time information. At the same time, the rise of IoT creates security threats such as Distributed Denial of Service (DDoS) attacks. The recent Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions than conventional networking approaches. However, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study employs an improved feature selection (FS) technique that determines the most relevant features, which can improve the detection rate and reduce the training time. First, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was applied. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the features nominated by SFE and MI. Further, Principal Component Analysis (PCA) is employed to address multicollinearity in the dataset. Comprehensive experiments were conducted on two benchmark datasets, KDDCup99 and CIC IoT-2023. For classification, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results demonstrate that the proposed SMOTE-ENN+SFMI+PCA with the RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
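
    One possible reading of "top k common features" is sketched below: rank the features under each method, keep the top k of each ranking, and take the intersection. This is an assumption, since the description does not specify how the two selections are combined; in particular, giving the sequential selection a per-feature score is a simplification, and `top_k_common` and the score values are hypothetical.

```python
from typing import List, Sequence

def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def top_k_common(scores_a: Sequence[float], scores_b: Sequence[float], k: int) -> List[int]:
    """Features that appear in the top k of both rankings, in ascending index order."""
    return sorted(set(top_k_indices(scores_a, k)) & set(top_k_indices(scores_b, k)))

# Hypothetical per-feature scores for five features (not from the study).
sequential_scores = [0.9, 0.1, 0.8, 0.3, 0.7]   # stand-in for the sequential selection ranking
mi_scores = [0.2, 0.6, 0.9, 0.1, 0.8]           # stand-in for mutual information scores

common = top_k_common(sequential_scores, mi_scores, k=3)
```

    With these toy scores the two top-3 sets are {0, 2, 4} and {1, 2, 4}, so only features 2 and 4 survive the intersection.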

  5. The layers architecture stack of applied neural network based GRU method.

    • figshare.com
    xls
    Updated Aug 28, 2024
    Cite
    Iqra Akhtar; Shahid Atiq; Muhammad Umair Shahid; Ali Raza; Nagwan Abdel Samee; Maali Alabdulhafith (2024). The layers architecture stack of applied neural network based GRU method. [Dataset]. http://doi.org/10.1371/journal.pone.0309459.t003
    Available download formats: xls
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Iqra Akhtar; Shahid Atiq; Muhammad Umair Shahid; Ali Raza; Nagwan Abdel Samee; Maali Alabdulhafith
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The layers architecture stack of applied neural network based GRU method.
