4 datasets found
  1. Supplementary Information for Owen et al. 2025. What is 'accuracy'? Rethinking machine learning classifier performance metrics for highly imbalanced, high variance, zero-inflated species count data

    • zenodo.org
    Cite
    Bianca Owen (2025). Supplementary Information for Owen et al. 2025. What is 'accuracy'? Rethinking machine learning classifier performance metrics for highly imbalanced, high variance, zero-inflated species count data [Dataset]. http://doi.org/10.5281/zenodo.15913476
    Available download formats: bin
    Dataset updated: Jul 15, 2025
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Bianca Owen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Microsoft Excel file showing the datasets and working for all findings in Owen et al. 2025. What is ‘accuracy’? Rethinking machine learning classifier performance metrics for highly imbalanced, high variance, zero-inflated species count data

  2. Credit Card Fraud Detection Dataset

    • kaggle.com
    Cite
    Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: May 15, 2025
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Ghanshyam Saini
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

    As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

    About the Dataset:

    This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

    Content of the Data:

    Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

    The only features that have not been transformed by PCA are:

    • Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.
    • Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

    The target variable for this classification task is:

    • Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

    Important Note on Evaluation:

    Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
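    To see why plain accuracy misleads here, consider a degenerate classifier that labels every transaction as legitimate. Using only the class counts quoted above (a quick sanity check, not part of the dataset itself):

```python
# Class counts from the dataset description above.
total, frauds = 284_807, 492

# A degenerate classifier that labels every transaction "legitimate"
# is correct on all non-fraud rows and wrong on every fraud row.
trivial_accuracy = (total - frauds) / total   # ~99.83% accuracy while detecting nothing

# By contrast, the AUPRC of a random classifier equals the positive
# prevalence, so a useful model must clear this much lower bar.
fraud_prevalence = frauds / total             # ~0.00173

print(f"always-legitimate accuracy: {trivial_accuracy:.4%}")
print(f"fraud prevalence (random AUPRC baseline): {fraud_prevalence:.4%}")
```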

    How to Use This Dataset:

    1. Download the dataset file (likely in CSV format).
    2. Load the data using libraries like Pandas.
    3. Understand the class imbalance: Be aware that fraudulent transactions are rare.
    4. Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).
    5. Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.
    6. Build and train binary classification models to predict the 'Class' variable.
    7. Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
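    The steps above can be sketched end-to-end. Since the CSV itself cannot be bundled here, a small synthetic imbalanced dataset stands in for `creditcard.csv`, and the model and imbalance-handling choices (class weighting in scikit-learn) are illustrative, not prescriptive:

```python
# Sketch of steps 2-7: load-like data, stratified split, imbalance
# handling, training, and AUPRC evaluation. Synthetic data stands in
# for the real creditcard.csv file.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

# Step 3: mimic a rare positive class (~0.5% here) on a smaller sample.
X, y = make_classification(n_samples=20_000, n_features=10,
                           weights=[0.995], flip_y=0.0, random_state=0)

# A stratified split keeps the rare class represented in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 5: class_weight='balanced' reweights the minority class instead
# of resampling; SMOTE or undersampling are common alternatives.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Step 7: AUPRC (average precision) on held-out probability scores.
auprc = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
baseline = y_te.mean()   # AUPRC of a random classifier = prevalence
print(f"AUPRC={auprc:.3f} vs random baseline={baseline:.4f}")
```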

    Acknowledgements and Citation:

    This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

    When using this dataset in your research or projects, please cite the following works as appropriate:

    • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
    • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon.
    • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE.
    • Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi).
    • Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing.
    • Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi. Combining Unsupervised and Supervised...
  3. Diabetes-Prediction-Analysis dataset (8).

    • plos.figshare.com
    Cite
    Wangyouchen Zhang; Zhenhua Xia; Guoqing Cai; Junhao Wang; Xutao Dong (2025). Diabetes-Prediction-Analysis dataset (8). [Dataset]. http://doi.org/10.1371/journal.pone.0327120.t001
    Available download formats: xls
    Dataset updated: Jul 8, 2025
    Dataset provided by: PLOS ONE
    Authors: Wangyouchen Zhang; Zhenhua Xia; Guoqing Cai; Junhao Wang; Xutao Dong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To improve the effectiveness of diabetes risk prediction, this study proposes a novel method based on focal active learning strategies combined with machine learning models. Existing machine learning models often suffer from poor performance on imbalanced medical datasets, where minority class instances such as diabetic cases are underrepresented. Our proposed Focal Active Learning method selectively samples informative instances to mitigate this imbalance, leading to better prediction outcomes with fewer labeled samples. The method integrates SHAP (SHapley Additive Explanations) to quantify feature importance and applies attention mechanisms to dynamically adjust feature weights, enhancing model interpretability and performance in predicting diabetes risk. To address the issue of imbalanced classification in diabetes datasets, we employed a clustering-based method to identify representative data points (called foci), and iteratively constructed a smaller labeled dataset (sub-pool) around them using similarity-based sampling. This method aims to overcome common challenges, such as poor performance on minority classes and limited generalization, by enabling more efficient data utilization and reducing labeling costs. The experimental results demonstrated that our approach significantly improved the evaluation metrics for diabetes risk prediction, achieving an accuracy of 97.41% and a recall rate of 94.70%, clearly outperforming traditional models that typically achieve 95% accuracy and 92% recall. Additionally, the model’s generalization ability was further validated on the public PIMA Indians Diabetes DataBase, outperforming traditional models in both accuracy and recall. This approach can enhance early diabetes screening in clinical settings, helping healthcare professionals reduce diagnostic errors and optimize resource allocation.
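    As a rough illustration of the clustering-based idea described above (representative "foci" plus a similarity-grown sub-pool), here is a toy sketch on synthetic data; the cluster count, sub-pool size, and distance metric are assumptions, not the authors' implementation:

```python
# Toy sketch: pick cluster-representative "foci" from an unlabeled
# pool, then grow a small sub-pool around each focus by similarity.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 8))   # stand-in unlabeled feature pool

k = 5                              # number of foci (assumed)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pool)

# Focus = the real sample closest to each cluster centroid.
d_to_centroids = pairwise_distances(pool, km.cluster_centers_)
foci = d_to_centroids.argmin(axis=0)

# Sub-pool = the m most similar (nearest) points around each focus.
m = 20
sub_pool = np.unique(np.concatenate([
    pairwise_distances(pool, pool[f:f + 1]).ravel().argsort()[:m]
    for f in foci
]))
print(f"{len(foci)} foci, sub-pool of {len(sub_pool)} samples")
```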

  4. CDR dataset features information.

    • plos.figshare.com
    Cite
    An Tong; Bochao Chen; Zhe Wang; Jiawei Gao; Chi Kin Lam (2025). CDR dataset features information. [Dataset]. http://doi.org/10.1371/journal.pone.0322004.t002
    Available download formats: xls
    Dataset updated: May 30, 2025
    Dataset provided by: PLOS ONE
    Authors: An Tong; Bochao Chen; Zhe Wang; Jiawei Gao; Chi Kin Lam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the number of telecom frauds has increased significantly, causing substantial losses to people’s daily lives. With technological advancements, telecom fraud methods have also become more sophisticated, making fraudsters harder to detect as they often imitate normal users and exhibit highly similar features. Traditional graph neural network (GNN) methods aggregate the features of neighboring nodes, which makes it difficult to distinguish between fraudsters and normal users when their features are highly similar. To address this issue, we proposed a spatio-temporal graph attention network (GDFGAT) with feature-difference-based weight updates. We conducted comprehensive experiments on a real telecom fraud dataset. Our method achieved an accuracy of 93.28%, an F1 score of 92.08%, a precision of 93.51%, a recall of 90.97%, and an AUC of 94.53%. The results showed that our method (GDFGAT) outperforms the classical method, the latest methods, and the baseline model on many metrics, with each metric improving by nearly 2%. We also conducted experiments on the imbalanced Amazon and YelpChi datasets, where GDFGAT again performed better than the baseline model on some metrics.
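    One possible reading of the feature-difference idea, sketched on toy vectors (the scoring and aggregation below are illustrative assumptions, not the GDFGAT architecture): neighbours whose features differ more from the centre node receive more attention, so highly similar neighbours are not simply averaged together:

```python
# Toy sketch: attention weights driven by neighbour feature differences
# rather than raw feature similarity. Purely illustrative.
import numpy as np

def diff_attention(x, neighbours):
    """Aggregate neighbours, weighting each by its feature difference from x."""
    diffs = neighbours - x                            # per-neighbour feature differences
    scores = np.linalg.norm(diffs, axis=1)            # larger difference -> larger score
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over neighbours
    return weights @ neighbours                       # weighted aggregation

x = np.array([1.0, 0.0])
nbrs = np.array([[1.0, 0.0],    # identical neighbour (down-weighted)
                 [3.0, 4.0]])   # very different neighbour (up-weighted)
out = diff_attention(x, nbrs)
```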

