46 datasets found
  1. Phishing Awareness Dataset for security breaches

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    Rasika Ekanayaka @ devLK (2025). Phishing Awareness Dataset for security breaches [Dataset]. https://www.kaggle.com/datasets/rasikaekanayakadevlk/phishing-awareness-dataset-for-security-breaches/versions/1
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rasika Ekanayaka @ devLK
    Description

    🛡️ Simulated Phishing Interaction Dataset

    Overview:
    This dataset captures user interactions with potentially malicious emails, simulating scenarios relevant to phishing detection and human-centric security analysis. Each row represents a unique email event, enriched with behavioral, technical, and contextual metadata.

    🔍 Use Cases

    • Phishing Click Prediction
      Predict if a user will click a link based on hover time, device type, and email domain.

    • User Risk Profiling
      Build behavior models: e.g., do mobile users report threats less often?

    • Language & Localization Patterns
      Evaluate phishing success rates by language and region.

    • Realistic Red Teaming Simulations
      Use as a training or benchmarking set for phishing email simulations.

    📉 Example Insights

    • Users with hover_time_ms < 1000 are 60% more likely to click malicious links.
    • Emails in Japanese and German had a higher click-through rate, especially on mobile.
    • Edge and Opera browsers had a lower phishing report rate compared to Firefox and Chrome.

    Sample Code Snippet

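    import pandas as pd
    
    # Load the dataset (file name as given in the dataset description)
    df = pd.read_csv("phishing_email_behavior.csv")
    # Share of each clicked_link value within each device type (each row sums to 1)
    clicked_ratio = df.groupby("device_type")["clicked_link"].value_counts(normalize=True).unstack()
    print(clicked_ratio)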
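
The hover-time insight above can be checked in a similar way. The following is a minimal sketch using only the standard library; the column names hover_time_ms and clicked_link come from the description, while the sample rows and the 0/1 encoding of clicked_link are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample rows; the real file's columns and encodings may differ.
SAMPLE_CSV = """hover_time_ms,device_type,clicked_link
450,mobile,1
800,desktop,1
950,mobile,0
1500,desktop,0
2200,mobile,0
3100,desktop,1
"""


def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def click_rate_by_hover(rows, threshold_ms=1000):
    """Return (click rate below threshold, click rate at or above threshold)."""
    fast = [int(r["clicked_link"]) for r in rows
            if float(r["hover_time_ms"]) < threshold_ms]
    slow = [int(r["clicked_link"]) for r in rows
            if float(r["hover_time_ms"]) >= threshold_ms]
    return mean(fast), mean(slow)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
fast_rate, slow_rate = click_rate_by_hover(rows)
print(f"hover < 1000 ms: {fast_rate:.2f}, hover >= 1000 ms: {slow_rate:.2f}")
```

Comparing the two rates on the real file would show whether the quoted 60% figure holds for a given threshold.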
    
  2. Don't Take the Bait: Recognize and Avoid Phishing Attacks

    • data.urbandatacentre.ca
    • datasets.ai
    • +2more
    Updated Oct 1, 2024
    Cite
    (2024). Don't Take the Bait: Recognize and Avoid Phishing Attacks [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-2bbfd0ea-1757-488e-89bf-8ad90c521a52
    Dataset updated
    Oct 1, 2024
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Description

    Phishing is an attack where a scammer calls you, texts or emails you, or uses social media to trick you into clicking a malicious link, downloading malware, or sharing sensitive information. Phishing attempts are often generic mass messages, but the message appears to be legitimate and from a trusted source (e.g. from a bank, courier company).

  3. Three common types of phishing scams - Get Cyber Safe 2021

    • datasets.ai
    • ouvert.canada.ca
    • +1more
    21
    Updated Sep 11, 2024
    Cite
    Communications Security Establishment Canada | Centre de la sécurité des télécommunications Canada (2024). Three common types of phishing scams - Get Cyber Safe 2021 [Dataset]. https://datasets.ai/datasets/46785d5c-21e4-4b4a-a499-ed3445fac760
    Explore at:
    21Available download formats
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Communications Security Establishment Canada (https://cyber.gc.ca/en/)
    Authors
    Communications Security Establishment Canada | Centre de la sécurité des télécommunications Canada
    Description

    Be aware of the common types of phishing scams that are out there.

  4. What are the most common forms of phishing? An overview - Get Cyber Safe...

    • beta.data.urbandatacentre.ca
    • data.urbandatacentre.ca
    Updated Oct 22, 2024
    Cite
    (2024). What are the most common forms of phishing? An overview - Get Cyber Safe 2021 - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://beta.data.urbandatacentre.ca/dataset/gov-canada-65e7c86c-77a2-4283-b493-ea56f36ea36b
    Dataset updated
    Oct 22, 2024
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Area covered
    Canada
    Description

    A quick overview of the most common types of phishing campaigns that cyber criminals use to steal your information.

  5. Enron Fraud Email Dataset

    • kaggle.com
    Updated Dec 28, 2023
    Cite
    Advaith S Rao (2023). Enron Fraud Email Dataset [Dataset]. https://www.kaggle.com/datasets/advaithsrao/enron-fraud-email-dataset/versions/1
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Advaith S Rao
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.

    In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.

    Though the Enron Email Dataset contains over 500K emails, one problem is the limited availability of labeled fraud emails. Label annotation is done so that a broad umbrella of fraud emails can be detected accurately. Since fraud emails fall into several types, such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, multiple heuristics are needed to label all types of fraudulent emails effectively.

    To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.

    To perform fraud annotation on the Enron dataset as well as provide more fraud examples for modeling, two more fraud data sources have been used:

    • Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
    • Social Engineering Dataset: http://aclweb.org/aclwiki

    Label Annotation

    To label the Enron email dataset, two signals are used to filter suspicious emails and label them into fraud and non-fraud classes:

    • Automated ML labeling
    • Email Signals

    Automated ML Labeling

    The following heuristics are used to annotate labels for Enron email data using the other two data sources:

    Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.

    Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.

    The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
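
To make the embedding step concrete, here is a minimal sketch of plain TF-IDF weighting in pure Python. It is an illustration only, not the annotators' actual pipeline: the tokenizer (lowercase, whitespace split), the unsmoothed tf * log(N / df) formula, and the sample texts are all assumptions, and library implementations typically use smoothed or normalized variants. The SVM with a Gaussian kernel that would consume these vectors is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical email texts, used only to exercise the function.
docs = [
    "verify your account now",
    "meeting agenda for monday",
    "your account statement for monday",
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in a single document (such as "verify") receive a higher weight than terms shared across documents (such as "your"), which is the property the classifier relies on.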

    If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.

    Email Signals

    Email Signal-based heuristics are used to filter and target suspicious emails for fraud labeling specifically. The signals used were:

    Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.

    Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. A set of folders with a higher chance of containing fraud emails, such as Deleted Items and Junk, is treated as suspicious.

    Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.

    Low Communication: A threshold of 4 emails based on the table below was used to define Low Communication. A user qualifies as a Low-Comm sender if their emails are below this threshold. Mails sent from low-comm senders have been assigned with a high probability of being a fraud.

    Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
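
The five signals above can be sketched as simple boolean checks. This is a hypothetical illustration: the description does not say how the signals are combined, so the Email record, the enron.com internal-domain check, the subject-prefix test for replies and forwards, the choice to count external senders as a vote, and the voting rule in looks_like_fraud are all assumptions. Only the folder names and the threshold of 4 emails come from the text.

```python
from dataclasses import dataclass

SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}  # folders named in the description
LOW_COMM_THRESHOLD = 4                          # threshold stated in the description
INTERNAL_DOMAIN = "enron.com"                   # assumption: internal senders use this domain


@dataclass
class Email:
    """Hypothetical record holding the metadata the signals need."""
    sender: str
    folder: str
    mailbox_owner: str
    subject: str
    sender_email_count: int  # total emails seen from this sender


def email_signals(email, persons_of_interest):
    """Compute the five boolean signals for one email."""
    subject = email.subject.lower()
    return {
        "person_of_interest": email.mailbox_owner in persons_of_interest,
        "suspicious_folder": email.folder.lower() in SUSPICIOUS_FOLDERS,
        "external_sender": not email.sender.lower().endswith("@" + INTERNAL_DOMAIN),
        "low_comm_sender": email.sender_email_count < LOW_COMM_THRESHOLD,
        "reply_or_forward": subject.startswith(("re:", "fw:", "fwd:")),
    }


def looks_like_fraud(signals, min_votes=3):
    """Hypothetical rule: replies/forwards are excluded, the other signals vote."""
    if signals["reply_or_forward"]:
        return False
    votes = sum(signals[name] for name in (
        "person_of_interest", "suspicious_folder",
        "external_sender", "low_comm_sender"))
    return votes >= min_votes


poi = {"jane.doe"}  # placeholder, not a real person-of-interest list
spam = Email("prize@lottery.example", "junk", "jane.doe", "You won", 1)
reply = Email("prize@lottery.example", "junk", "jane.doe", "RE: You won", 1)
print(looks_like_fraud(email_signals(spam, poi)))
print(looks_like_fraud(email_signals(reply, poi)))
```

In the workflow described above, a check like this would run only on emails that one of the two ML annotators had already flagged.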

    Manual Inspection

    To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.

    Dataset Breakdown

    Fraud    Non-Fraud
    2327     445090

    Citations

    Enron Dataset
    Title: Enron Email Dataset
    URL: https://www.cs.cmu.edu/~enron/
    Publisher: MIT, CMU
    Author: Leslie Kaelbling, William W. Cohen
    Year: 2015

    Phishing Email Detection Dataset
    Title: Phishing Email Detection
    URL: https://www.kaggle.com/dsv/6090437
    DOI: 10.34740/KAGGLE/DSV/6090437
    Publisher: Kaggle
    Author: Subhadeep Chakraborty
    Year: 2023

    CLAIR Fraud Email Collection
    Title: CLAIR collection of fraud email
    URL: http://aclweb.org/aclwiki
    Author: Radev, D.
    Year: 2008

  6. Three common types of phishing scams - Get Cyber Safe 2021 - Catalogue -...

    • data.urbandatacentre.ca
    • beta.data.urbandatacentre.ca
    Updated Oct 1, 2024
    Cite
    (2024). Three common types of phishing scams - Get Cyber Safe 2021 - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-46785d5c-21e4-4b4a-a499-ed3445fac760
    Dataset updated
    Oct 1, 2024
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Area covered
    Canada
    Description

    Be aware of the common types of phishing scams that are out there.

  7. Phishing_Link_Pattern_Dataset

    • huggingface.co
    Updated Jul 26, 2025
    Cite
    Sunny thakur (2025). Phishing_Link_Pattern_Dataset [Dataset]. https://huggingface.co/datasets/darkknight25/Phishing_Link_Pattern_Dataset
    Dataset updated
    Jul 26, 2025
    Authors
    Sunny thakur
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Phishing Link Pattern Dataset

      Overview
    

    This dataset provides a comprehensive collection of URLs labeled as either legitimate or phishing, designed for machine learning, cybersecurity analysis, and penetration testing. It includes 1000 entries (IDs 1–1000) covering popular brands across multiple top-level domains (TLDs) such as .es, .de, and .co.uk. The dataset captures advanced features like domain entropy, subdomain count, and suspicious keywords to aid in phishing… See the full description on the dataset page: https://huggingface.co/datasets/darkknight25/Phishing_Link_Pattern_Dataset.
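
Features of the kind named above (domain entropy, subdomain count, suspicious keywords) can be derived from a URL string. The sketch below is an assumption about how such features might be computed, not the dataset author's method: the keyword list is illustrative, entropy is taken over the characters of the hostname, and the subdomain count is naive (it miscounts multi-part suffixes such as .co.uk).

```python
import math
from collections import Counter
from urllib.parse import urlparse

SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update")  # illustrative list


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def url_features(url):
    """Extract three simple lexical features from a URL."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "domain_entropy": shannon_entropy(host),
        # Naive: counts labels left of the last two, so .co.uk is miscounted.
        "subdomain_count": max(len(labels) - 2, 0),
        "has_suspicious_keyword": any(k in url.lower() for k in SUSPICIOUS_KEYWORDS),
    }


features = url_features("http://secure-login.example.com/verify")
print(features)
```

A production feature extractor would normally consult a public-suffix list to separate the registrable domain from its subdomains.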

  8. Data from: Password Reset Dataset

    • kaggle.com
    Updated Oct 3, 2023
    Cite
    HariSellowpay (2023). Password Reset Dataset [Dataset]. https://www.kaggle.com/datasets/harisellowpay/password-reset-dataset
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HariSellowpay
    Description

    The dataset is designed to simulate password-related events, creating a synthetic representation of actions related to password management. It includes fields like timestamp, action, event type, location, IP address, password, hour, and time difference.

    • The dataset comprises 50,000 records representing a variety of password-related events.
    • A list of commonly used passwords is incorporated to mimic real-world scenarios.
    • Timestamps are spread throughout the current year.
    • Features like 'hour' and 'time_difference' are derived to provide additional insights into the temporal aspects of the events.
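
The derived features mentioned in the last bullet can be sketched as follows. The description does not define 'time_difference', so this example assumes it means the number of seconds since the previous event in chronological order; the event records and action names are hypothetical.

```python
from datetime import datetime


def add_temporal_features(events):
    """Return events sorted by time with 'hour' and 'time_difference' (seconds) added."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    enriched = []
    previous = None
    for event in ordered:
        ts = event["timestamp"]
        # The first event has no predecessor, so its difference is set to 0.
        delta = (ts - previous).total_seconds() if previous is not None else 0.0
        enriched.append({**event, "hour": ts.hour, "time_difference": delta})
        previous = ts
    return enriched


# Hypothetical events, deliberately out of order.
events = [
    {"timestamp": datetime(2023, 5, 1, 9, 30, 0), "action": "reset_request"},
    {"timestamp": datetime(2023, 5, 1, 9, 0, 0), "action": "login_failed"},
    {"timestamp": datetime(2023, 5, 1, 14, 45, 0), "action": "password_changed"},
]
enriched = add_temporal_features(events)
for row in enriched:
    print(row["action"], row["hour"], row["time_difference"])
```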

    This synthetic dataset can be used for training and testing machine learning models related to cyber security, anomaly detection, or password management. It allows researchers and practitioners to experiment with data resembling real-world scenarios without compromising actual user information.

  9. What are the most common forms of phishing? An overview - Get Cyber Safe...

    • open.canada.ca
    • ouvert.canada.ca
    html
    Updated Mar 8, 2023
    Cite
    Communications Security Establishment Canada (2023). What are the most common forms of phishing? An overview - Get Cyber Safe 2021 [Dataset]. https://open.canada.ca/data/info/65e7c86c-77a2-4283-b493-ea56f36ea36b
    Available download formats: html
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Communications Security Establishment Canada (https://cyber.gc.ca/en/)
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Description

    A quick overview of the most common types of phishing campaigns that cyber criminals use to steal your information.

  10. Scam Survivors Sextortion Reports - Datasets - data.bris

    • data.bris.ac.uk
    Updated Dec 19, 2023
    + more versions
    Cite
    (2023). Scam Survivors Sextortion Reports - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/mmtun4gufpdb2tmmrcpos4shq
    Dataset updated
    Dec 19, 2023
    Description

    This dataset contains over 41,000 posts from the sextortion reporting board of scamsurvivors.com, as collected on the 14th of July, 2023. The data was collected and is shared with the approval of the Scam Survivors administrator. Of these posts, 23,705 were automatically identified as following a common structured report format, and the reported answers to specific questions were extracted into a tabular CSV format, which was then further processed to clean and standardise responses. The data does not contain identifiable or demographic victim information, as the reports are anonymous at source, but does include details of sextortion offenders' (purported) names and ages, as well as their online presence, meeting locations, conversation platforms, interaction dynamics, payment demands and some victim reflection on incidents. This dataset has been created as part of the REPHRAIN project (https://www.rephrain.ac.uk/).

  11. creditcard Dataset

    • figshare.com
    csv
    Updated Jun 9, 2025
    Cite
    Mohammad Shanaa; Sherief Abdallah (2025). creditcard Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.29270873.v1
    Available download formats: csv
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    figshare
    Authors
    Mohammad Shanaa; Sherief Abdallah
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Title: Credit Card Transactions Dataset for Fraud Detection (Used in: A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning)

    Description: This dataset, commonly known as creditcard.csv, contains anonymized credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, with 492 labeled as fraudulent. Due to confidentiality constraints, features have been transformed using PCA, except for 'Time' and 'Amount'.

    This dataset was used in the research article titled "A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection". The study proposes an ensemble model integrating techniques such as Autoencoders, Isolation Forest, Local Outlier Factor, and supervised classifiers including XGBoost and Random Forest, aiming to improve the detection of rare fraudulent patterns while maintaining efficiency and scalability.

    Key Features:

    • 30 numerical input features (V1–V28, Time, Amount)
    • Class label indicating fraud (1) or normal (0)
    • Imbalanced class distribution typical in real-world fraud detection

    Use Case: Ideal for benchmarking and evaluating anomaly detection and classification algorithms in highly imbalanced data scenarios.

    Source: Originally published by the Machine Learning Group at Université Libre de Bruxelles. https://www.kaggle.com/mlg-ulb/creditcardfraud

    License: This dataset is distributed for academic and research purposes only. Please cite the original source when using the dataset.

  12. Data from: Set of obfuscated spam dataset by using LeetSpeak transformations...

    • data.niaid.nih.gov
    • portalcientifico.uvigo.gal
    • +1more
    Updated Mar 22, 2022
    Cite
    José Ramón Méndez (2022). Set of obfuscated spam dataset by using LeetSpeak transformations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6373652
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    José Ramón Méndez
    Enaitz Ezpeleta
    Vitor Basto Fernandes
    Urko Zurutuza
    Xabier Vidriales
    Iñaki Velez de Mendizabal
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate the text. The datasets transformed are:

    YouTube Spam Collection [2, 3] which is available on https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/.

    a subset of YouTube Comments [4, 5] which is available on http://mlg.ucd.ie/yt/.

    CSDMC2010 which is available on http://csmining.org/index.php/spam-email-datasets-.html.

    TREC2007 which is available on https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
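
For readers unfamiliar with the transformation, the following minimal sketch shows a LeetSpeak substitution and a naive reversal. The character mapping is an illustrative assumption and is not the mapping used by the dataset authors; it also obfuscates every eligible character, whereas the datasets above are only partially obfuscated.

```python
# Illustrative one-to-one mapping; real LeetSpeak has many more variants.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}
REVERSE_MAP = {leet: plain for plain, leet in LEET_MAP.items()}


def obfuscate(text):
    """Replace each mapped letter with its LeetSpeak digit."""
    return "".join(LEET_MAP.get(ch, ch) for ch in text.lower())


def deobfuscate(text):
    """Naive reversal: also rewrites digits that were genuine in the original."""
    return "".join(REVERSE_MAP.get(ch, ch) for ch in text)


spam = "free lottery tickets"
hidden = obfuscate(spam)
print(hidden)
print(deobfuscate(hidden))
```

The weakness noted in deobfuscate (genuine digits are rewritten too) is one reason practical deobfuscation usually needs context, such as a dictionary or language model, rather than a fixed reverse table.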

  13. Ai Based Fraud Detection Tools Market Report | Global Forecast From 2025 To...

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 16, 2024
    Cite
    Dataintelo (2024). Ai Based Fraud Detection Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/ai-based-fraud-detection-tools-market
    Available download formats: pptx, pdf, csv
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI-Based Fraud Detection Tools Market Outlook



    The global AI-based fraud detection tools market size was valued at approximately USD 6.5 billion in 2023 and is projected to reach USD 22.8 billion by 2032, growing at a robust CAGR of 15.1% during the forecast period. The significant growth factors driving this market include the increasing sophistication of fraudulent activities, the growing adoption of AI and machine learning technologies in various sectors, and the heightened demand for real-time fraud detection solutions.
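
The quoted growth rate can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. A minimal sketch, assuming the 2023 valuation is the base and 2032 the end point (nine compounding years; the report does not state its exact base year):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.15 means 15%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures in USD billions, taken from the paragraph above.
rate = cagr(6.5, 22.8, 2032 - 2023)
print(f"{rate:.2%}")
```

Under that assumption the result comes out close to 15%, in line with the reported 15.1%.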



    One of the primary growth factors for the AI-based fraud detection tools market is the rising complexity of fraudulent activities. In today's digital age, fraudsters are employing increasingly sophisticated techniques to breach security systems, making traditional detection methods inadequate. AI-based solutions, which leverage advanced algorithms and machine learning, are capable of analyzing large volumes of data to identify patterns and anomalies indicative of fraud. This capability is crucial for organizations seeking to protect their assets and maintain customer trust in an environment where cyber threats are continually evolving.



    Another significant growth driver is the widespread adoption of AI and machine learning technologies across various industries. Businesses are recognizing the potential of these technologies to enhance their fraud detection capabilities, leading to increased investments in AI-driven solutions. The banking and financial services sector, in particular, has been at the forefront of adopting AI-based fraud detection tools to combat financial crimes such as identity theft, credit card fraud, and money laundering. Furthermore, the retail and e-commerce sectors are increasingly implementing these tools to safeguard against fraudulent transactions and account takeovers.



    The growing demand for real-time fraud detection solutions is also propelling the market forward. Traditional fraud detection systems often rely on rule-based approaches that can be slow and reactive, allowing fraudulent activities to go undetected until significant damage has been done. In contrast, AI-based solutions can process and analyze data in real-time, enabling organizations to identify and respond to threats rapidly. This real-time capability is essential for minimizing losses and mitigating risks, particularly in sectors where the speed of transactions is critical, such as online retail and financial services.



    Regionally, North America currently dominates the AI-based fraud detection tools market, owing to the high adoption rate of advanced technologies and the presence of major industry players. However, other regions like Asia Pacific and Europe are also experiencing significant growth. Asia Pacific, in particular, is expected to exhibit the highest CAGR during the forecast period, driven by the increasing digitization of economies, rising internet penetration, and the growing awareness of cybersecurity threats. Europe is also witnessing substantial growth due to stringent regulatory requirements and the increasing focus on data privacy and security.



    Component Analysis



    The AI-based fraud detection tools market can be segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share during the forecast period. This dominance can be attributed to the continuous advancements in AI algorithms and machine learning models, which enhance the accuracy and efficiency of fraud detection systems. Furthermore, the software solutions are designed to be scalable and easily integrated into existing systems, making them an attractive option for organizations of all sizes.



    Hardware components, though not as dominant as software, play a crucial role in the deployment of AI-based fraud detection systems. High-performance computing hardware, including GPUs and specialized AI processors, are essential for handling the large datasets and complex computations required for real-time fraud detection. As the demand for more powerful and efficient hardware grows, this segment is expected to see steady growth, particularly in large enterprises that require robust infrastructure to support their AI initiatives.



    The services segment, encompassing consulting, integration, and maintenance services, is also poised for significant growth. Organizations often lack the in-house expertise required to develop and implement AI-based fraud detection systems, leading to an increased reliance on external service providers. These services help organizations to customize and opti

  14. Spam Email Detection Model

    • kaggle.com
    Updated Aug 10, 2024
    Cite
    Usamakhanswati (2024). Spam Email Detection Model [Dataset]. http://doi.org/10.34740/kaggle/ds/5524456
    Dataset updated
    Aug 10, 2024
    Dataset provided by
    Kaggle
    Authors
    Usamakhanswati
    Description

    In a bustling digital landscape where businesses and individuals alike rely on email communication, a hidden threat lurks—spam emails. These unsolicited and often malicious messages clog inboxes, steal valuable time, and even endanger sensitive data. The need for a powerful shield against this growing menace has never been more urgent.

    Enter the Spam Email Detection Model—a cutting-edge creation designed to bring order to the chaos of modern email communication. Imagine a business owner named Sarah, whose company relies heavily on email for client communication, order processing, and customer support. Every day, her inbox is flooded with hundreds of emails, many of which are nothing but spam. These emails not only waste her time but also pose a risk to her company's security. The Spam Email Detection Model is a state-of-the-art solution designed to combat the ever-growing threat of spam emails with unparalleled accuracy and efficiency. Leveraging advanced machine learning algorithms, this model achieves a remarkable 99.9% accuracy rate, far surpassing the industry standard of 50%. It intelligently distinguishes between legitimate emails and spam, learning and adapting to new patterns to ensure ongoing protection.

    Designed for seamless integration, the model can be easily implemented into any existing email system, providing businesses with a robust defense against unsolicited messages and potential security threats. Its user-friendly interface allows for effortless control and customization, making it a versatile tool for businesses of all sizes.

    By dramatically reducing the time wasted on managing spam and enhancing email security, the Spam Email Detection Model empowers businesses to focus on what truly matters, offering peace of mind in a world where digital communication is vital.

  15. Data_Sheet_1_Lumen: A machine learning framework to expose influence cues in...

    • figshare.com
    pdf
    Updated Jun 16, 2023
    Cite
    Hanyu Shi; Mirela Silva; Luiz Giovanini; Daniel Capecci; Lauren Czech; Juliana Fernandes; Daniela Oliveira (2023). Data_Sheet_1_Lumen: A machine learning framework to expose influence cues in texts.PDF [Dataset]. http://doi.org/10.3389/fcomp.2022.929515.s001
    Available download formats: pdf
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers
    Authors
    Hanyu Shi; Mirela Silva; Luiz Giovanini; Daniel Capecci; Lauren Czech; Juliana Fernandes; Daniela Oliveira
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Phishing and disinformation are popular social engineering attacks with attackers invariably applying influence cues in texts to make them more appealing to users. We introduce Lumen, a learning-based framework that exposes influence cues in text: (i) persuasion, (ii) framing, (iii) emotion, (iv) objectivity/subjectivity, (v) guilt/blame, and (vi) use of emphasis. Lumen was trained with a newly developed dataset of 3K texts comprised of disinformation, phishing, hyperpartisan news, and mainstream news. Evaluation of Lumen in comparison to other learning models showed that Lumen and LSTM presented the best F1-micro score, but Lumen yielded better interpretability. Our results highlight the promise of ML to expose influence cues in text, toward the goal of application in automatic labeling tools to improve the accuracy of human-based detection and reduce the likelihood of users falling for deceptive online content.

  16. Credit Card Fraud Detection Dataset

    • kaggle.com
    Updated May 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghanshyam Saini
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

    As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

    About the Dataset:

    This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

    Content of the Data:

    Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

    The only features that have not been transformed by PCA are:

    • Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.
    • Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

    The target variable for this classification task is:

    • Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

    Important Note on Evaluation:

    Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
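
One common estimator of AUPRC is average precision: the mean of the precision values measured at the rank of each true positive. The sketch below is a plain-Python illustration with made-up labels and scores; it ignores tied scores, and library implementations may handle ties and interpolation differently.

```python
def average_precision(y_true, scores):
    """Mean of precision values at each true positive, ranked by descending score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    return precision_sum / total_pos


# Made-up example: 2 fraud cases among 8 transactions.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.8, 0.05, 0.15]
ap = average_precision(y_true, scores)
print(ap)
```

Unlike accuracy, this score drops as soon as legitimate transactions are ranked above fraudulent ones, which is why it is more informative for rare positive classes.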

    How to Use This Dataset:

    1. Download the dataset file (likely in CSV format).
    2. Load the data using libraries like Pandas.
    3. Understand the class imbalance: Be aware that fraudulent transactions are rare.
    4. Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).
    5. Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.
    6. Build and train binary classification models to predict the 'Class' variable.
    7. Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
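
    Steps 3 and 5 can be sketched with the standard library alone. The rows below are synthetic stand-ins rather than the real file, and the function names are hypothetical. Random oversampling is only one of the techniques mentioned, and it is applied to the training part only, so that the held-out rows keep the original class ratio.

```python
import random


def stratified_split(rows, label_of, test_fraction, rng):
    """Split rows into train/test while keeping the class ratio in both."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample_minority(rows, label_of, rng):
    """Duplicate random minority rows until both classes are the same size.

    Assumes binary labels 0/1 and at least one row in each class.
    """
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted([positives, negatives], key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


rng = random.Random(0)

# Synthetic stand-in data: 990 legitimate rows and 10 fraudulent ones.
rows = [{"Amount": rng.uniform(1, 200), "Class": 0} for _ in range(990)]
rows += [{"Amount": rng.uniform(1, 200), "Class": 1} for _ in range(10)]


def get_label(row):
    return row["Class"]


train, test = stratified_split(rows, get_label, 0.2, rng)
balanced_train = oversample_minority(train, get_label, rng)

print("train frauds:", sum(get_label(r) for r in train), "of", len(train))
print("test frauds:", sum(get_label(r) for r in test), "of", len(test))
print("balanced train frauds:", sum(get_label(r) for r in balanced_train),
      "of", len(balanced_train))
```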

    Acknowledgements and Citation:

    This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

    When using this dataset in your research or projects, please cite the following works as appropriate:

    • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
    • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications, 41, 10, 4915-4928, 2014, Pergamon.
    • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems, 29, 8, 3784-3797, 2018, IEEE.
    • Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi).
    • Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion, 41, 182-194, 2018, Elsevier.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5, 4, 285-300, 2018, Springer International Publishing.
    • Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
  17. Most common scams in Singapore 2023

    • statista.com
    Updated Jun 27, 2025
    Cite
    Statista (2025). Most common scams in Singapore 2023 [Dataset]. https://www.statista.com/statistics/981340/leading-types-of-scams-singapore/
    Dataset updated
    Jun 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Singapore
    Description

    In 2023, job scams were the most common type of scam in Singapore, with around ***** cases reported. E-commerce scams also represented a prevalent form of fraud in the country, with over ***** cases reported.

    Phishing threat in Singapore

    In Singapore, around *********** different phishing URLs with a .SG domain were detected in 2022. The highest number of phishing URLs was recorded the previous year, with around ***********. Phishing attacks can take many forms, such as corporate e-mail compromise (CEC), mass phishing, or smishing. These phishing e-mails represent a crucial risk for businesses. They can also lead to ransomware infections, which have also increased in recent years.

    Data breaches

    Companies and governments are increasingly relying on technology to collect, analyze, and store personal data. This can lead to potential risks when such data is affected by cyber incidents. In Singapore, the number of exposed data points per thousand people reached ** in 2022. Over the same period, around ************ data sets were reported as leaked in the country.

  18. Data in Figures

    • plos.figshare.com
    • figshare.com
    zip
    Updated Jul 17, 2025
    Cite
    Yicheng Long (2025). Data in Figures [Dataset]. http://doi.org/10.1371/journal.pone.0327476.s001
    Available download formats: zip
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yicheng Long
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Amid substantial capital influx and the rapid evolution of online user groups, the increasing complexity of user behavior poses significant challenges to cybersecurity, particularly in the domain of vulnerability prediction. This study aims to enhance the accuracy and practical applicability of cyberspace vulnerability prediction. By incorporating the dynamics of user behavioral changes and the logic of platform scaling driven by investment, two representative cybersecurity datasets are selected for analysis: the Canadian Institute for Cybersecurity Intrusion Detection System 2017 and the Network-Based Intrusion Detection Evaluation Dataset 2015. A standardized data preprocessing pipeline is constructed, including redundancy elimination, feature selection, and sample balancing, to ensure data representativeness and compatibility. To address the limited adaptability of traditional support vector machine (SVM) models in identifying nonlinear attacks, this study introduces a distribution-driven, dynamically adaptive kernel optimization approach. This method adjusts kernel parameters or switches kernel functions in real time according to the statistical characteristics of input data, thereby improving the model’s generalization capability and responsiveness in complex attack scenarios. Performance evaluations are conducted on both datasets using cross-validation. The results show that, compared to traditional models, the improved SVM achieves an 11.2% increase in prediction accuracy. Furthermore, the model demonstrates a 22.2% improvement in computational efficiency, measured as the ratio of prediction count to processing time. It also exhibits lower false positive rates and greater stability in detecting common cyberattacks such as distributed denial of service, phishing, and malware. In addition, this study analyzes user behavioral variations under different levels of attack pressure based on network access activity. 
    Findings indicate that during periods of high platform load, attack frequency is positively correlated with users’ defensive behavior, confirming a potential causal sequence of “capital influx—user expansion—increased attack exposure.” This study offers a practical modeling framework and empirical foundation for improving predictive performance and enhancing users’ sense of cybersecurity.
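
    The efficiency measure mentioned in the abstract, prediction count divided by processing time, can be sketched as follows. This is an illustrative reading of that one sentence, not the study's code. The function names and the throughput numbers in the example are hypothetical.

```python
import time


def throughput(predict, inputs):
    """Predictions per second for one pass of `predict` over `inputs`."""
    start = time.perf_counter()
    for item in inputs:
        predict(item)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed


def relative_improvement(baseline, improved):
    """Fractional gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Measure a stand-in "model" (a trivial function) on 10,000 inputs.
rate = throughput(lambda x: x * 2, list(range(10_000)))
print(f"stand-in throughput: {rate:.0f} predictions/s")

# Hypothetical figures: 1000 vs 1222 predictions/s is a 22.2% gain.
print(f"relative improvement: {relative_improvement(1000, 1222):.1%}")
```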

  19. Medical Claims Fraud Scenarios

    • gomask.ai
    csv
    Updated Jul 12, 2025
    Cite
    GoMask.ai (2025). Medical Claims Fraud Scenarios [Dataset]. https://gomask.ai/marketplace/datasets/medical-claims-fraud-scenarios
    Available download formats: csv (Unknown)
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    GoMask.ai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    notes, claim_id, claim_date, fraud_flag, patient_id, fraud_score, patient_age, provider_id, review_date, reviewer_id, and 10 more
    Description

    This dataset contains simulated health insurance claims with detailed patient, provider, and service information, including flags and scores for a variety of fraud scenarios. It is designed to support the development and evaluation of fraud detection algorithms, audit workflows, and compliance monitoring in healthcare insurance. The dataset enables analysis of both common and rare fraudulent patterns for improved anomaly detection.

  20. ‘Fraud detection bank dataset 20K records binary ’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Fraud detection bank dataset 20K records binary ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-fraud-detection-bank-dataset-20k-records-binary-6287/e0c752fd/?iid=019-351&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Fraud detection bank dataset 20K records binary ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/volodymyrgavrysh/fraud-detection-bank-dataset-20k-records-binary on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    Banks are often exposed to fraud transactions and constantly improve systems to track them.

    Content

    Bank dataset that contains 20k+ transactions with 112 features (numerical)

    --- Original source retains full ownership of the source dataset ---
