100+ datasets found
  1. Highest number of spam e-mails sent daily 2024, by country

    • statista.com
    Updated Nov 25, 2025
    Statista (2025). Highest number of spam e-mails sent daily 2024, by country [Dataset]. https://www.statista.com/statistics/1270488/spam-emails-sent-daily-by-country/
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset authored and provided by
    Statista
    Time period covered
    Dec 8, 2024
    Area covered
    Worldwide
    Description

    As of December 8, 2024, China and the United States were the countries with the highest number of spam emails sent within one day worldwide, with around 7.8 billion. Ranking third and fourth were India and Japan, with around 7.6 billion.

    Internet and e-mail users around the world

    Between 2019 and 2024, the number of email users globally increased from 3.9 billion to 4.4 billion. Moreover, this number is expected to increase to 4.8 billion in 2027. Considering that China and India had the highest number of internet users in the world in 2023, with over 1.2 billion and 1.1 billion users respectively, e-mail usage is less popular in these countries than in the United States or Germany, for example.

    Most popular online activities in the U.S.

    Not only did the United States have the highest number of daily emails and spam emails sent as of October 2021, but email was also the most popular online activity among U.S. internet users in 2019. In fact, 90.9 percent of respondents said they were email users, more than search users, social network users, or digital video viewers.

  2. Email Spam Statistics 2026: Shocking Insights and Real Risks

    • sqmagazine.co.uk
    Updated Sep 10, 2025
    SQ Magazine (2025). Email Spam Statistics 2026: Shocking Insights and Real Risks [Dataset]. https://sqmagazine.co.uk/spam-statistics/
    Explore at:
    Dataset updated
    Sep 10, 2025
    Dataset authored and provided by
    SQ Magazine
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2024 - Dec 31, 2026
    Area covered
    Worldwide, Earth
    Description

    Explore key Spam Statistics to protect your inbox and business from hidden threats with clear, powerful insights that drive smarter defenses.

  3. Spam email Dataset

    • kaggle.com
    zip
    Updated Sep 1, 2023
    + more versions
    _w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
    Explore at:
    Available download formats: zip (2994798 bytes)
    Dataset updated
    Sep 1, 2023
    Authors
    _w1998
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Dataset Name: Spam Email Dataset

    Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

    Columns:

    text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

    spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.

    Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
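
A minimal sketch of reading a file with the two columns described above (`text` and `spam_or_not`) using only the Python standard library. The sample rows are invented for illustration; the real file name and contents are not shown in this listing, so treat them as assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical sample that mirrors the described column layout.
# The real dataset's file name and rows are not shown in the listing.
SAMPLE_CSV = """text,spam_or_not
"Subject: win a free prize now",1
"Subject: meeting notes for Monday",0
"Subject: cheap loans, act today",1
"""

def load_rows(handle):
    """Return (text, label) pairs, with the label parsed as an int (1 = spam)."""
    reader = csv.DictReader(handle)
    return [(row["text"], int(row["spam_or_not"])) for row in reader]

rows = load_rows(io.StringIO(SAMPLE_CSV))

# Count how many messages fall into each class before training anything.
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

Checking the class balance first is a cheap way to decide whether accuracy alone will be a meaningful metric for the classifier.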

  4. Spam: share of global e-mail traffic monthly 2014-2025

    • statista.com
    Updated Feb 19, 2026
    Statista (2026). Spam: share of global e-mail traffic monthly 2014-2025 [Dataset]. https://www.statista.com/statistics/420391/spam-email-traffic-share/
    Explore at:
    Dataset updated
    Feb 19, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2014 - Dec 2025
    Area covered
    Worldwide
    Description

    Spam messages accounted for **** percent of e-mail traffic in December 2025. Russia generated the largest share of unsolicited spam e-mails, with **** percent of global spam e-mails originating from the country.

    Spam worldwide

    It is almost impossible to think about e-mail without considering the issue of spam, which usually includes billions of promotional e-mails marketers send daily. As of December 2024, China and the United States had the highest number of spam e-mails sent daily. While many e-mail users believe such content belongs in their spam folder, marketing e-mails are generally harmless, if annoying to the user.

  5. Spam or real Email dataset

    • kaggle.com
    zip
    Updated Jul 4, 2025
    Abdul Wadood (2025). Spam or real Email dataset [Dataset]. https://www.kaggle.com/datasets/abdulwadood11220/spam-or-real-email-dataset
    Explore at:
    Available download formats: zip (11322 bytes)
    Dataset updated
    Jul 4, 2025
    Authors
    Abdul Wadood
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    📄 Dataset Description

    This dataset contains 5,000 sample emails labeled as either "spam" or "ham" (not spam). It is designed to help build and evaluate machine learning models for spam detection using natural language processing (NLP) techniques.

    The data is synthetically generated to reflect realistic spam and ham email patterns, including promotional content, phishing alerts, reminders, and casual conversations.

    📁 Files Included

    train.csv: Contains 4,000 labeled email samples used to train a model. Columns:
    - label: Spam classification (spam or ham)
    - text: The content of the email

    test.csv: Contains 1,000 unlabeled email samples used for testing/prediction. Columns:
    - text: The content of the email

    Note: You can evaluate your model on this test set using a private test_labels.csv if needed.

    ✅ Use Cases

    - Binary text classification (Spam vs. Ham)
    - NLP preprocessing and vectorization (TF-IDF, CountVectorizer, embeddings)
    - Model training (Naive Bayes, Logistic Regression, SVM, Transformers)
    - Evaluation metrics (Accuracy, Precision, Recall, F1-score)

    📊 Suggested Evaluation Workflow

    - Train model on train.csv
    - Predict on test.csv
    - Evaluate predictions if test_labels.csv is available (optional)
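
The train-then-predict workflow above can be sketched with a small multinomial Naive Bayes classifier written in plain Python. This is an illustration only: the training sentences are invented, the tokenizer is a simplifying assumption, and a real project would more likely use a library implementation with TF-IDF or count features.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(samples):
    """Fit a multinomial Naive Bayes model from (label, text) pairs."""
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # token frequencies per label
    vocab = set()
    for label, text in samples:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore tokens never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented stand-ins for rows of train.csv (label, text).
train = [
    ("spam", "Win a free prize now, click here"),
    ("spam", "Free loan offer, claim your prize"),
    ("ham", "Are we still meeting for lunch tomorrow"),
    ("ham", "Reminder: project meeting at noon"),
]
model = train_naive_bayes(train)

# Invented stand-ins for rows of test.csv (text only).
predictions = [predict(model, t) for t in ["claim your free prize", "lunch meeting tomorrow"]]
print(predictions)
```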

  6. Spam share of global email traffic 2011-2023

    • statista.com
    Updated Jan 8, 2026
    Statista (2026). Spam share of global email traffic 2011-2023 [Dataset]. https://www.statista.com/statistics/420400/spam-email-traffic-share-annual/
    Explore at:
    Dataset updated
    Jan 8, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    In 2023, nearly 45.6 percent of all e-mails worldwide were identified as spam, down from almost 49 percent in 2022. Although spam remains a large part of e-mail traffic, its share has decreased significantly since 2011. In 2023, the highest volume of spam e-mails was registered in May, at approximately 50 percent of e-mail traffic worldwide.

  7. Statistics on spam email interception

    • data.gov.tw
    csv
    Updated Mar 1, 2026
    Ministry of Digital Affairs (2026). Statistics on spam email interception [Dataset]. https://data.gov.tw/en/datasets/6443
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 1, 2026
    Dataset authored and provided by
    Ministry of Digital Affairs
    License

    https://data.gov.tw/license

    Description

    Internet service providers provide statistics on spam email blocking every month.

  8. Leading countries of origin of spam e-mails 2025

    • statista.com
    Updated Feb 19, 2026
    Statista (2026). Leading countries of origin of spam e-mails 2025 [Dataset]. https://www.statista.com/statistics/263086/countries-of-origin-of-spam/
    Explore at:
    Dataset updated
    Feb 19, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    Worldwide
    Description

    In 2025, Russia ranked first by its share of unsolicited spam e-mails. Overall, **** percent of global spam e-mails originated from IPs in Russia. Mainland China ranked second, with **** percent. The United States followed, accounting for ***** percent of global unsolicited spam e-mails during the measured period.

  9. Data from: Spam Mail Data Set

    • kaggle.com
    zip
    Updated May 12, 2024
    ChetanKR (2024). Spam Mail Data Set [Dataset]. https://www.kaggle.com/datasets/chetankrjnnce/spam-mail-data-set
    Explore at:
    Available download formats: zip (217758 bytes)
    Dataset updated
    May 12, 2024
    Authors
    ChetanKR
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by ChetanKR

    Released under Apache 2.0

    Contents

  10. Email Spam

    • kaggle.com
    zip
    Updated May 9, 2024
    khashayar Ahmadi (2024). Email Spam [Dataset]. https://www.kaggle.com/datasets/khashayarahmadi/email-spam
    Explore at:
    Available download formats: zip (4530251 bytes)
    Dataset updated
    May 9, 2024
    Authors
    khashayar Ahmadi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Email spam is a type of unsolicited electronic mail (email) that is sent in bulk to a large number of recipients. Spam is often used to send viruses, malware, and phishing scams. It can also be used to promote products or services.

    Email spam data is a collection of emails that have been labeled as spam or not spam. This data can be used to train and test spam filters, as well as to study the characteristics of spam emails.

    Email spam data typically includes the following fields:

    - Email: The full text of the email, including the subject and body.
    - category: spam / non-spam.
    - Body: The body of the email.

    Email spam data can be collected from a variety of sources, including:

    - Public datasets: Datasets of spam emails that have been made available for research purposes.

    Email spam data is a valuable resource for researchers and practitioners who are working on spam filtering and email classification.

    Here are some of the ways that email spam data can be used:

    - To train and test spam filters: Spam filters can be trained on email spam data to learn the characteristics of spam emails. This allows the filters to more accurately identify spam emails in the future.
    - To study the characteristics of spam emails: Email spam data can be used to study the characteristics of spam emails, such as the language used, the types of attachments, and the sender's email address. This information can help researchers to develop better spam filters and to understand the motivations of spammers.
    - To develop new spam filtering techniques: Email spam data can be used to develop new spam filtering techniques. For example, researchers can use machine learning to develop algorithms that can automatically identify spam emails.

    Email spam data is an important resource for researchers and practitioners who are working on spam filtering and email classification.

  11. Spam Emails

    • kaggle.com
    zip
    Updated Oct 9, 2023
    Abdallah Wagih Ibrahim (2023). Spam Emails [Dataset]. https://www.kaggle.com/datasets/abdallahwagih/spam-emails
    Explore at:
    Available download formats: zip (212432 bytes)
    Dataset updated
    Oct 9, 2023
    Authors
    Abdallah Wagih Ibrahim
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview: This dataset contains a collection of emails, categorized into two classes: "Spam" and "Non-Spam" (often referred to as "Ham"). These emails have been carefully curated and labeled to aid in the development of spam email detection models. Whether you are interested in email filtering, natural language processing, or machine learning, this dataset can serve as a valuable resource for training and evaluation.

    Context: Spam emails continue to be a significant issue, with malicious actors attempting to deceive users with unsolicited, fraudulent, or harmful messages. This dataset is designed to facilitate research, development, and testing of algorithms and models aimed at accurately identifying and filtering spam emails, helping protect users from various threats.

    Content: The dataset includes the following features:
    - Message: The content of the email, including the subject line and message body.
    - Category: Categorizes each email as either "Spam" or "Ham" (Non-Spam).

    Potential Use Cases:
    - Email Filtering: Develop and evaluate email filtering systems that automatically classify incoming emails as spam or non-spam.
    - Natural Language Processing (NLP): Use the email text for text classification, topic modeling, and sentiment analysis.
    - Machine Learning: Create machine learning models for spam detection, potentially employing various algorithms and techniques.
    - Feature Engineering: Explore email content features that contribute to spam classification accuracy.
    - Data Analysis: Investigate patterns and trends in spam email content and characteristics.

    License: Please note that this dataset is for research and analysis purposes only and may be subject to copyright and data use restrictions. Ensure compliance with relevant policies when using this data.

  12. spam-messages

    • huggingface.co
    Updated Aug 24, 2025
    + more versions
    Michael Shenoda (2025). spam-messages [Dataset]. https://huggingface.co/datasets/mshenoda/spam-messages
    Explore at:
    Dataset updated
    Aug 24, 2025
    Authors
    Michael Shenoda
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    The dataset is composed of messages labeled as ham or spam, merged from three data sources:

    - SMS Spam Collection: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
    - Telegram Spam Ham: https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
    - Enron Spam: https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only used message column and labels)

    The prepare script for enron is available at… See the full description on the dataset page: https://huggingface.co/datasets/mshenoda/spam-messages.

  13. Spam Mails Dataset - FAIR experiment

    • test.researchdata.tuwien.ac.at
    application/x-hdf5 +3
    Updated Apr 25, 2025
    Nicolas Bernal (2025). Spam Mails Dataset - FAIR experiment [Dataset]. http://doi.org/10.70124/0e1sf-saz86
    Explore at:
    Available download formats: application/x-hdf5, png, txt, csv
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Nicolas Bernal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The Spam Mail dataset is a collection of 5,171 emails that have been classified as spam or ham (non-spam). This dataset was originally created in 2006 for research purposes in the field of spam detection and filtering using machine learning techniques, specifically a Naive Bayes classifier as described in the paper "Spam Filtering with Naive Bayes - Which Naive Bayes?" by Metsis, Androutsopoulos, and Paliouras.
    The "ham" emails were drawn mainly from the inboxes of 6 users of the company "Enron", and the "spam" emails were collected from various sources, including the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by the authors themselves.
    The emails were preprocessed to remove any HTML tags, and emails with non-Latin characters were removed to avoid any possible bias, since all "ham" emails are written with Latin characters.
    The original data can be found in CSV format on Kaggle at: https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data

    Project description

    In this project we will use the Spam Mail dataset to train a Neural Network model to classify emails as spam or ham. The dataset will be further preprocessed to remove any unnecessary characters like stopwords and punctuation.
    The emails will also be tokenized and converted into a format suitable for training the model, but this last step will be performed in the code itself so it is not included in the dataset.
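
A rough sketch of the kind of cleanup described above (removing punctuation and stopwords, then tokenizing). The stopword list and the function name are assumptions made for illustration; the project's actual preprocessing code is not shown in this listing.

```python
import string

# A tiny illustrative stopword list (an assumption; the project's list is not given).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for", "you"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords; returns a token list."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

tokens = preprocess("Congratulations! You are the winner of a FREE cruise.")
print(tokens)
```

Whitespace splitting is the simplest possible tokenizer; a neural model would typically map these tokens to integer ids afterwards, which matches the note that the final conversion happens in the training code rather than in the dataset.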

    Files

    In this repository you will find the following files:
    - README.md: Project overview, dataset source, structure, and dependency information.
    - confusion_matrix.png: A confusion matrix that shows the performance of the model on the test set.
    - evaluation_metrics.txt: Text summary of evaluation metrics: accuracy, precision, recall, and F1-score.
    - test_predictions.csv: A CSV file that contains the predictions of the model on the test set.
    - top_spam_words.png: A bar chart showing the top 10 most frequent words in correctly predicted spam emails.
    - spam_classifier.h5: The trained model file, which can be used to make predictions on new emails.
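
For reference, the four metrics named for evaluation_metrics.txt can be computed from true and predicted labels as sketched below. The labels are invented for illustration and are not results from this repository.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels, only to show the calculation.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```
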
  14. generated-e-mail-spam

    • huggingface.co
    Updated Sep 23, 2023
    Unique Data (2023). generated-e-mail-spam [Dataset]. https://huggingface.co/datasets/UniqueData/generated-e-mail-spam
    Explore at:
    Dataset updated
    Sep 23, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The dataset consists of a CSV file containing 300 generated email spam messages. Each row in the file represents a separate email message, with its title and text. The dataset aims to facilitate the analysis and detection of spam emails. It can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.

  15. SMS Spam Collection Data Set

    • academictorrents.com
    bittorrent
    Updated Nov 28, 2015
    + more versions
    Tiago A. Almeida and José María Gómez Hidalgo (2015). SMS Spam Collection Data Set [Dataset]. https://academictorrents.com/details/25932ba42d983dd7b4474d8f59ab56cdc25d9107
    Explore at:
    Available download formats: bittorrent (695379)
    Dataset updated
    Nov 28, 2015
    Dataset authored and provided by
    Tiago A. Almeida and José María Gómez Hidalgo
    License

    https://academictorrents.com/nolicensespecified

    Description

    Data Set Information: This corpus has been collected from free or free-for-research sources on the Internet:

    - A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
    - A subset of 3,375 randomly chosen SMS ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were

  16. sms_spam

    • huggingface.co
    Updated Aug 28, 2023
    UC Irvine (2023). sms_spam [Dataset]. https://huggingface.co/datasets/ucirvine/sms_spam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset authored and provided by
    UC Irvine
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary

    The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It has one collection composed of 5,574 English, real and non-encoded messages, tagged as either legitimate (ham) or spam.

      Supported Tasks and Leaderboards

    [More Information Needed]

      Languages

    English

      Dataset Structure

      Data Instances

    [More Information… See the full description on the dataset page: https://huggingface.co/datasets/ucirvine/sms_spam.

  17. Scam Statistics 2026: How Much Money’s Lost and What’s Coming Next

    • sqmagazine.co.uk
    Updated Oct 6, 2025
    SQ Magazine (2025). Scam Statistics 2026: How Much Money’s Lost and What’s Coming Next [Dataset]. https://sqmagazine.co.uk/scam-statistics/
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset authored and provided by
    SQ Magazine
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2024 - Dec 31, 2026
    Area covered
    Earth, Worldwide
    Description

    Discover key scam statistics, including fraud types, victim demographics, financial losses, digital scam trends, and prevention tips!

  18. Data from: Persuasion Sentences in Spam Email (PerSentSE)

    • data-staging.niaid.nih.gov
    • portalcientifico.unileon.es
    Updated Jan 8, 2025
    Jáñez-Martino, Francisco; Barrón-Cedeño, Alberto; ALAIZ-RODRÍGUEZ, ROCÍO; González-Castro, Víctor (2025). Persuasion Sentences in Spam Email (PerSentSE) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14585763
    Explore at:
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    University of Bologna
    Universidad de León
    Authors
    Jáñez-Martino, Francisco; Barrón-Cedeño, Alberto; ALAIZ-RODRÍGUEZ, ROCÍO; González-Castro, Víctor
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    How to Access:

    To access this dataset, please contact Francisco Janez via email at francisco.janez@unileon.es. Access will be granted based on specific requests.

    Purpose: The PerSentSE corpus was developed to study persuasive techniques in spam emails. It includes 130 emails randomly selected from the SpamArchive2122 dataset, which contains over 20,000 spam emails in English.

    Methodology:

    Segmentation: Emails were divided into sentences using the NLTK library.

    Annotation: Eight persuasive techniques, along with a "non-persuasion" class, were identified. Two expert annotators labeled an initial subset of emails to measure inter-annotator agreement, achieving a final acceptable level (γ = 0.63).

    Corpus Statistics:

    Total sentences: 1,075

    Persuasive sentences: 216 (20.1%)

    Persuasion Distribution by Email Sections (Table 7):

    Subject lines: 35.59% persuasive, with an average of 1.62 techniques.

    Greeting section: 54.17% persuasive, averaging 1.46 techniques.

    Email body: 82.46% persuasive, with 5.51 techniques on average.

    Farewell section: 31.43% persuasive, averaging 1.45 techniques.

    Co-occurrence of Techniques (Figure 2): Some persuasive techniques frequently appeared together:

    Appeal to Fear/Prejudice with Loaded Language: 25 instances.

    Exaggeration/Minimization with Loaded Language: 24 instances.

    Appeal to Fear/Prejudice with Exaggeration/Minimization: 20 instances.

    Findings: The body section of emails concentrates the highest number of persuasive elements, contrary to earlier studies focusing on subject lines alone. This suggests that spam emails rely heavily on persuasive content in their main text.

  19. all-scam-spam

    • huggingface.co
    Updated Sep 2, 2002
    Fred Zhang (2002). all-scam-spam [Dataset]. https://huggingface.co/datasets/FredZhang7/all-scam-spam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2002
    Authors
    Fred Zhang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. is_spam=1 means spam and is_spam=0 means ham. 1040 rows of balanced data, consisting of casual conversations and scam emails in ≈10 languages, were manually collected and annotated by me, with some help from ChatGPT.

      Some preprocessing algorithms

    spam_assassin.js, followed by spam_assassin.py
    enron_spam.py

      Data composition

      Description

    To make the text… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/all-scam-spam.

  20. A Balanced Dataset for Spam and Smishing Detection using Large Language...

    • data.mendeley.com
    Updated Jul 4, 2025
    Miriam Munoz (2025). A Balanced Dataset for Spam and Smishing Detection using Large Language Models (LLMs) [Dataset]. http://doi.org/10.17632/vmg875v4xs.1
    Explore at:
    Dataset updated
    Jul 4, 2025
    Authors
    Miriam Munoz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 10,191 labeled SMS messages for training and testing spam and smishing detection machine learning models. A large language model (LLM) was trained to create this dataset.

    Structure

    This dataset contains five columns:
    • LABEL: A categorical value indicating the type of message. The values are:
        o Ham: Benign (non-malicious) message
        o Spam: Unsolicited or junk message
        o Smishing: SMS phishing message to deceive recipients into giving away their sensitive personal information
    • TEXT: The content of the message
    • URL: Indicates whether a URL is present in the message (Yes/No)
    • EMAIL: Indicates whether an email address is present in the message (Yes/No)
    • PHONE: Indicates whether a phone number is present in the message (Yes/No)

    Key Features

    The dataset is balanced to prevent bias in classification tasks:
    • ham: 3,397 messages
    • spam: 3,397 messages
    • smishing: 3,397 messages
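
As an illustration of the Yes/No indicator columns (URL, EMAIL, PHONE) described above, the sketch below derives similar flags from message text with regular expressions. The patterns and the function name are rough heuristics assumed for this example; how the dataset authors actually computed these columns is not stated.

```python
import re

# Heuristic patterns (assumptions for illustration, not the authors' rules).
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def flag_message(text):
    """Return Yes/No flags for URL, email address, and phone number presence."""
    def yes_no(pattern):
        return "Yes" if pattern.search(text) else "No"
    return {"URL": yes_no(URL_RE), "EMAIL": yes_no(EMAIL_RE), "PHONE": yes_no(PHONE_RE)}

flags = flag_message("Your parcel is held. Pay at http://example.test/pay or call 0800 123 4567")
print(flags)

# The stated class sizes are consistent with the stated total.
assert 3 * 3397 == 10191
```

Simple patterns like these will miss obfuscated links and unusual phone formats, so they are best treated as a starting point for feature engineering rather than a faithful reproduction of the columns.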

    Source and Citation

    The following publicly available dataset is used for training of the LLM: Mishra, Sandhya; Soni, Devpriya (2022), “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION”, Mendeley Data, V1, doi: 10.17632/f45bkkt8pr.1

    Use Cases
    • Text classification research
    • Phishing and fraud detection models
    • LLM fine-tuning or prompt engineering for safety and content moderation
    • Educational demonstrations in cybersecurity, machine learning (ML) or natural language processing (NLP)

Statista (2025). Highest number of spam e-mails sent daily 2024, by country [Dataset]. https://www.statista.com/statistics/1270488/spam-emails-sent-daily-by-country/

Highest number of spam e-mails sent daily 2024, by country

Explore at:
5 scholarly articles cite this dataset (View in Google Scholar)
