As of December 8, 2024, China and the United States were the countries with the highest number of spam emails sent within one day worldwide, with around 7.8 billion. Ranking third and fourth were India and Japan, with around 7.6 billion.
Internet and e-mail users around the world
Between 2019 and 2024, the number of email users globally increased from 3.9 billion to 4.4 billion. Moreover, this number is expected to increase to 4.8 billion in 2027. Although China and India had the highest numbers of internet users in the world in 2023, with over 1.2 billion and 1.1 billion users respectively, e-mail usage is less popular in these countries than in, for example, the United States or Germany.
Most popular online activities in the U.S.
Not only did the United States have the highest number of daily emails and spam emails sent as of October 2021, but email was also the most popular online activity among internet users in 2019. In fact, 90.9 percent of respondents said they were email users, more than search users, social network users, or digital video viewers.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Explore key Spam Statistics to protect your inbox and business from hidden threats with clear, powerful insights that drive smarter defenses.
http://www.gnu.org/licenses/lgpl-3.0.html
Dataset Name: Spam Email Dataset
Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.
Columns:
text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.
spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
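As a minimal sketch of how the two columns described above might be read, the following Python snippet parses a CSV with `text` and `spam_or_not` columns and reports the class balance. Only the column names and the 1/0 label meaning come from the description; the inline sample rows and the helper names `load_rows` and `label_counts` are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical inline sample standing in for the real CSV file; only the
# column names `text` and `spam_or_not` come from the dataset description.
SAMPLE_CSV = """text,spam_or_not
"Win a free prize now, click here",1
"Meeting moved to 3pm, see agenda attached",0
"Cheap loans approved instantly",1
"Can you review my pull request today?",0
"Lunch tomorrow?",0
"""


def load_rows(handle):
    """Read (text, label) pairs; the label is an int where 1 = spam, 0 = not spam."""
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["spam_or_not"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((record["text"], label))
    return rows


def label_counts(rows):
    """Count how many messages carry each label."""
    return Counter(label for _, label in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = label_counts(rows)
spam_share = counts[1] / len(rows)
print(f"spam: {counts[1]}, not spam: {counts[0]}, spam share: {spam_share:.2f}")
```

To use it on an actual download, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved.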
Spam messages accounted for **** percent of e-mail traffic in December 2025. Russia generated the largest share of unsolicited spam e-mails, with **** percent of global spam e-mails originating from the country.
Spam worldwide
It is almost impossible to think about e-mail without considering the issue of spam, which usually includes billions of promotional e-mails marketers send daily. As of December 2024, China and the United States had the highest number of spam e-mails sent daily. While many e-mail users believe such content belongs in their spam folder, marketing e-mails are generally harmless, if annoying, to the user.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
📄 Dataset Description
This dataset contains 5,000 sample emails labeled as either "spam" or "ham" (not spam). It is designed to help build and evaluate machine learning models for spam detection using natural language processing (NLP) techniques.
The data is synthetically generated to reflect realistic spam and ham email patterns, including promotional content, phishing alerts, reminders, and casual conversations.
📁 Files Included
train.csv Contains 4,000 labeled email samples used to train a model. Columns:
label: Spam classification (spam or ham)
text: The content of the email
test.csv Contains 1,000 unlabeled email samples used for testing/prediction. Columns:
text: The content of the email
Note: You can evaluate your model on this test set using a private test_labels.csv if needed.
✅ Use Cases
Binary text classification (Spam vs. Ham)
NLP preprocessing and vectorization (TF-IDF, CountVectorizer, embeddings)
Model training (Naive Bayes, Logistic Regression, SVM, Transformers)
Evaluation metrics (Accuracy, Precision, Recall, F1-score)
📊 Suggested Evaluation Workflow
Train model on train.csv
Predict on test.csv
Evaluate predictions if test_labels.csv is available (optional)
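The train, predict, and evaluate workflow above can be sketched in plain Python. This is an illustration under stated assumptions rather than the dataset author's method: the small in-memory lists stand in for train.csv, test.csv, and the optional test_labels.csv, and the `NaiveBayes` class and `evaluate` helper are hypothetical names for a simple multinomial Naive Bayes with add-one smoothing and the four metrics listed under Use Cases.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict_one(self, text):
        # Words never seen during training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / n_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical stand-ins for train.csv (label, text) and test.csv (text only).
train_texts = [
    "win a free prize now",
    "claim your free cash reward",
    "urgent verify your account to win",
    "meeting at noon tomorrow",
    "please review the project report",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize reward", "project meeting tomorrow"]
test_labels = ["spam", "ham"]  # stands in for the optional test_labels.csv

model = NaiveBayes().fit(train_texts, train_labels)
predictions = model.predict(test_texts)
metrics = evaluate(test_labels, predictions)
print(predictions, metrics)
```

On real data the same three steps apply, although a library implementation with TF-IDF or count vectorization, as listed under Use Cases, would normally replace the hand-written classifier.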
In 2023, nearly 45.6 percent of all e-mails worldwide were identified as spam, down from almost 49 percent in 2022. Although spam remains a large part of e-mail traffic, its share has decreased significantly since 2011. In 2023, the highest volume of spam e-mails was registered in May, when it accounted for approximately 50 percent of e-mail traffic worldwide.
https://data.gov.tw/license
Internet service providers provide statistics on spam email blocking every month.
In 2025, Russia ranked first by its share of unsolicited spam e-mails. Overall, **** percent of global spam e-mails originated from IPs in Russia. Mainland China ranked second, with **** percent. The United States followed, accounting for ***** percent of global unsolicited spam e-mails during the measured period.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by ChetanKR
Released under Apache 2.0
http://opendatacommons.org/licenses/dbcl/1.0/
Email spam is a type of unsolicited electronic mail (email) that is sent in bulk to a large number of recipients. Spam is often used to send viruses, malware, and phishing scams. It can also be used to promote products or services.
Email spam data is a collection of emails that have been labeled as spam or not spam. This data can be used to train and test spam filters, as well as to study the characteristics of spam emails.
Email spam data typically includes the following fields:
Email: The full text of the email, including the subject and body.
category: spam/non-spam.
Body: The body of the email.
Email spam data can be collected from a variety of sources, including:
Public datasets: Datasets of spam emails that have been made available for research purposes.
Email spam data is a valuable resource for researchers and practitioners who are working on spam filtering and email classification.
Here are some of the ways that email spam data can be used:
To train and test spam filters: Spam filters can be trained on email spam data to learn the characteristics of spam emails. This allows the filters to more accurately identify spam emails in the future.
To study the characteristics of spam emails: Email spam data can be used to study the characteristics of spam emails, such as the language used, the types of attachments, and the sender's email address. This information can help researchers to develop better spam filters and to understand the motivations of spammers.
To develop new spam filtering techniques: Email spam data can be used to develop new spam filtering techniques. For example, researchers can use machine learning to develop algorithms that can automatically identify spam emails.
Email spam data is an important resource for researchers and practitioners who are working on spam filtering and email classification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview: This dataset contains a collection of emails, categorized into two classes: "Spam" and "Non-Spam" (often referred to as "Ham"). These emails have been carefully curated and labeled to aid in the development of spam email detection models. Whether you are interested in email filtering, natural language processing, or machine learning, this dataset can serve as a valuable resource for training and evaluation.
Context: Spam emails continue to be a significant issue, with malicious actors attempting to deceive users with unsolicited, fraudulent, or harmful messages. This dataset is designed to facilitate research, development, and testing of algorithms and models aimed at accurately identifying and filtering spam emails, helping protect users from various threats.
Content: The dataset includes the following features:
- Message: The content of the email, including the subject line and message body.
- Category: Categorizes each email as either "Spam" or "Ham" (Non-Spam).
Potential Use Cases:
- Email Filtering: Develop and evaluate email filtering systems that automatically classify incoming emails as spam or non-spam.
- Natural Language Processing (NLP): Use the email text for text classification, topic modeling, and sentiment analysis.
- Machine Learning: Create machine learning models for spam detection, potentially employing various algorithms and techniques.
- Feature Engineering: Explore email content features that contribute to spam classification accuracy.
- Data Analysis: Investigate patterns and trends in spam email content and characteristics.
License: Please note that this dataset is for research and analysis purposes only and may be subject to copyright and data use restrictions. Ensure compliance with relevant policies when using this data.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset
The dataset is composed of messages labeled by ham or spam, merged from three data sources:
SMS Spam Collection: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
Telegram Spam Ham: https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
Enron Spam: https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only used message column and labels)
The prepare script for enron is available at… See the full description on the dataset page: https://huggingface.co/datasets/mshenoda/spam-messages.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of a CSV file containing 300 generated email spam messages. Each row in the file represents a separate email message, with its title and text. The dataset aims to facilitate the analysis and detection of spam emails. It can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.
https://academictorrents.com/nolicensespecified
Data Set Information: This corpus has been collected from free or free-for-research sources on the Internet:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 randomly chosen SMS ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were
https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It has one collection composed of 5,574 English, real and non-encoded messages, tagged as either legitimate (ham) or spam.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
English
Dataset Structure
Data Instances
[More Information… See the full description on the dataset page: https://huggingface.co/datasets/ucirvine/sms_spam.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Discover key scam statistics, including fraud types, victim demographics, financial losses, digital scam trends, and prevention tips!
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
How to Access:
To access this dataset, please contact Francisco Janez via email at francisco.janez@unileon.es. Access will be granted based on specific requests.
Purpose: The PerSentSE corpus was developed to study persuasive techniques in spam emails. It includes 130 emails randomly selected from the SpamArchive2122 dataset, which contains over 20,000 spam emails in English.
Methodology:
Segmentation: Emails were divided into sentences using the NLTK library.
Annotation: Eight persuasive techniques, along with a "non-persuasion" class, were identified. Two expert annotators labeled an initial subset of emails to measure inter-annotator agreement, achieving a final acceptable level (γ = 0.63).
Corpus Statistics:
Total sentences: 1,075
Persuasive sentences: 216 (20.1%)
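The percentage in the corpus statistics follows directly from the two counts. The short check below only reproduces that arithmetic; the remaining-sentence figure is derived here by subtraction, on the assumption that every sentence not counted as persuasive falls into the "non-persuasion" class, and is not a number reported in the description.

```python
total_sentences = 1075       # from the corpus statistics
persuasive_sentences = 216   # from the corpus statistics

# Share of persuasive sentences, rounded to one decimal place as in the description.
persuasive_share = round(100 * persuasive_sentences / total_sentences, 1)

# Derived by subtraction (assumption: all remaining sentences are non-persuasive).
remaining_sentences = total_sentences - persuasive_sentences

print(f"{persuasive_share}% persuasive, {remaining_sentences} remaining sentences")
```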
Persuasion Distribution by Email Sections (Table 7):
Subject lines: 35.59% persuasive, with an average of 1.62 techniques.
Greeting section: 54.17% persuasive, averaging 1.46 techniques.
Email body: 82.46% persuasive, with 5.51 techniques on average.
Farewell section: 31.43% persuasive, averaging 1.45 techniques.
Co-occurrence of Techniques (Figure 2): Some persuasive techniques frequently appeared together:
Appeal to Fear/Prejudice with Loaded Language: 25 instances.
Exaggeration/Minimization with Loaded Language: 24 instances.
Appeal to Fear/Prejudice with Exaggeration/Minimization: 20 instances.
Findings: The body section of emails concentrates the highest number of persuasive elements, contrary to earlier studies focusing on subject lines alone. This suggests that spam emails rely heavily on persuasive content in their main text.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. is_spam=1 means spam and is_spam=0 means ham. 1040 rows of balanced data, consisting of casual conversations and scam emails in ≈10 languages, were manually collected and annotated by me, with some help from ChatGPT.
Some preprocessing algorithms
spam_assassin.js, followed by spam_assassin.py
enron_spam.py
Data composition
Description
To make the text… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/all-scam-spam.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 10,191 labeled SMS messages for training and testing spam and smishing detection machine learning models. A large language model (LLM) was trained to create this dataset.
Structure
This dataset contains five columns:
• LABEL: A categorical value indicating the type of message. The values are:
o Ham: Benign (non-malicious) message
o Spam: Unsolicited or junk message
o Smishing: SMS phishing message to deceive recipients into giving away their sensitive personal information
• TEXT: The content of the message
• URL: Indicates whether a URL is present in the message (Yes/No)
• EMAIL: Indicates whether an email address is present in the message (Yes/No)
• PHONE: Indicates whether a phone number is present in the message (Yes/No)
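The URL, EMAIL, and PHONE columns are Yes/No presence flags. The description does not say how they were produced, so the following is only a rough sketch of how comparable flags could be derived from the TEXT column with regular expressions. The patterns and the `contact_flags` helper are assumptions chosen for illustration and will miss or over-match some real-world formats.

```python
import re

# Deliberately simple, hypothetical patterns; not the dataset's own rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least six digits/separators, then a closing digit.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{6,}\d(?!\w)")


def yes_no(found):
    """Map a truthy or falsy match result to the Yes/No strings used by the columns."""
    return "Yes" if found else "No"


def contact_flags(text):
    """Return URL, EMAIL and PHONE presence flags for one message."""
    return {
        "URL": yes_no(URL_RE.search(text)),
        "EMAIL": yes_no(EMAIL_RE.search(text)),
        "PHONE": yes_no(PHONE_RE.search(text)),
    }


# Hypothetical example messages, not taken from the dataset.
examples = [
    "Your parcel is held. Pay the fee at http://example.com/pay",
    "Call 0800 123 4567 to claim your prize",
    "Send your details to support@example.com",
    "See you at 5",
]
for message in examples:
    print(contact_flags(message), message)
```

Flags like these can serve as extra features alongside the message text when separating ham, spam, and smishing.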
Key Features
The dataset is balanced to prevent bias in classification tasks:
• ham: 3,397 messages
• spam: 3,397 messages
• smishing: 3,397 messages
Source and Citation
The following publicly available dataset is used for training of the LLM: Mishra, Sandhya; Soni, Devpriya (2022), “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION”, Mendeley Data, V1, doi: 10.17632/f45bkkt8pr.1
Use Cases
• Text classification research
• Phishing and fraud detection models
• LLM fine-tuning or prompt engineering for safety and content moderation
• Educational demonstrations in cybersecurity, machine learning (ML) or natural language processing (NLP)