Spam Classification Dataset
Overview
The Spam Classification Dataset contains a collection of SMS messages labeled as either "spam" or "ham" (non-spam). This dataset is designed for binary text classification tasks, where the goal is to classify an SMS message as either spam or non-spam based on its content.
Dataset Structure
The dataset is provided as a single CSV file named spam.csv. It contains 5,572 entries, with each entry corresponding to an SMS message.… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/spam-classify.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explore our comprehensive Spam SMS Classification Dataset designed for NLP and machine learning research.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a CSV file describing 5,157 randomly selected emails and their labels for spam or not-spam classification. The file contains 5,157 rows, one per email, and 2 columns: the first gives the email category (spam or ham) and the second gives the text of the email.
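A two-column layout like this can be read with the standard library alone. The sketch below is illustrative: the header names `Category` and `Message` and the three sample rows are assumptions standing in for the real file, which is not reproduced here.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real file; header names and rows are assumptions.
SAMPLE_CSV = """Category,Message
ham,"Are we still meeting at noon?"
spam,"WINNER! Claim your free prize now"
ham,"Please find the report attached, thanks"
"""


def count_labels(csv_text):
    """Count the values found in the first (label) column of a two-column CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    # Skip blank rows; normalise case so "Spam" and "spam" count together.
    return Counter(row[0].strip().lower() for row in reader if row)


label_counts = count_labels(SAMPLE_CSV)
print(label_counts)
```

Checking the label distribution first is a cheap way to spot class imbalance before training anything on the data.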
Anik3t/spam-classification-new dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Email Spam Classification
The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems. The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/email-spam-classification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Enhanced Email Spam Detection Dataset contains a diverse collection of 10,000 email samples enriched with metadata and linguistic features to support binary spam classification. This cleaned and feature-engineered dataset is ideal for building machine learning models, conducting exploratory data analysis, and developing NLP-based spam filters.
It is especially useful for data scientists, security researchers, and NLP practitioners interested in understanding spam patterns and creating robust email classification systems.
✅ Key Features of the Dataset:
✅ Balanced and labeled dataset for spam (1) and non-spam (0) emails
✅ Cleaned structure with no missing values
✅ Includes sender domain, subject lines, and content-based metrics
✅ Engineered features like punctuation ratio, word length, and spam word flags
✅ Suitable for binary classification, model benchmarking, and text pattern analysis
This dataset provides a strong foundation for spam detection models, enabling pattern discovery across various email features such as urgency cues, promotional language, and sender behavior.
This dataset consists of 10,000 structured email records with various derived features that quantify the likelihood of spam. It has been enhanced by extracting numeric and textual indicators from the email content and headers.
File Type: CSV
Data Rows: 10,000
Data Fields: Email metadata, content metrics, keyword flags, and label column
Column Name | Description |
---|---|
id | Unique identifier for each email. |
label | Binary label (1 = spam, 0 = not spam). |
subject | Subject line of the email. |
sender_domain | Domain of the email sender. |
has_url | Indicates if the email contains a URL. |
email_length | Number of characters in the email. |
word_count | Total word count in the email. |
char_count | Total character count excluding spaces. |
digit_count | Number of numeric digits in the email. |
uppercase_words | Number of fully uppercase words. |
exclamations | Number of exclamation marks used. |
avg_word_length | Average length of words in the email. |
punc_ratio | Ratio of punctuation marks to characters. |
has_noreply | Indicates presence of 'noreply' in sender or text. |
has_free | Binary flag for the word "free". |
has_win | Binary flag for the word "win". |
has_winner | Binary flag for the word "winner". |
has_click | Binary flag for the word "click". |
has_offer | Binary flag for promotional "offer". |
has_urgent | Binary flag for urgency words like "urgent". |
has_limited | Indicates limited-time offers or phrases. |
has_buy | Binary flag for commercial intent ("buy"). |
has_now | Indicates time-sensitive prompts ("now"). |
has_money | Binary flag for monetary terms ("money"). |
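Several of the columns above can be approximated directly from raw email text. The following is a minimal sketch under stated assumptions: the function name `extract_features`, the URL pattern, and the exact counting rules are illustrative guesses, and the dataset's own feature definitions may differ.

```python
import re
import string

# Keywords behind the has_* flags listed in the table above.
SPAM_WORDS = ("free", "win", "winner", "click", "offer",
              "urgent", "limited", "buy", "now", "money")


def extract_features(text):
    """Approximate some of the tabulated features for one email body (assumed rules)."""
    words = text.split()
    # Whole-word tokens, so "win" does not match inside "winner".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    n_chars = len(text)
    features = {
        "has_url": int(bool(re.search(r"https?://|www\.", text, re.IGNORECASE))),
        "email_length": n_chars,
        "word_count": len(words),
        "char_count": sum(1 for c in text if not c.isspace()),
        "digit_count": sum(1 for c in text if c.isdigit()),
        "uppercase_words": sum(1 for w in words if w.isalpha() and w.isupper()),
        "exclamations": text.count("!"),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punc_ratio": (sum(1 for c in text if c in string.punctuation) / n_chars
                       if n_chars else 0.0),
    }
    for word in SPAM_WORDS:
        features[f"has_{word}"] = int(word in tokens)
    return features


example = extract_features("WIN a FREE phone now! Click http://example.com")
print(example["uppercase_words"], example["has_url"], example["has_winner"])
```

Matching on whole tokens rather than substrings keeps `has_win` and `has_winner` independent, which seems to be the intent of listing them as separate columns.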
https://creativecommons.org/publicdomain/zero/1.0/
The email spam dataset is a collection of email messages labeled as either spam or non-spam (ham). It includes the full text of each email along with its classification, making it a valuable resource for developing and testing spam detection algorithms. The dataset is designed for researchers and data scientists working on natural language processing (NLP), machine learning, and email filtering systems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📲 SMS OTP Spam Dataset
A synthetic dataset of 10,000 OTP-style SMS messages for spam classification tasks. The dataset includes both valid and spam-like messages, with labels for message validity and delivery status.
📊 Dataset Summary
Total samples: 10,000
Valid messages: 90%
Not valid (spam-like): 10%
Status types: delivered, failed, spam, bounced, expired
Each entry includes:
phone_id: Synthetic phone number
sms_text: Message content
label: valid or not valid… See the full description on the dataset page: https://huggingface.co/datasets/alusci/sms-otp-spam-dataset.
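The 90/10 split stated in the summary has a practical consequence worth making explicit: a model that always predicts "valid" already reaches 90% accuracy. The sketch below works through that arithmetic and one common reweighting heuristic (total samples divided by number of classes times class count); the variable names are illustrative and nothing here comes from the dataset's own tooling.

```python
TOTAL_SAMPLES = 10_000
VALID_PERCENT = 90  # share of "valid" messages stated in the summary

n_valid = TOTAL_SAMPLES * VALID_PERCENT // 100
n_not_valid = TOTAL_SAMPLES - n_valid

# Accuracy of a trivial baseline that labels every message "valid".
baseline_accuracy = n_valid / TOTAL_SAMPLES

# "Balanced" class weights: n_samples / (n_classes * class_count).
# This is a common heuristic, not something the dataset prescribes.
counts = {"valid": n_valid, "not valid": n_not_valid}
class_weights = {
    label: TOTAL_SAMPLES / (len(counts) * count)
    for label, count in counts.items()
}

print(n_valid, n_not_valid, baseline_accuracy)
print(class_weights)
```

With this imbalance, accuracy alone says little; per-class recall or F1 on the "not valid" class is a more informative measure.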
https://academictorrents.com/nolicensespecified
Email image spam classification has been a known problem since 2005. Of the several approaches to this task, recent ones use convolutional neural networks (CNNs). We propose a novel approach to image spam classification based on a CNN and transfer learning: ResNet v1 is used for semantic feature extraction and a one-layer feedforward neural network for classification. We show that this approach achieves state-of-the-art performance on publicly available datasets: a 99% F1-score on two datasets [dredze 2007, Princeton] and a 96% F1-score on the combination of these datasets. Given the availability of GPUs, this approach may be used for just-in-time classification in anti-spam systems that handle huge volumes of email. We also observed that the publicly available datasets mentioned above are no longer representative. We overcame this limitation by using a much richer dataset drawn from one week of real traffic at the freemail provider Email.
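For readers unfamiliar with the metric quoted in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from scratch on made-up labels; it illustrates the metric only and is not the authors' evaluation code.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = spam image, 0 = legitimate image.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(round(score, 3))
```

Here two of three spam images are caught and one legitimate image is flagged, so precision and recall are both 2/3 and so is the F1-score.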
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a CSV file containing 83,446 email records, each labeled as either spam or not-spam. It was formed by combining the 2007 TREC Public Spam Corpus and the Enron-Spam Dataset.
Code for combining and processing the two datasets: https://github.com/PuruSinghvi/Spam-Email-Classifier/blob/main/Combining%20Datasets.ipynb
A spam email classifier has been trained and built using this dataset.
It can be found here: https://github.com/PuruSinghvi/Spam-Email-Classifier
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Spam Detection Dataset
This is the dataset for a spam classification task. It contains:
'train' subset with 8175 samples
'validation' subset with 1362 samples
'test' subset with 1636 samples
Source and modifications
This dataset is cloned from Deysi/spam-detection-dataset with the following added processing:
Convert the 'string' label to an 'id' label so that the dataset can be used and trained directly with the transformers Trainer
Split the original 'test' dataset (2725 samples) into 2… See the full description on the dataset page: https://huggingface.co/datasets/tanquangduong/spam-detection-dataset-splits.
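The string-to-id conversion mentioned above can be sketched as a simple lookup. The label strings `not_spam` and `spam`, the field names `text` and `label`, and the id assignment are all assumptions made for illustration; the dataset page defines the actual values.

```python
# Hypothetical mapping; the dataset's real label strings and ids may differ.
LABEL2ID = {"not_spam": 0, "spam": 1}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_labels(rows):
    """Return copies of the rows with the string label replaced by its integer id."""
    return [{**row, "label": LABEL2ID[row["label"]]} for row in rows]


# Made-up rows standing in for dataset records.
rows = [
    {"text": "Limited offer, click to claim", "label": "spam"},
    {"text": "Lunch at 12:30?", "label": "not_spam"},
]
encoded = encode_labels(rows)
print([row["label"] for row in encoded])
```

Keeping the inverse mapping (`ID2LABEL`) around makes it easy to turn model predictions back into readable labels.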
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset can serve as a golden example set for few-shot prompts to LLM assistants, and for evaluation and validation.
[More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/vijayagrawal/spam-email-classification.
This dataset was created by LuccaSilvaMedeiros
This dataset was created by Akalya Subramanian
Text messaging is widely regarded as one of the most frequently used methods of text communication among mobile phone users. The affordability and widespread use of texting make it an attractive option for advertisers and spammers. Unfortunately, this has resulted in a large volume of spam messages received by users, lowering their overall satisfaction with mobile phone usage. A significant obstacle in the field of Persian SMS spam classification and removal is the lack of a sufficient and diverse database of both legitimate (ham) and spam SMS messages in Persian. To solve the issue of the limited database of ham and spam Persian SMS messages, in this research a comprehensive database was established by gathering 4,389 text messages from various sources and labeling them.
Salari, Mohammad Hossein and Shayegan, Mohammad Amin, 1402 (Solar Hijri calendar), Improving spam SMS detection in Persian text messages by presenting a comprehensive database, First National Conference on Data Analysis, Yasuj, https://civilica.com/doc/1670746
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is taken from SpamAssassin's public corpus website, which provides many examples of both spam and ham emails. Each email is stored along with a label, where 0 is ham and 1 is spam. A single .csv file contains all instances.
Feel free to share any suggestions and feedback.
Upvotes are appreciated.
https://creativecommons.org/publicdomain/zero/1.0/
The SMS Spam Collection v.1 is a set of SMS messages that have been collected and labeled as either spam or not spam. This dataset contains 5574 real, non-encoded messages in English. The SMS messages are thought-provoking and eye-catching. The dataset is useful for mobile phone spam research.
- This dataset could be used to train a machine learning model to classify SMS messages as spam or not spam.
- This dataset could be used to develop a tool that can automatically identify and block spam messages.
- This dataset could be used to study the characteristics of spam messages and develop strategies for identifying and avoiding them.
_This dataset is used to train a machine learning model to classify SMS messages as spam or not spam.
The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. This dataset contains 5574 real, non-encoded messages in English, tagged as either legitimate (ham) or spam. The dataset was collected from various sources and is released under the CC BY-SA 4.0 license by Kaggle user Almeida et al._
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
Column name | Description |
---|---|
sms | The text of the SMS message. (String) |
label | The label for the SMS message, indicating whether it is ham or spam. (String) |
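To make the "train a model on `sms` and `label`" use case concrete, here is a minimal multinomial naive Bayes sketch with add-one smoothing, written in pure Python. It is one simple baseline among many, not the method used by the dataset authors, and the four training rows are made up rather than taken from train.csv.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(rows):
    """Collect per-label document and word counts from dicts with 'sms' and 'label'."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for row in rows:
        label_counts[row["label"]] += 1
        for tok in tokenize(row["sms"]):
            word_counts[row["label"]][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up rows in the train.csv column layout.
toy_rows = [
    {"sms": "Free entry to win a cash prize, text now", "label": "spam"},
    {"sms": "Urgent! You have won a free holiday, call now", "label": "spam"},
    {"sms": "Are you coming home for dinner tonight", "label": "ham"},
    {"sms": "I will call you after the meeting", "label": "ham"},
]
model = train(toy_rows)
print(predict(model, "win a free prize now"))
print(predict(model, "see you at dinner tonight"))
```

On real data the rows would come from reading train.csv (for example with `csv.DictReader`), and a held-out split would be needed to judge how well the classifier generalises.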