82 datasets found
  1. spam-classify

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Trevor (2024). spam-classify [Dataset]. https://huggingface.co/datasets/mltrev23/spam-classify
    Dataset updated
    Sep 12, 2024
    Authors
    Trevor
    Description

    Spam Classification Dataset

      Overview
    

    The Spam Classification Dataset contains a collection of SMS messages labeled as either "spam" or "ham" (non-spam). This dataset is designed for binary text classification tasks, where the goal is to classify an SMS message as either spam or non-spam based on its content.

      Dataset Structure
    

    The dataset is provided as a single CSV file named spam.csv. It contains 5,572 entries, with each entry corresponding to an SMS message.… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/spam-classify.

  2. Spam SMS Classification Dataset

    • gts.ai
    json
    Updated Oct 10, 2024
    Cite
    GTS (2024). Spam SMS Classification Dataset [Dataset]. https://gts.ai/dataset-download/spam-sms-classification-dataset/
    Available download formats: json
    Dataset updated
    Oct 10, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Explore our comprehensive Spam SMS Classification Dataset designed for NLP and machine learning research.

  3. Spam email classification

    • kaggle.com
    Updated Sep 21, 2023
    Cite
    Yousef Mohamed (2023). Spam email classification [Dataset]. https://www.kaggle.com/datasets/yousefmohamed20/spam-email-detection
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    Kaggle
    Authors
    Yousef Mohamed
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is a CSV file containing information on 5157 randomly picked email files and their respective labels for spam or not-spam classification. The file has 5157 rows, one per email, and 2 columns: the first column gives the email category (spam or ham), and the second column gives the email that was sent.
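
The two-column layout described above could be read with Python's standard `csv` module, roughly as sketched here. The header names `Category` and `Message` and the three sample rows are assumptions made for illustration; the listing does not state the real header names.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the real file. The header names
# "Category" and "Message" are assumptions, not confirmed by the listing.
SAMPLE = """Category,Message
ham,"See you at lunch, ok?"
spam,Win a FREE prize now!
ham,Meeting moved to 3pm
"""

def load_rows(handle):
    """Return (label, text) pairs from a two-column CSV with a header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return [(row[0].strip().lower(), row[1]) for row in reader if len(row) >= 2]

rows = load_rows(io.StringIO(SAMPLE))
counts = Counter(label for label, _ in rows)
print(counts)
```

For the real file, `io.StringIO(SAMPLE)` would be swapped for an opened file handle, for example `open(path, newline="", encoding="utf-8")`.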

  4. spam-classification-new

    • huggingface.co
    Updated Oct 1, 2024
    Cite
    Aniket (2024). spam-classification-new [Dataset]. https://huggingface.co/datasets/Anik3t/spam-classification-new
    Dataset updated
    Oct 1, 2024
    Authors
    Aniket
    Description

    Anik3t/spam-classification-new dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. email-spam-classification

    • huggingface.co
    Cite
    Unique Data, email-spam-classification [Dataset]. https://huggingface.co/datasets/UniqueData/email-spam-classification
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Email Spam Classification

    The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems. The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/email-spam-classification.

  6. Email Spam Classification : Cleaned & Feature-Rich

    • kaggle.com
    Updated Jul 25, 2025
    Cite
    Gaurav Kumar (2025). Email Spam Classification : Cleaned & Feature-Rich [Dataset]. https://www.kaggle.com/datasets/gauravkumar2525/email-spam-classification-cleaned-and-feature-rich
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Kumar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📘 ABOUT

    The Enhanced Email Spam Detection Dataset contains a diverse collection of 10,000 email samples enriched with metadata and linguistic features to support binary spam classification. This cleaned and feature-engineered dataset is ideal for building machine learning models, conducting exploratory data analysis, and developing NLP-based spam filters.

    It is especially useful for data scientists, security researchers, and NLP practitioners interested in understanding spam patterns and creating robust email classification systems.

    Key Features of the Dataset:

    ✅ Balanced and labeled dataset for spam (1) and non-spam (0) emails
    ✅ Cleaned structure with no missing values
    ✅ Includes sender domain, subject lines, and content-based metrics
    ✅ Engineered features like punctuation ratio, word length, and spam word flags
    ✅ Suitable for binary classification, model benchmarking, and text pattern analysis

    This dataset provides a strong foundation for spam detection models, enabling pattern discovery across various email features such as urgency cues, promotional language, and sender behavior.

    📂 FILE INFORMATION

    This dataset consists of 10,000 structured email records with various derived features that quantify the likelihood of spam. It has been enhanced by extracting numeric and textual indicators from the email content and headers.

    File Type: CSV
    Data Rows: 10,000
    Data Fields: Email metadata, content metrics, keyword flags, and label column

    📊 COLUMNS DESCRIPTION

    | Column Name | Description |
    |:------------|:------------|
    | id | Unique identifier for each email. |
    | label | Binary label (1 = spam, 0 = not spam). |
    | subject | Subject line of the email. |
    | sender_domain | Domain of the email sender. |
    | has_url | Indicates if the email contains a URL. |
    | email_length | Number of characters in the email. |
    | word_count | Total word count in the email. |
    | char_count | Total character count excluding spaces. |
    | digit_count | Number of numeric digits in the email. |
    | uppercase_words | Number of fully uppercase words. |
    | exclamations | Number of exclamation marks used. |
    | avg_word_length | Average length of words in the email. |
    | punc_ratio | Ratio of punctuation marks to characters. |
    | has_noreply | Indicates presence of 'noreply' in sender or text. |
    | has_free | Binary flag for the word "free". |
    | has_win | Binary flag for the word "win". |
    | has_winner | Binary flag for the word "winner". |
    | has_click | Binary flag for the word "click". |
    | has_offer | Binary flag for promotional "offer". |
    | has_urgent | Binary flag for urgency words like "urgent". |
    | has_limited | Indicates limited-time offers or phrases. |
    | has_buy | Binary flag for commercial intent ("buy"). |
    | has_now | Indicates time-sensitive prompts ("now"). |
    | has_money | Binary flag for monetary terms ("money"). |

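
A few of the engineered columns listed above could be derived from raw email text roughly as follows. This is a minimal sketch: the exact definitions the dataset author used are not given in the listing, so the tokenization (whitespace split), the URL pattern, and the punctuation set are all assumptions.

```python
import re
import string

def extract_features(text: str) -> dict:
    """Compute a handful of content metrics; definitions here are assumed."""
    words = text.split()  # assumption: words are whitespace-separated tokens
    punctuation = sum(ch in string.punctuation for ch in text)
    return {
        "has_url": int(bool(re.search(r"https?://|www\.", text, re.IGNORECASE))),
        "email_length": len(text),
        "word_count": len(words),
        "digit_count": sum(ch.isdigit() for ch in text),
        # count tokens made only of letters that are entirely uppercase
        "uppercase_words": sum(1 for w in words if w.isalpha() and w.isupper()),
        "exclamations": text.count("!"),
        # guard against division by zero for empty text
        "punc_ratio": punctuation / len(text) if text else 0.0,
        "has_free": int(bool(re.search(r"\bfree\b", text, re.IGNORECASE))),
    }

features = extract_features("FREE offer!!! Click http://example.com now, win 100")
print(features)
```

The remaining keyword flags (`has_win`, `has_click`, and so on) would follow the same word-boundary pattern as `has_free`.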
  7. Spam Emails

    • kaggle.com
    Updated Dec 20, 2024
    Cite
    Noey (2024). Spam Emails [Dataset]. https://www.kaggle.com/datasets/noeyislearning/spam-emails
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Kaggle
    Authors
    Noey
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The email spam dataset is a collection of email messages labeled as either spam or non-spam (ham). It includes the full text of each email along with its classification, making it a valuable resource for developing and testing spam detection algorithms. The dataset is designed for researchers and data scientists working on natural language processing (NLP), machine learning, and email filtering systems.

    Key Features

    • Email Text: Full content of each email message.
    • Spam Classification: Binary label indicating whether the email is spam (1) or not (0).
    • Large Dataset: Contains 5,728 observations for robust analysis.
    • Tab-Delimited Format: Easy to import and process in various data analysis tools.
    • Diverse Content: Emails cover a wide range of topics and styles, reflecting real-world spam and non-spam examples.
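
A tab-delimited file with a text column and a 1/0 spam label, as described above, could be parsed along these lines. The column names `text` and `spam` and the two sample rows are hypothetical; the listing does not name the columns.

```python
import csv
import io

# Hypothetical two-row sample; the column names "text" and "spam" are assumed.
SAMPLE = (
    "text\tspam\n"
    "Subject: cheap meds, buy now\t1\n"
    "Subject: minutes from Monday\t0\n"
)

def read_tsv(handle):
    """Return (text, label) pairs, with the label converted to an int."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["text"], int(row["spam"])) for row in reader]

records = read_tsv(io.StringIO(SAMPLE))
spam_share = sum(label for _, label in records) / len(records)
print(f"{len(records)} records, spam share {spam_share:.0%}")
```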
  8. Prioritization

    • ieee-dataport.org
    Updated May 23, 2023
    Cite
    Sai Shibu (2023). Prioritization [Dataset]. https://ieee-dataport.org/documents/v2x-message-classification-prioritization-and-spam-detection-dataset
    Dataset updated
    May 23, 2023
    Authors
    Sai Shibu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    medium

  9. sms-otp-spam-dataset

    • huggingface.co
    Cite
    Alessandro Lusci, sms-otp-spam-dataset [Dataset]. https://huggingface.co/datasets/alusci/sms-otp-spam-dataset
    Authors
    Alessandro Lusci
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📲 SMS OTP Spam Dataset

    A synthetic dataset of 10,000 OTP-style SMS messages for spam classification tasks. The dataset includes both valid and spam-like messages, with labels for message validity and delivery status.

      📊 Dataset Summary
    

    Total samples: 10,000
    Valid messages: 90%
    Not valid (spam-like): 10%
    Status types: delivered, failed, spam, bounced, expired

    Each entry includes:

    phone_id: Synthetic phone number
    sms_text: Message content
    label: valid or not valid…

    See the full description on the dataset page: https://huggingface.co/datasets/alusci/sms-otp-spam-dataset.
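
The 90% / 10% split quoted in the summary implies a strong class imbalance, which is worth quantifying before training. The short calculation below uses only the figures stated above; the "always predict valid" baseline is an illustrative comparison point, not something the dataset page describes.

```python
# Figures taken from the dataset summary above.
total = 10_000
valid = total * 90 // 100        # 90% valid messages
not_valid = total - valid        # remaining 10% spam-like messages

# A classifier that always predicts the majority class ("valid") reaches
# this accuracy without learning anything, so plain accuracy is a weak metric.
baseline_accuracy = max(valid, not_valid) / total

print(valid, not_valid, baseline_accuracy)
```

On data this skewed, per-class precision and recall for the "not valid" class tend to be more informative than overall accuracy.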

  10. Email.cz image spam dataset v1

    • academictorrents.com
    bittorrent
    Updated Dec 30, 2019
    Cite
    Vit Listik (2019). Email.cz image spam dataset v1 [Dataset]. https://academictorrents.com/details/06f2389082e9c034fa4a73aaee00131a27e388b6
    Available download formats: bittorrent (2660566545)
    Dataset updated
    Dec 30, 2019
    Dataset authored and provided by
    Vit Listik
    License

    https://academictorrents.com/nolicensespecified

    Description

    The problem with email image spam classification is known from the year 2005. There are several approaches to this task. Lately, those approaches use convolutional neural networks (CNN). We propose a novel approach to the image spam classification task. Our approach is based on CNN and transfer learning, namely Resnet v1 used for semantic feature extraction and one layer Feedforward Neural Network for classification. We have shown that this approach can achieve state-of-the-art performance on publicly available datasets. 99% F1-score on two datasets [dredze 2007, Princeton] and 96% F1-score on the combination of these datasets. Due to the availability of GPUs, this approach may be used for just-in-time classification in anti-spam systems handling huge amounts of emails. We have observed also that mentioned publicly available datasets are no longer representative. We overcame this limitation by using a much richer dataset from a one-week long real traffic of the freemail provider Email.

  11. Spam Email Classification Dataset

    • kaggle.com
    Updated Nov 6, 2023
    Cite
    Puru Singhvi (2023). Spam Email Classification Dataset [Dataset]. https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset/versions/1
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Puru Singhvi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Introduction

    This is a csv file containing 83446 records of email which are labelled as either spam or not-spam. It is formed by combining the 2007 TREC Public Spam Corpus and Enron-Spam Dataset.

    Columns

    1. label
      • '1' indicates that the email is classified as spam.
      • '0' denotes that the email is legitimate (ham).
    2. text
      • This column contains the actual content of the email messages.

    Sources

    1. 2007 TREC Public Spam Corpus
    2. Enron-Spam Dataset

    Code for combining and processing the two datasets: https://github.com/PuruSinghvi/Spam-Email-Classifier/blob/main/Combining%20Datasets.ipynb

    Spam Email Classifier

    A spam email classifier has been trained and built using this dataset.
    It can be found here: https://github.com/PuruSinghvi/Spam-Email-Classifier

  12. spam-detection-dataset-splits

    • huggingface.co
    Updated Nov 30, 2023
    Cite
    Tan Quang DUONG (2023). spam-detection-dataset-splits [Dataset]. https://huggingface.co/datasets/tanquangduong/spam-detection-dataset-splits
    Dataset updated
    Nov 30, 2023
    Authors
    Tan Quang DUONG
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Spam Detection Dataset

    This is the dataset for spam classification task. It contains:

    'train' subset with 8175 samples
    'validation' subset with 1362 samples
    'test' subset with 1636 samples

      Source and modifications
    

    This dataset is cloned from Deysi/spam-detection-dataset with the following added processing:

    Convert 'string' labels to 'id' labels so the data can be used and trained directly with transformer's trainer
    Split the original 'test' dataset (2725 samples) into 2…

    See the full description on the dataset page: https://huggingface.co/datasets/tanquangduong/spam-detection-dataset-splits.
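
The relative size of each subset follows directly from the sample counts listed for this dataset. The snippet below is a small worked calculation using only those three figures.

```python
# Subset sizes as stated in the dataset description.
SPLITS = {"train": 8175, "validation": 1362, "test": 1636}

total = sum(SPLITS.values())
fractions = {name: count / total for name, count in SPLITS.items()}

for name, frac in fractions.items():
    print(f"{name}: {SPLITS[name]} samples ({frac:.1%})")
```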

  13. Spam Dataset - Dataset - LDM

    • service.tib.eu
    Updated Jan 2, 2025
    Cite
    (2025). Spam Dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/spam-dataset
    Dataset updated
    Jan 2, 2025
    Description

    The spam dataset is a dataset used for spam classification.

  14. spam-email-classification

    • huggingface.co
    Updated Oct 11, 2024
    Cite
    Vijay Agrawal (2024). spam-email-classification [Dataset]. https://huggingface.co/datasets/vijayagrawal/spam-email-classification
    Dataset updated
    Oct 11, 2024
    Authors
    Vijay Agrawal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset can serve as an example and golden example dataset for LLM assistant few shot prompts and for evaluation and validation.

    See the full description on the dataset page: https://huggingface.co/datasets/vijayagrawal/spam-email-classification.

  15. SPAM - Classification - UFMG - cybersecurity

    • kaggle.com
    Updated Sep 19, 2023
    Cite
    LuccaSilvaMedeiros (2023). SPAM - Classification - UFMG - cybersecurity [Dataset]. https://www.kaggle.com/datasets/luccasilvamedeiros/spam-classification-ufmg-cybersecurity/data
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    LuccaSilvaMedeiros
    Description

    Dataset

    This dataset was created by LuccaSilvaMedeiros

    Contents

  16. spambase - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). spambase - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/spambase
    Dataset updated
    Dec 16, 2024
    Description

    The dataset is a spam classification dataset containing 4,600 emails labeled as spam or not.

  17. Dataset for Email Spam Classification (NLP)

    • kaggle.com
    Updated Apr 9, 2021
    Cite
    Akalya Subramanian (2021). Dataset for Email Spam Classification (NLP) [Dataset]. https://www.kaggle.com/akalyasubramanian/dataset-for-email-spam-classification-nlp/discussion
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akalya Subramanian
    Description

    Dataset

    This dataset was created by Akalya Subramanian

    Contents

  18. PSSC dataset: Improving spam detection in Persian SMS by providing a...

    • zenodo.org
    Updated Nov 24, 2023
    Cite
    Mohammadhossein Salari; Mohammad Amin Shayegan (2023). PSSC dataset: Improving spam detection in Persian SMS by providing a comprehensive dataset [Dataset]. http://doi.org/10.5281/zenodo.7832188
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mohammadhossein Salari; Mohammad Amin Shayegan
    Description

    Text messaging is widely regarded as one of the most frequently used methods of text communication among mobile phone users. The affordability and widespread use of texting make it an attractive option for advertisers and spammers. Unfortunately, this has resulted in a large volume of spam messages received by users, lowering their overall satisfaction with mobile phone usage. A significant obstacle in the field of Persian SMS spam classification and removal is the lack of a sufficient and diverse database of both legitimate (ham) and spam SMS messages in Persian. To solve the issue of the limited database of ham and spam Persian SMS messages, in this research a comprehensive database was established by gathering 4,389 text messages from various sources and labeling them.


    Salari, Mohammadhossein and Shayegan, Mohammad Amin, 1402 (Solar Hijri), Improving spam SMS detection in Persian text messages by providing a comprehensive database, First National Conference on Data Analysis, Yasuj, https://civilica.com/doc/1670746

  19. Spam and Ham Email Classification Dataset

    • kaggle.com
    Updated Jun 28, 2025
    Cite
    sai_ash (2025). Spam and Ham Email Classification Dataset [Dataset]. https://www.kaggle.com/datasets/saiash/spam-and-ham-email-classification-dataset/discussion?sort=undefined
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sai_ash
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is taken from SpamAssassin's public corpus website, which has many email instances of both spam and ham. Each email is paired with a label, where 0 is ham and 1 is spam. There is just one .csv file containing all instances.
    Feel free to drop any suggestions and feedback.
    Upvotes are appreciated.

  20. SMS Spam Collection (Text Classification)

    • kaggle.com
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). SMS Spam Collection (Text Classification) [Dataset]. https://www.kaggle.com/datasets/thedevastator/sms-spam-collection-a-more-diverse-dataset/code
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SMS Spam Collection (Text Classification)

    SMS labeled messages that have been collected for mobile phone spam research

    Source

    Huggingface Hub: link

    About this dataset

    The SMS Spam Collection v.1 is a set of SMS messages that have been collected and labeled as either spam or not spam. This dataset contains 5574 English, real, and non-encoded messages. The SMS messages are thought-provoking and eye-catching. The dataset is useful for mobile phone spam research

    How to use the dataset

    Research Ideas

    • This dataset could be used to train a machine learning model to classify SMS messages as spam or not spam.
    • This dataset could be used to develop a tool that can automatically identify and block spam messages.
    • This dataset could be used to study the characteristics of spam messages and develop strategies for identifying and avoiding them

    Acknowledgements

    _This dataset is used to train a machine learning model to classify SMS messages as spam or not spam.

    The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam research. This dataset contains 5574 English, real, and non-encoded messages, tagged as being legitimate (ham) or spam. The dataset has been collected from various sources and is released under the CC BY-SA 4.0 license by Kaggle user Almeida et al._

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | sms | The text of the SMS message. (String) |
    | label | The label for the SMS message, indicating whether it is ham or spam. (String) |
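
Because the `label` column is described as a string, most training code would first map it to an integer id. The sketch below assumes the label strings are exactly "ham" and "spam" (after trimming and lowercasing) and that rows arrive as dictionaries keyed by the two column names; both points are assumptions, and the sample rows are invented for illustration.

```python
# Assumed mapping; the listing only says the label indicates ham or spam.
LABEL_TO_ID = {"ham": 0, "spam": 1}

def encode_labels(rows):
    """Turn {'sms': ..., 'label': ...} rows into (text, int_label) pairs."""
    encoded = []
    for row in rows:
        label = row["label"].strip().lower()
        if label not in LABEL_TO_ID:
            # fail loudly rather than silently mislabel an unexpected value
            raise ValueError(f"unexpected label: {row['label']!r}")
        encoded.append((row["sms"], LABEL_TO_ID[label]))
    return encoded

# Hypothetical rows for demonstration only.
sample_rows = [
    {"sms": "Are we still on for tonight?", "label": "ham"},
    {"sms": "You have WON a prize, call now", "label": " Spam "},
]
pairs = encode_labels(sample_rows)
print(pairs)
```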

spam-classify (mltrev23/spam-classify): 61 scholarly articles cite this dataset.