63 datasets found
  1. Data from: Spam email Dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    _w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    _w1998
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Dataset Name: Spam Email Dataset

    Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

    Columns:

    text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

    spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.

    Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
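
As a minimal sketch of how the two columns described above might be read, the following counts rows per label using only the standard library. The sample rows are hypothetical placeholders, not records from the dataset, and the column names `text` and `spam_or_not` are taken from the description above rather than verified against the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real CSV file.
SAMPLE_CSV = """text,spam_or_not
"Subject: win a free prize now",1
"Subject: meeting notes attached",0
"Subject: cheap loans, click here",1
"""


def label_counts(csv_file):
    """Count rows per label; "1" means spam and "0" means not spam."""
    reader = csv.DictReader(csv_file)
    return Counter(row["spam_or_not"] for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
print(counts)
```

Checking the label balance like this before training is a common first step, since a heavily skewed split changes which evaluation metrics are meaningful.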

  2. spam-classify

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Trevor (2024). spam-classify [Dataset]. https://huggingface.co/datasets/mltrev23/spam-classify
    Dataset updated
    Sep 12, 2024
    Authors
    Trevor
    Description

    Spam Classification Dataset

    Overview

    The Spam Classification Dataset contains a collection of SMS messages labeled as either "spam" or "ham" (non-spam). This dataset is designed for binary text classification tasks, where the goal is to classify an SMS message as either spam or non-spam based on its content.

    Dataset Structure

    The dataset is provided as a single CSV file named spam.csv. It contains 5,572 entries, with each entry corresponding to an SMS message.… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/spam-classify.

  3. SMS Spam Dataset

    • kaggle.com
    Updated Aug 1, 2023
    Cite
    KUCEV ROMAN (2023). SMS Spam Dataset [Dataset]. https://www.kaggle.com/datasets/tapakah68/spam-text-messages-dataset/suggestions
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    KUCEV ROMAN
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Spam Text Messages Dataset

    The SMS spam dataset contains a collection of text messages. The dataset includes a diverse range of spam messages, including promotional offers, fraudulent schemes, phishing attempts, and other forms of unsolicited communication.

    💴 For commercial usage: to discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

    Each SMS message is represented as a string of text, and each entry in the dataset also has a link to the corresponding screenshot. The dataset's content represents real-life examples of spam messages that users encounter in their everyday communication.

    The dataset's possible applications:

    • spam detection
    • fraud detection
    • customer support automation
    • trend and sentiment analysis
    • educational purposes
    • network security


    Content

    • images: includes screenshots of spam messages
    • .csv file: contains information about the dataset

    File with the extension .csv

    includes the following information:

    • image: link to the screenshot with the spam message,
    • text: text of the spam message

    Spam messages might be collected in accordance with your requirements.

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset.

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: sms spam collection, labeled messages, mobile phone spam, spam sms dataset, sms spam classification, spam or not-spam, spam sms database, spam detection system, sma spamming data set, spam filtering system, spambase, feature extraction, spam ham email dataset, classifier, machine learning algorithms, cybersecurity, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data

  4. Spam SMS Classification Dataset

    • gts.ai
    json
    Updated Oct 10, 2024
    Cite
    GTS (2024). Spam SMS Classification Dataset [Dataset]. https://gts.ai/dataset-download/spam-sms-classification-dataset/
    Available download formats
    json
    Dataset updated
    Oct 10, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Explore our comprehensive Spam SMS Classification Dataset designed for NLP and machine learning research.

  5. spam-classification-new

    • huggingface.co
    Updated Oct 1, 2024
    + more versions
    Cite
    Aniket (2024). spam-classification-new [Dataset]. https://huggingface.co/datasets/Anik3t/spam-classification-new
    Dataset updated
    Oct 1, 2024
    Authors
    Aniket
    Description

    Anik3t/spam-classification-new dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. SMS Spam Collection Data Set Dataset

    • paperswithcode.com
    Updated Mar 13, 2022
    + more versions
    Cite
    (2022). SMS Spam Collection Data Set Dataset [Dataset]. https://paperswithcode.com/dataset/sms-spam-collection-data-set
    Dataset updated
    Mar 13, 2022
    Description

    This corpus has been collected from free or free-for-research sources on the Internet:

    • A collection of 425 SMS spam messages manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Identifying the text of spam messages in the claims is a hard and time-consuming task that involved carefully scanning hundreds of web pages.
    • A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions were going to be made publicly available.
    • A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis.
    • The SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages.
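
The per-source counts quoted above can be added up as a quick consistency check. This is only an arithmetic sketch over the numbers in the description; the dictionary keys are informal labels chosen here for illustration.

```python
# Message counts per source, as quoted in the description above.
sources = {
    "Grumbletext": {"spam": 425, "ham": 0},
    "NUS SMS Corpus subset": {"spam": 0, "ham": 3375},
    "PhD thesis ham list": {"spam": 0, "ham": 450},
    "SMS Spam Corpus v.0.1 Big": {"spam": 322, "ham": 1002},
}

total_spam = sum(counts["spam"] for counts in sources.values())
total_ham = sum(counts["ham"] for counts in sources.values())
total = total_spam + total_ham

print(f"spam={total_spam} ham={total_ham} total={total}")
print(f"spam share: {total_spam / total:.1%}")
```

The total of 5,574 messages agrees with the size given for the SMS Spam Collection v.1 elsewhere in this listing, and the spam share comes out at roughly 13%, so the collection is noticeably imbalanced toward ham.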

  7. sms-otp-spam-dataset

    • huggingface.co
    Cite
    Alessandro Lusci, sms-otp-spam-dataset [Dataset]. https://huggingface.co/datasets/alusci/sms-otp-spam-dataset
    Authors
    Alessandro Lusci
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📲 SMS OTP Spam Dataset

    A synthetic dataset of 10,000 OTP-style SMS messages for spam classification tasks. The dataset includes both valid and spam-like messages, with labels for message validity and delivery status.

    📊 Dataset Summary

    • Total samples: 10,000
    • Valid messages: 90%
    • Not valid (spam-like): 10%
    • Status types: delivered, failed, spam, bounced, expired

    Each entry includes:

    • phone_id: Synthetic phone number
    • sms_text: Message content
    • label: valid or not valid…

    See the full description on the dataset page: https://huggingface.co/datasets/alusci/sms-otp-spam-dataset.

  8. Spam classification

    • kaggle.com
    Updated Jan 8, 2024
    Cite
    SSWssw (2024). Spam classification [Dataset]. https://www.kaggle.com/datasets/sswssw/spam-classification
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SSWssw
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Helloquant

    Released under Apache 2.0

    Contents

  9. Prioritization

    • ieee-dataport.org
    Updated May 23, 2023
    Cite
    Sai Shibu (2023). Prioritization [Dataset]. https://ieee-dataport.org/documents/v2x-message-classification-prioritization-and-spam-detection-dataset
    Dataset updated
    May 23, 2023
    Authors
    Sai Shibu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    medium

  10. Email.cz image spam dataset v1

    • academictorrents.com
    bittorrent
    Updated Dec 30, 2019
    Cite
    Vit Listik (2019). Email.cz image spam dataset v1 [Dataset]. https://academictorrents.com/details/06f2389082e9c034fa4a73aaee00131a27e388b6
    Available download formats
    bittorrent (2660566545)
    Dataset updated
    Dec 30, 2019
    Dataset authored and provided by
    Vit Listik
    License

    https://academictorrents.com/nolicensespecified

    Description

    The problem of email image spam classification has been known since 2005. There are several approaches to this task; lately, these approaches use convolutional neural networks (CNNs). We propose a novel approach to the image spam classification task. Our approach is based on a CNN and transfer learning, namely ResNet v1 used for semantic feature extraction and a one-layer feedforward neural network for classification. We have shown that this approach can achieve state-of-the-art performance on publicly available datasets: a 99% F1-score on two datasets [dredze 2007, Princeton] and a 96% F1-score on the combination of these datasets. Due to the availability of GPUs, this approach may be used for just-in-time classification in anti-spam systems handling huge amounts of email. We have also observed that the mentioned publicly available datasets are no longer representative. We overcame this limitation by using a much richer dataset from one week of real traffic of the freemail provider Email.

  11. Email Spam Classification from SHANTANU DHAKAD

    • kaggle.com
    Updated Oct 9, 2022
    Cite
    Neil David (2022). Email Spam Classification from SHANTANU DHAKAD [Dataset]. https://www.kaggle.com/neildavid/email-spam-classification-from-shantanu-dhakad/discussion
    Dataset updated
    Oct 9, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Neil David
    Description

    Dataset

    This dataset was created by Neil David

    Contents

  12. spam-detection-dataset-splits

    • huggingface.co
    Updated Nov 30, 2023
    Cite
    Tan Quang DUONG (2023). spam-detection-dataset-splits [Dataset]. https://huggingface.co/datasets/tanquangduong/spam-detection-dataset-splits
    Dataset updated
    Nov 30, 2023
    Authors
    Tan Quang DUONG
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Spam Detection Dataset

    This is a dataset for the spam classification task. It contains:

    • 'train' subset with 8175 samples
    • 'validation' subset with 1362 samples
    • 'test' subset with 1636 samples

    Source and modifications

    This dataset is cloned from Deysi/spam-detection-dataset with the following added processing:

    • Convert the 'string' label to an 'id' label so the data can be used for training directly with transformer's trainer
    • Split the original 'test' dataset (2725 samples) into 2…

    See the full description on the dataset page: https://huggingface.co/datasets/tanquangduong/spam-detection-dataset-splits.

  13. SMS Spam Collection (Text Classification)

    • kaggle.com
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). SMS Spam Collection (Text Classification) [Dataset]. https://www.kaggle.com/datasets/thedevastator/sms-spam-collection-a-more-diverse-dataset/discussion
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SMS Spam Collection (Text Classification)

    SMS labeled messages that have been collected for mobile phone spam research

    Source

    Huggingface Hub: link

    About this dataset

    The SMS Spam Collection v.1 is a set of SMS messages that have been collected and labeled as either spam or not spam. This dataset contains 5574 English, real, and non-encoded messages. The SMS messages are thought-provoking and eye-catching. The dataset is useful for mobile phone spam research

    How to use the dataset

    Research Ideas

    • This dataset could be used to train a machine learning model to classify SMS messages as spam or not spam.
    • This dataset could be used to develop a tool that can automatically identify and block spam messages.
    • This dataset could be used to study the characteristics of spam messages and develop strategies for identifying and avoiding them

    Acknowledgements

    _This dataset is used to train a machine learning model to classify SMS messages as spam or not spam.

    The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam research. This dataset contains 5574 English, real, and non-encoded messages, tagged as being legitimate (ham) or spam. The dataset has been collected from various sources and is released under the CC BY-SA 4.0 license by Kaggle user Almeida et al._

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | sms | The text of the SMS message. (String) |
    | label | The label for the SMS message, indicating whether it is ham or spam. (String) |

  14. Spam Classification for Basic NLP

    • opendatabay.com
    .csv
    Updated Jun 12, 2025
    Cite
    Datasimple (2025). Spam Classification for Basic NLP [Dataset]. https://www.opendatabay.com/data/ai-ml/f19689b1-9c34-4cdd-a05c-703242085c7a
    Available download formats
    .csv
    Dataset updated
    Jun 12, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This data consists of raw mail messages suitable for NLP pre-processing steps such as tokenizing, removing stop words, stemming, and parsing HTML tags. These steps are important for anyone entering the NLP world. The dataset also goes hand in hand with NLP tools such as vectorizers.

    Original Data Source: Spam Classification for Basic NLP

  15. Spam Detection Dataset

    • kaggle.com
    Updated Apr 12, 2025
    Cite
    AJ (2025). Spam Detection Dataset [Dataset]. https://www.kaggle.com/datasets/smayanj/spam-detection-dataset
    Dataset updated
    Apr 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AJ
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a synthetic dataset for training and testing spam detection models. It contains 20,000 email samples, and each sample is described by five features and one label.

    Features:

    1. num_links

      • Type: Integer
      • Meaning: Number of links present in the email body
      • Generated using a Poisson distribution with an average (λ) of 1.5
      • Assumption: More links often mean higher chances of spam
    2. num_words

      • Type: Integer
      • Meaning: Total number of words in the email
      • Randomly picked between 20 and 200
      • Assumption: Short or overly long emails might look suspicious, but this is more of a neutral feature
    3. has_offer

      • Type: Binary (0 or 1)
      • Meaning: Whether the email contains the word “offer”
      • Simulated using a binomial distribution (30% chance of being 1)
      • Assumption: Marketing language like “offer” is common in spam
    4. sender_score

      • Type: Float between 0 and 1
      • Meaning: A simulated reputation score of the email sender
      • Normally distributed around 0.7, clipped to stay between 0 and 1
      • Assumption: A low sender score means the sender is less trustworthy (and more likely to send spam)
    5. all_caps

      • Type: Binary (0 or 1)
      • Meaning: Whether the subject line is written in ALL CAPS
      • Simulated with a 10% chance of being 1
      • Assumption: All-caps subject lines are usually attention-grabbing and common in spam

    Target:

    1. is_spam
      • Type: Binary (0 or 1)
      • Meaning: Whether the email is spam
      • Generated using a rule-based formula:
        • Spam probability increases if:
          • Links > 2
          • It contains an “offer”
          • Sender score < 0.4
          • Subject is in all caps
        • These factors are combined with different weights
        • A little noise is added using Gaussian randomness to simulate real-world uncertainty
        • Emails are labeled as spam if the final probability crosses 0.5

    Why this dataset is useful:

    • You can try binary classification algorithms like Logistic Regression, Decision Trees, Random Forests, or Neural Networks.
    • It's great for feature importance analysis—you can check which features most affect spam prediction.
    • You can test model robustness using noisy, rule-based labels.
    • Good for building and evaluating explainable AI models since the rules are known.
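
The generation procedure described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the author's actual generator: the description does not give the rule weights, the spread of `sender_score`, or the noise scale, so the values used below (weights 0.3/0.25/0.3/0.2, standard deviation 0.2, noise 0.05) are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def make_sample(rng):
    """Generate one synthetic email record following the described rules."""
    num_links = poisson(rng, 1.5)
    num_words = rng.randint(20, 200)
    has_offer = int(rng.random() < 0.3)
    # Mean 0.7 is from the description; the 0.2 spread is an assumption.
    sender_score = min(1.0, max(0.0, rng.gauss(0.7, 0.2)))
    all_caps = int(rng.random() < 0.1)
    # Hypothetical weights; the description only says they differ.
    prob = (
        0.3 * (num_links > 2)
        + 0.25 * has_offer
        + 0.3 * (sender_score < 0.4)
        + 0.2 * all_caps
    )
    prob += rng.gauss(0.0, 0.05)  # assumed noise scale
    return {
        "num_links": num_links,
        "num_words": num_words,
        "has_offer": has_offer,
        "sender_score": sender_score,
        "all_caps": all_caps,
        "is_spam": int(prob > 0.5),
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [make_sample(rng) for _ in range(20000)]
spam_rate = sum(s["is_spam"] for s in samples) / len(samples)
print(f"spam rate: {spam_rate:.3f}")
```

Because the labels come from a known rule plus noise, a model trained on such data can be compared against the rule itself, which is what makes this kind of dataset convenient for feature-importance and explainability exercises.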
  16. Spam Mail Prediction Dataset

    • opendatabay.com
    .csv
    Updated Jun 6, 2025
    Cite
    Datasimple (2025). Spam Mail Prediction Dataset [Dataset]. https://www.opendatabay.com/data/dataset/080d396c-0650-452b-9bef-d6bb3fa9366e
    Available download formats
    .csv
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Fraud Detection & Risk Management
    Description

    The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems.

    The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines, excessive use of advertisements, unauthorized links, or attempts to collect personal information.

    The non-spam emails in the dataset are genuine and legitimate messages sent by individuals or organizations. They may include personal or professional communication, newsletters, transaction receipts, or any other non-malicious content.

    The dataset encompasses emails of varying lengths, languages, and writing styles, reflecting the inherent heterogeneity of email communication. This diversity aids in training algorithms that can generalize well to different types of emails, making them robust against different spammer tactics and variations in non-spam email content.

    Original Data Source: Spam Mail Prediction Dataset

  17. CNN architectures.

    • figshare.com
    xls
    Updated Dec 14, 2023
    Cite
    Angom Buboo Singh; Khumanthem Manglem Singh (2023). CNN architectures. [Dataset]. http://doi.org/10.1371/journal.pone.0291037.t002
    Available download formats
    xls
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Angom Buboo Singh; Khumanthem Manglem Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Image spam is a type of spam that contains text information inserted in an image file. Traditional classification systems based on feature engineering require manual extraction of certain quantitative and qualitative image features for classification. However, these systems are often not robust to adversarial attacks. In contrast, classification pipelines that use convolutional neural network (CNN) models automatically extract features from images. This approach has been shown to achieve high accuracies even on challenge datasets that are designed to defeat the purpose of classification. We propose a method for improving the performance of CNN models for image spam classification. Our method uses the concept of error level analysis (ELA) as a pre-processing step. ELA is a technique for detecting image tampering by analyzing the error levels of the image pixels. We show that ELA can be used to improve the accuracy of CNN models for image spam classification, even on challenge datasets. Our results demonstrate that the application of ELA as a pre-processing technique in our proposed model can significantly improve the results of the classification tasks on image spam datasets.

  18. spam-email-classification

    • huggingface.co
    Updated Oct 11, 2024
    Cite
    Vijay Agrawal (2024). spam-email-classification [Dataset]. https://huggingface.co/datasets/vijayagrawal/spam-email-classification
    Dataset updated
    Oct 11, 2024
    Authors
    Vijay Agrawal
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset can serve as an example and golden example dataset for LLM assistant few shot prompts and for evaluation and validation.

    Dataset Details

    Dataset Sources [optional]

    • Repository: [More Information Needed]
    • Paper [optional]: [More Information Needed]
    • Demo [optional]: [More Information Needed]

    Uses

    Direct Use

    [More Information Needed]

    Out-of-Scope Use

    [More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/vijayagrawal/spam-email-classification.

  19. Data from: Spam E-mail Dataset

    • kaggle.com
    Updated Nov 12, 2023
    Cite
    Huỳnh Nguyên Phúc (2023). Spam E-mail Dataset [Dataset]. https://www.kaggle.com/datasets/hunhnguynphc/spam-e-mail-dataset
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Huỳnh Nguyên Phúc
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    1. Title: SPAM E-mail Database

    2. Sources:
      • Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt (Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304)
      • Donor: George Forman (gforman at nospam hpl.hp.com, 650-857-7835)
      • Generated: June-July 1999

    3. Past Usage:
      • Hewlett-Packard Internal-only Technical Report. External forthcoming.
      • Used to determine whether a given email is spam or not.
      • Approximately 7% misclassification error.
      • Emphasis on minimizing false positives (marking good mail as spam) due to their undesirability.
      • Even with the insistence on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.

    4. Relevant Information:
      • The concept of "spam" is diverse, encompassing advertisements, make-money-fast schemes, chain letters, and pornography.
      • Spam emails were collected from the postmaster and individuals who reported spam.
      • Non-spam emails were collected from work and personal sources, with the words 'george' and the area code '650' indicating non-spam.
      • Non-spam indicators like 'george' and '650' need to be handled carefully or require a broad collection of non-spam for a general-purpose spam filter.
      • Background information on spam: Cranor, Lorrie F., LaMacchia, Brian A. "Spam!" Communications of the ACM, 41(8):74-83, 1998.

    5. Number of Instances: 4601 (1813 Spam = 39.4%)

    6. Number of Attributes: 58 (57 continuous, 1 nominal class label)

    7. Attribute Information:
      • 48 continuous real [0,100] attributes of type word_freq_WORD: Percentage of words in the email that match the specified word.
      • 6 continuous real [0,100] attributes of type char_freq_CHAR: Percentage of characters in the email that match the specified character.
      • 1 continuous real [1,...] attribute of type capital_run_length_average: Average length of uninterrupted sequences of capital letters.
      • 1 continuous integer [1,...] attribute of type capital_run_length_longest: Length of the longest uninterrupted sequence of capital letters.
      • 1 continuous integer [1,...] attribute of type capital_run_length_total: Sum of the length of uninterrupted sequences of capital letters.
      • 1 nominal {0,1} class attribute of type spam: Denotes whether the email was considered spam (1) or not (0), i.e., unsolicited commercial e-mail.

    8. Missing Attribute Values: None

    9. Class Distribution:
      • Spam: 1813 (39.4%)
      • Non-Spam: 2788 (60.6%)
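
The attribute and instance counts quoted in items 5 to 9 can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers stated above; the group labels are informal names chosen here, not column names from the file.

```python
# Attribute groups and their sizes, as listed under "Attribute Information".
attribute_groups = {
    "word_freq_WORD": 48,
    "char_freq_CHAR": 6,
    "capital_run_length_average": 1,
    "capital_run_length_longest": 1,
    "capital_run_length_total": 1,
    "spam (class label)": 1,
}
total_attributes = sum(attribute_groups.values())
continuous_attributes = total_attributes - attribute_groups["spam (class label)"]

# Class counts, as listed under "Class Distribution".
spam, non_spam = 1813, 2788
instances = spam + non_spam
spam_pct = round(100 * spam / instances, 1)
non_spam_pct = round(100 * non_spam / instances, 1)

print(total_attributes, continuous_attributes, instances, spam_pct, non_spam_pct)
```

The sums reproduce the stated 58 attributes (57 continuous plus the class label), 4601 instances, and the 39.4% / 60.6% class split.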

    10. Attribute Statistics (Min, Max, Average, Std.Dev, Coeff.Var_%): - Detailed statistics provided for each of the 58 attributes.

    11. Additional Information: - Documentation available in the file 'spambase.DOCUMENTATION' at the UCI Machine Learning Repository: Link

  20. Data from: Set of obfuscated spam dataset by using LeetSpeak transformations...

    • data.niaid.nih.gov
    • portalcientifico.uvigo.gal
    • +1 more
    Updated Mar 22, 2022
    Cite
    Xabier Vidriales (2022). Set of obfuscated spam dataset by using LeetSpeak transformations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6373652
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    Vitor Basto Fernandes
    Enaitz Ezpeleta
    José Ramón Méndez
    Urko Zurutuza
    Xabier Vidriales
    Iñaki Velez de Mendizabal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate the text. The transformed datasets are:

    • YouTube Spam Collection [2, 3], which is available on https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/.
    • A subset of YouTube Comments [4, 5], which is available on http://mlg.ucd.ie/yt/.
    • CSDMC2010, which is available on http://csmining.org/index.php/spam-email-datasets-.html.
    • TREC2007, which is available on https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
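
To illustrate what partial LeetSpeak obfuscation of text can look like, here is a minimal sketch. The character mapping, the `rate` parameter, and the function name are assumptions made for this example; the transformations actually applied to the datasets above are not specified in this description.

```python
import random

# Hypothetical substitution table; real obfuscation schemes vary widely.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def obfuscate(text, rate, rng):
    """Replace each mappable character with its LeetSpeak form with probability `rate`."""
    out = []
    for ch in text:
        replacement = LEET_MAP.get(ch.lower())
        if replacement is not None and rng.random() < rate:
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(42)
full = obfuscate("free offer", 1.0, rng)      # every mappable character replaced
untouched = obfuscate("free offer", 0.0, rng)  # nothing replaced
partial = obfuscate("free offer inside", 0.5, rng)
print(full, "|", untouched, "|", partial)
```

A rate between 0 and 1 gives the partially obfuscated text that a deobfuscation step would then have to normalize before classification.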

Data from: Spam email Dataset

This dataset contains a collection of email text messages, each labeled as spam or not spam.

7 scholarly articles cite this dataset.