http://www.gnu.org/licenses/lgpl-3.0.html
Dataset Name: Spam Email Dataset
Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.
Columns:
text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.
spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
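A minimal sketch of reading such a file with Python's standard library, assuming it is a CSV with a header row using the two column names above. The sample rows are hypothetical and only illustrate the layout; they are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample that follows the documented columns (text, spam_or_not).
SAMPLE_CSV = """text,spam_or_not
"Subject: meeting notes. See the attached agenda for Monday.",0
"Subject: WIN CASH NOW. Click here to claim your prize!",1
"""


def load_labeled_emails(handle):
    """Return (text, label) pairs, with the label converted to int 0 or 1."""
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["spam_or_not"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((record["text"], label))
    return rows


emails = load_labeled_emails(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in emails)
print(label_counts)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`, where the path and encoding are assumptions to check against the actual download.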
Spam Classification Dataset
Overview
The Spam Classification Dataset contains a collection of SMS messages labeled as either "spam" or "ham" (non-spam). This dataset is designed for binary text classification tasks, where the goal is to classify an SMS message as either spam or non-spam based on its content.
Dataset Structure
The dataset is provided as a single CSV file named spam.csv. It contains 5,572 entries, with each entry corresponding to an SMS message.… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/spam-classify.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The SMS spam dataset contains a collection of text messages. The dataset includes a diverse range of spam messages, including promotional offers, fraudulent schemes, phishing attempts, and other forms of unsolicited communication.
Each SMS message is represented as a string of text, and each entry in the dataset also has a link to the corresponding screenshot. The dataset's content represents real-life examples of spam messages that users encounter in their everyday communication.
includes the following information:
keywords: sms spam collection, labeled messages, mobile phone spam, spam sms dataset, sms spam classification, spam or not-spam, spam sms database, spam detection system, sma spamming data set, spam filtering system, spambase, feature extraction, spam ham email dataset, classifier, machine learning algorithms, cybersecurity, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explore our comprehensive Spam SMS Classification Dataset designed for NLP and machine learning research.
Anik3t/spam-classification-new dataset hosted on Hugging Face and contributed by the HF Datasets community
This corpus was collected from free or free-for-research sources on the Internet:
- A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages.
- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions were going to be made publicly available.
- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis.
- The SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages.
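As a quick consistency check, the per-source counts listed above can be tallied. The sketch below uses only the numbers from the description; the resulting total of 5,574 agrees with the size quoted for the SMS Spam Collection v.1 elsewhere on this page.

```python
# Message counts per source, copied from the description above.
sources = {
    "Grumbletext": {"spam": 425, "ham": 0},
    "NUS SMS Corpus subset": {"spam": 0, "ham": 3375},
    "Caroline Tag's PhD Thesis": {"spam": 0, "ham": 450},
    "SMS Spam Corpus v.0.1 Big": {"spam": 322, "ham": 1002},
}

total_spam = sum(counts["spam"] for counts in sources.values())
total_ham = sum(counts["ham"] for counts in sources.values())
total = total_spam + total_ham
spam_share = total_spam / total

print(f"spam={total_spam} ham={total_ham} total={total}")
# spam=747 ham=4827 total=5574
print(f"spam share: {spam_share:.1%}")
# spam share: 13.4%
```

The roughly 13% spam share is worth keeping in mind when evaluating a classifier on this corpus, since always predicting ham would already be right for most messages.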
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📲 SMS OTP Spam Dataset
A synthetic dataset of 10,000 OTP-style SMS messages for spam classification tasks. The dataset includes both valid and spam-like messages, with labels for message validity and delivery status.
📊 Dataset Summary
- Total samples: 10,000
- Valid messages: 90%
- Not valid (spam-like): 10%
- Status types: delivered, failed, spam, bounced, expired
Each entry includes:
- phone_id: Synthetic phone number
- sms_text: Message content
- label: valid or not valid…

See the full description on the dataset page: https://huggingface.co/datasets/alusci/sms-otp-spam-dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Helloquant
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://academictorrents.com/nolicensespecified
The problem of email image spam classification has been known since 2005. There are several approaches to this task; recent ones use convolutional neural networks (CNNs). We propose a novel approach to the image spam classification task. Our approach is based on a CNN and transfer learning, namely ResNet v1 used for semantic feature extraction and a one-layer feedforward neural network for classification. We have shown that this approach can achieve state-of-the-art performance on publicly available datasets: a 99% F1-score on two datasets [dredze 2007, Princeton] and a 96% F1-score on the combination of these datasets. Due to the availability of GPUs, this approach may be used for just-in-time classification in anti-spam systems handling huge amounts of emails. We have also observed that the mentioned publicly available datasets are no longer representative. We overcame this limitation by using a much richer dataset from one week of real traffic of the freemail provider Email.
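For readers unfamiliar with the reported metric, the following sketch shows how an F1-score is computed from confusion-matrix counts, treating spam as the positive class. The counts in the example are hypothetical and are not results from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall.

    tp: spam correctly flagged, fp: legitimate items flagged as spam,
    fn: spam that was missed.
    """
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to illustrate the arithmetic.
example_f1 = f1_score(tp=99, fp=1, fn=1)
print(round(example_f1, 4))
```

With these counts, precision and recall are both 0.99, so the F1-score is also 0.99.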
This dataset was created by Neil David
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Spam Detection Dataset
This is a dataset for the spam classification task. It contains:
- 'train' subset with 8175 samples
- 'validation' subset with 1362 samples
- 'test' subset with 1636 samples
Source and modifications
This dataset is cloned from Deysi/spam-detection-dataset with the following added processing:
- Convert the 'string' label to an 'id' label so that it can be used and trained directly with transformer's trainer
- Split the original 'test' dataset (2725 samples) into 2… See the full description on the dataset page: https://huggingface.co/datasets/tanquangduong/spam-detection-dataset-splits.
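The string-to-id conversion mentioned above can be sketched as a simple lookup. The label names `not_spam` and `spam` and the field names `text` and `label` are assumptions made for illustration; the dataset card excerpt does not list them.

```python
# Assumed label names; check the actual dataset before relying on them.
LABEL2ID = {"not_spam": 0, "spam": 1}
ID2LABEL = {index: name for name, index in LABEL2ID.items()}


def encode_labels(records):
    """Return copies of the records with string labels replaced by integer ids."""
    return [
        {"text": record["text"], "label": LABEL2ID[record["label"]]}
        for record in records
    ]


# Hypothetical records in the assumed layout.
raw_records = [
    {"text": "Your parcel is waiting, pay the fee here", "label": "spam"},
    {"text": "Lunch at noon tomorrow?", "label": "not_spam"},
]
encoded = encode_labels(raw_records)
print(encoded)
```

Keeping both `LABEL2ID` and `ID2LABEL` around makes it easy to turn model predictions back into readable labels.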
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The SMS Spam Collection v.1 is a set of SMS messages that have been collected and labeled as either spam or not spam. This dataset contains 5574 English, real, and non-encoded messages. The SMS messages are thought-provoking and eye-catching. The dataset is useful for mobile phone spam research.
- This dataset could be used to train a machine learning model to classify SMS messages as spam or not spam.
- This dataset could be used to develop a tool that can automatically identify and block spam messages.
- This dataset could be used to study the characteristics of spam messages and develop strategies for identifying and avoiding them.
This dataset is used to train a machine learning model to classify SMS messages as spam or not spam.
The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam research. This dataset contains 5574 English, real, and non-encoded messages, tagged as being legitimate (ham) or spam. The dataset has been collected from various sources and is released under the CC BY-SA 4.0 license by Kaggle user Almeida et al.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| sms | The text of the SMS message. (String) |
| label | The label for the SMS message, indicating whether it is ham or spam. (String) |
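To illustrate the kind of model this file could feed, here is a small multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is a sketch, not the method used by the dataset authors, and the four training rows are hypothetical stand-ins for `(sms, label)` pairs read from train.csv.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_naive_bayes(rows):
    """rows: iterable of (sms, label) pairs, where label is 'ham' or 'spam'."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for sms, label in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(sms))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def predict(model, sms):
    """Return the label with the highest log-posterior under add-one smoothing."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(sms):
            if token in vocabulary:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training rows in the (sms, label) layout of train.csv.
training_rows = [
    ("Free prize! Claim your free cash now", "spam"),
    ("Win cash now, text WIN to claim", "spam"),
    ("Are we still meeting for lunch today?", "ham"),
    ("I'll call you after the lecture", "ham"),
]
model = train_naive_bayes(training_rows)
print(predict(model, "claim your free prize"))
print(predict(model, "meeting after lunch"))
```

On the real data, the rows would come from a CSV reader and a held-out portion would be needed to measure accuracy.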
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data consists of raw mail messages suitable for NLP pre-processing steps such as tokenizing, removing stop words, stemming, and parsing HTML tags. These steps are very important for anyone entering the NLP world. The dataset also goes hand in hand with NLP library tools such as vectorizers.
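The four pre-processing steps named above can be sketched with the standard library alone. The stop-word list and the suffix-stripping stemmer below are deliberately tiny, hypothetical stand-ins; a real pipeline would normally use a full stop-word list and a proper stemmer from an NLP library.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Parse HTML and return only its text content."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "your", "you"}


def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw):
    """HTML parsing, tokenizing, stop-word removal, then stemming."""
    text = strip_html(raw).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("<p>Claiming your <b>prizes</b> is easy!</p>"))
# ['claim', 'prize', 'easy']
```

The resulting token lists are the usual input to a vectorizer, which turns each message into a numeric feature vector.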
Original Data Source: Spam Classification for Basic NLP
https://creativecommons.org/publicdomain/zero/1.0/
This is a synthetic dataset for training and testing spam detection models. It contains 20,000 email samples, and each sample is described by five features and one label.
- num_links: … (λ) of 1.5
- num_words
- has_offer
- sender_score
- all_caps
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems.
The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines, excessive use of advertisements, unauthorized links, or attempts to collect personal information.
The non-spam emails in the dataset are genuine and legitimate messages sent by individuals or organizations. They may include personal or professional communication, newsletters, transaction receipts, or any other non-malicious content.
The dataset encompasses emails of varying lengths, languages, and writing styles, reflecting the inherent heterogeneity of email communication. This diversity aids in training algorithms that can generalize well to different types of emails, making them robust against different spammer tactics and variations in non-spam email content.
Original Data Source: Spam Mail Prediction Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Image spam is a type of spam that contains text information inserted in an image file. Traditional classification systems based on feature engineering require manual extraction of certain quantitative and qualitative image features for classification. However, these systems are often not robust to adversarial attacks. In contrast, classification pipelines that use convolutional neural network (CNN) models automatically extract features from images. This approach has been shown to achieve high accuracies even on challenge datasets that are designed to defeat the purpose of classification. We propose a method for improving the performance of CNN models for image spam classification. Our method uses the concept of error level analysis (ELA) as a pre-processing step. ELA is a technique for detecting image tampering by analyzing the error levels of the image pixels. We show that ELA can be used to improve the accuracy of CNN models for image spam classification, even on challenge datasets. Our results demonstrate that the application of ELA as a pre-processing technique in our proposed model can significantly improve the results of the classification tasks on image spam datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset can serve as an example and golden example dataset for LLM assistant few shot prompts and for evaluation and validation.
Dataset Details
Dataset Sources [optional]
Repository: [More Information Needed]
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]
Uses
Direct Use
[More Information Needed]
Out-of-Scope Use
[More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/vijayagrawal/spam-email-classification.
http://opendatacommons.org/licenses/dbcl/1.0/
1. Title: SPAM E-mail Database
2. Sources:
- Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt (Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304)
- Donor: George Forman (gforman at nospam hpl.hp.com, 650-857-7835)
- Generated: June-July 1999
3. Past Usage:
- Hewlett-Packard Internal-only Technical Report. External forthcoming.
- Used to determine whether a given email is spam or not.
- Approximately 7% misclassification error.
- Emphasis on minimizing false positives (marking good mail as spam) due to their undesirability.
- Even with the insistence on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.
4. Relevant Information:
- The concept of "spam" is diverse, encompassing advertisements, make-money-fast schemes, chain letters, and pornography.
- Spam emails were collected from the postmaster and individuals who reported spam.
- Non-spam emails were collected from work and personal sources, with the words 'george' and the area code '650' indicating non-spam.
- Non-spam indicators like 'george' and '650' need to be handled carefully or require a broad collection of non-spam for a general-purpose spam filter.
- Background information on spam: Cranor, Lorrie F., LaMacchia, Brian A. "Spam!" Communications of the ACM, 41(8):74-83, 1998.
5. Number of Instances: 4601 (1813 Spam = 39.4%)
6. Number of Attributes: 58 (57 continuous, 1 nominal class label)
7. Attribute Information:
- 48 continuous real [0,100] attributes of type word_freq_WORD: Percentage of words in the email that match the specified word.
- 6 continuous real [0,100] attributes of type char_freq_CHAR: Percentage of characters in the email that match the specified character.
- 1 continuous real [1,...] attribute of type capital_run_length_average: Average length of uninterrupted sequences of capital letters.
- 1 continuous integer [1,...] attribute of type capital_run_length_longest: Length of the longest uninterrupted sequence of capital letters.
- 1 continuous integer [1,...] attribute of type capital_run_length_total: Sum of the length of uninterrupted sequences of capital letters.
- 1 nominal {0,1} class attribute of type spam: Denotes whether the email was considered spam (1) or not (0), i.e., unsolicited commercial e-mail.
8. Missing Attribute Values: None
9. Class Distribution:
- Spam: 1813 (39.4%)
- Non-Spam: 2788 (60.6%)
10. Attribute Statistics (Min, Max, Average, Std.Dev, Coeff.Var_%):
- Detailed statistics provided for each of the 58 attributes.
11. Additional Information:
- Documentation available in the file 'spambase.DOCUMENTATION' at the UCI Machine Learning Repository.
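The attribute definitions in item 7 can be illustrated with a short sketch that computes the same kinds of features from raw text. It assumes a word is a run of alphanumeric characters and that matching is case-insensitive; the dataset's own extraction rules may differ in detail. Returning zeros when a text has no capital letters is also an assumption, since the documented range starts at 1.

```python
import re


def word_freq(text, word):
    """Percentage of words in the text that match the given word."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for candidate in words if candidate.lower() == word.lower())
    return 100.0 * matches / len(words)


def char_freq(text, char):
    """Percentage of characters in the text that match the given character."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_lengths(text):
    """Average, longest, and total length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }


sample = "FREE money!!! Call NOW"  # hypothetical message
print(word_freq(sample, "free"))
print(char_freq(sample, "!"))
print(capital_run_lengths(sample))
```

In the sample, the capital runs are "FREE", "C", and "NOW", so the longest run is 4 and the total is 8, and one of the four words matches "free".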
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate the text. The transformed datasets are:
- YouTube Spam Collection [2, 3], available at https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/.
- A subset of YouTube Comments [4, 5], available at http://mlg.ucd.ie/yt/.
- CSDMC2010, available at http://csmining.org/index.php/spam-email-datasets-.html.
- TREC2007, available at https://plg.uwaterloo.ca/~gvcormac/treccorpus07/.
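To make the idea of partial obfuscation concrete, here is a minimal sketch of a LeetSpeak substitution and a naive reversal. The character map, the `rate` parameter, and the seeding are hypothetical choices for illustration; they are not the transformation used to build these datasets.

```python
import random

# Hypothetical LeetSpeak substitutions; real spam uses many more variants.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}
REVERSE_MAP = {digit: letter for letter, digit in LEET_MAP.items()}


def obfuscate(text, rate=0.5, seed=0):
    """Replace each mappable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    out = []
    for char in text:
        substitute = LEET_MAP.get(char.lower())
        if substitute is not None and rng.random() < rate:
            out.append(substitute)
        else:
            out.append(char)
    return "".join(out)


def deobfuscate(text):
    """Naive reversal: maps every known digit back to a lowercase letter."""
    return "".join(REVERSE_MAP.get(char, char) for char in text)


fully_obfuscated = obfuscate("free offer", rate=1.0)
print(fully_obfuscated)               # fr33 0ff3r
print(deobfuscate(fully_obfuscated))  # free offer
```

The naive reversal also rewrites genuine digits such as prices or phone numbers and loses the original letter case, which is one reason deobfuscation quality needs to be evaluated rather than assumed.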