Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Email Spam Classification
The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems. The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/email-spam-classification.
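For quick experimentation, the dataset can be pulled directly from the Hugging Face Hub. A minimal loading sketch is shown below; the split and column names are assumptions, so check the dataset card before relying on them.

```python
# Hypothetical loading sketch for the UniqueData/email-spam-classification dataset.
# The split and column names are assumptions; verify them on the dataset card.
from datasets import load_dataset

ds = load_dataset("UniqueData/email-spam-classification")
print(ds)                      # shows the available splits and columns
first_split = next(iter(ds))   # e.g. "train", whichever split exists
print(ds[first_split][0])      # inspect one labeled email
```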
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description: In an era where communication is predominantly digital, SMS spam poses a significant challenge, cluttering inboxes and sometimes even posing security risks. Our "SMS Spam Detection Dataset" is tailored to empower machine learning enthusiasts, data scientists, and researchers to tackle this pervasive issue using the power of AI. This dataset is meticulously curated to provide a robust foundation for developing and benchmarking spam detection models.
Dataset Overview: The dataset comprises two columns: 'Text' and 'Label', containing the SMS content and corresponding labels ('ham' for regular messages and 'spam' for unsolicited messages), respectively. With a diverse collection of messages, this dataset serves as an ideal playground for exploring various text processing and machine learning techniques.
Potential Uses:
- Spam Detection Models: Use the dataset to train binary classification models capable of distinguishing between spam and ham messages with high accuracy.
- Natural Language Processing (NLP) Techniques: Experiment with different NLP methodologies, including tokenization, stemming, lemmatization, and the application of word embeddings or transformers to understand the nuances of SMS language.
- Feature Engineering: Explore how different features, such as message length, punctuation usage, and keyword frequency, can impact model performance.
- Model Benchmarking: Compare the effectiveness of various machine learning algorithms, from classical approaches like Naive Bayes and SVM to advanced deep learning models like LSTM and BERT (a minimal baseline sketch follows below).
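As a concrete starting point, the sketch below trains a TF-IDF plus Multinomial Naive Bayes baseline. It assumes the dataset has been downloaded locally as a CSV file (the filename `sms_spam.csv` is a placeholder) with the 'Text' and 'Label' columns described above.

```python
# Minimal spam/ham baseline: TF-IDF features + Multinomial Naive Bayes.
# "sms_spam.csv" is a placeholder filename; 'Text' and 'Label' follow the dataset description.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

df = pd.read_csv("sms_spam.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Label"], test_size=0.2, stratify=df["Label"], random_state=42
)

model = make_pipeline(TfidfVectorizer(lowercase=True, ngram_range=(1, 2)), MultinomialNB())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```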
Challenges & Opportunities: While the dataset offers a straightforward binary classification task, the real challenge lies in dealing with the nuances of natural language, including slang, abbreviations, and the evolving nature of spam tactics. Innovators in the field can explore advanced techniques like transfer learning and semi-supervised models to push the boundaries of what's possible in spam detection.
The Anik3t/spam-classification dataset is hosted on Hugging Face and contributed by the HF Datasets community.
The Spam dataset is based on the Enron email data, specifically the BG section of spam emails and the Kaminski section of ham emails, combined into a dataset of 5,000 emails for spam classification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a CSV file containing 5,157 randomly selected emails and their labels for spam or not-spam classification. The file contains 5,157 rows, one per email, and two columns: the first gives the email category (spam or ham), and the second contains the email text.
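A short sketch of inspecting the class balance of such a two-column file is given below; the filename and the column names `Category` and `Message` are placeholders, since the actual header names are not stated.

```python
# Inspect the two-column spam/ham CSV described above.
# "emails.csv", "Category" and "Message" are placeholder names; adjust to the actual header.
import pandas as pd

df = pd.read_csv("emails.csv", names=["Category", "Message"], header=0)
print(len(df), "emails")
print(df["Category"].value_counts())  # spam vs. ham counts
print(df.groupby("Category")["Message"].apply(lambda s: s.str.len().mean()))  # mean length per class
```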
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of a CSV file containing 300 generated email spam messages. Each row in the file represents a separate email message, with its title and text. The dataset aims to facilitate the analysis and detection of spam emails. It can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Inzamam Safi
Released under CC0: Public Domain
No license specified: https://academictorrents.com/nolicensespecified
The problem of email image spam has been known since 2005, and several approaches to the task exist; recent ones use convolutional neural networks (CNNs). We propose a novel approach to the image spam classification task based on CNNs and transfer learning, namely ResNet v1 for semantic feature extraction and a one-layer feedforward neural network for classification. We show that this approach can achieve state-of-the-art performance on publicly available datasets: a 99% F1-score on two datasets [Dredze 2007, Princeton] and a 96% F1-score on the combination of these datasets. Thanks to the availability of GPUs, the approach can be used for just-in-time classification in anti-spam systems handling huge volumes of email. We have also observed that the aforementioned publicly available datasets are no longer representative; we overcame this limitation by using a much richer dataset drawn from one week of real traffic of the freemail provider Email.
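The described architecture maps naturally onto a frozen ResNet feature extractor with a single linear classification layer. A sketch in PyTorch is shown below; the ResNet depth, the torchvision ImageNet weights, and the two-class head are assumptions, not the authors' exact configuration.

```python
# Transfer-learning sketch: frozen ResNet backbone + one-layer feedforward classifier.
# ResNet-50 with ImageNet weights is an assumption; the description only states "ResNet v1".
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in backbone.parameters():                    # freeze the semantic feature extractor
    param.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, 2)    # trainable spam vs. ham head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)  # only the head is optimized
criterion = nn.CrossEntropyLoss()

dummy = torch.randn(4, 3, 224, 224)                    # stand-in for a batch of email images
logits = backbone(dummy)
print(logits.shape)                                     # torch.Size([4, 2])
```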
This is a project that uses machine learning algorithms to classify emails and SMS messages as either spam or not spam.
This dataset was created by Ayasya Batta
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Spam Text Message Classification’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/team-ai/spam-text-message-classification on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Coming Soon
Coming Soon
Special thanks to: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Coming soon
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Image spam is a type of spam that contains text information inserted in an image file. Traditional classification systems based on feature engineering require manual extraction of certain quantitative and qualitative image features for classification. However, these systems are often not robust to adversarial attacks. In contrast, classification pipelines that use convolutional neural network (CNN) models automatically extract features from images. This approach has been shown to achieve high accuracies even on challenge datasets that are designed to defeat the purpose of classification. We propose a method for improving the performance of CNN models for image spam classification. Our method uses the concept of error level analysis (ELA) as a pre-processing step. ELA is a technique for detecting image tampering by analyzing the error levels of the image pixels. We show that ELA can be used to improve the accuracy of CNN models for image spam classification, even on challenge datasets. Our results demonstrate that the application of ELA as a pre-processing technique in our proposed model can significantly improve the results of the classification tasks on image spam datasets.
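For illustration, error level analysis can be implemented by re-saving an image as JPEG at a fixed quality and amplifying the per-pixel difference. A minimal sketch with Pillow is given below; the quality setting of 90 is an assumed parameter, not necessarily the value used in the paper.

```python
# Error level analysis (ELA) pre-processing sketch using Pillow.
# The JPEG quality of 90 is an assumption; the paper's exact setting may differ.
import io
from PIL import Image, ImageChops, ImageEnhance

def error_level_analysis(path: str, quality: int = 90) -> Image.Image:
    original = Image.open(path).convert("RGB")
    buffer = io.BytesIO()
    original.save(buffer, "JPEG", quality=quality)    # re-compress at a known quality
    buffer.seek(0)
    resaved = Image.open(buffer)
    ela = ImageChops.difference(original, resaved)    # per-pixel compression error
    max_diff = max(channel_max for _, channel_max in ela.getextrema()) or 1
    return ImageEnhance.Brightness(ela).enhance(255.0 / max_diff)  # rescale before feeding the CNN
```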
The dataset used in the paper is the TREC05 spam corpus, which contains 39,999 real ham emails and 52,790 spam emails.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Coming Soon
Coming Soon
Special thanks to: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Coming soon
This dataset was created by Akalya Subramanian
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.
The following table presents the definitions of categories used for classifying MMS images.
Table 1: Category definitions
| Category | Description |
|---|---|
| Alcohol* | Content related to alcoholic beverages, including advertisements and consumption. |
| Drugs* | Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine, etc.). |
| Firearms* | Content involving guns, pistols, knives, or military weapons. |
| Gambling* | Content related to gambling (casinos, poker, roulette, lotteries). |
| Sexual | Content involving nudity, sexual acts, or sexually suggestive material. |
| Tobacco* | Content related to tobacco use and advertisements. |
| Violence | Content showing violent acts, self-harm, or injury. |
| Safe | All other content, including neutral depictions, products, or harmless cultural symbols. |
Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.
The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.
The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review. The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.
Another 25.1% of images were sourced from open datasets hosted on Roboflow.
The NudeNet dataset contributes 11.4% of the dataset. We sampled 1,000 images from the “porn” category to provide visual coverage of explicit sexual content.
Another 11.0% of images were collected from publicly available Kaggle datasets.
An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.
Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.
Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.
All images collected from the listed sources were manually reviewed by three independent annotators, and each image was assigned to a category when at least two annotators reached consensus.
Table 2: Distribution of images per public source and category in SIMAS dataset
| Type | Category | LAION | Roboflow | NudeNet | Kaggle | Unsplash | UnsafeBench | Other | Total |
|---|---|---|---|---|---|---|---|---|---|
| Unsafe | Alcohol | 29 | 0 | 3 | 267 | 0 | 1 | 0 | 300 |
| Unsafe | Drugs | 17 | 211 | 0 | 0 | 13 | 8 | 1 | 250 |
| Unsafe | Firearms | 0 | 59 | 0 | 229 | 0 | 62 | 0 | 350 |
| Unsafe | Gambling | 132 | 38 | 0 | 0 | 73 | 39 | 18 | 300 |
| Unsafe | Sexual | 2 | 0 | 421 | 0 | 3 | 68 | 6 | 500 |
| Unsafe | Tobacco | 0 | 446 | 0 | 0 | 43 | 11 | 0 | 500 |
| Unsafe | Violence | 0 | 289 | 0 | 0 | 0 | 11 | 0 | 300 |
| Safe | Alcohol | 140 | 35 | 0 | 0 | 16 | 13 | 96 | 300 |
| Safe | Drugs | 67 | 49 | 0 | 15 | 72 | 17 | 30 | 250 |
| Safe | Firearms | 173 | 15 | 0 | 3 | 144 | 8 | 7 | 350 |
| Safe | Gambling | 164 | 2 | 0 | 1 | 121 | 12 | 0 | 300 |
| Safe | Sexual | 235 | 22 | 139 | 2 | 0 | 94 | 8 | 500 |
| Safe | Tobacco | 351 | 67 | 5 | 13 | 8 | 16 | 40 | 500 |
| Safe | Violence | 212 | 20 | 3 | 21 | 0 | 42 | 2 | 300 |
| All | All | 1,522 | 1,253 | 571 | 551 | 493 | 402 | 208 | 5,000 |
To ensure semantic diversity and dataset balance, undersampling was performed on overrepresented categories using a CLIP-based embedding and k-means clustering strategy. This resulted in a final dataset containing 2,500 spam and 2,500 safe images, evenly distributed across all categories.
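A sketch of this undersampling step is given below, using a CLIP image encoder from the `transformers` library and scikit-learn k-means, keeping the image nearest to each cluster centroid. The specific CLIP checkpoint and the one-image-per-cluster selection rule are assumptions about the described procedure.

```python
# Cluster-based undersampling sketch: CLIP embeddings + k-means, keep one image per cluster.
# The CLIP checkpoint and the nearest-to-centroid selection rule are assumptions.
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths, batch_size=32):
    """Compute L2-normalized CLIP image embeddings for a list of file paths."""
    feats = []
    for i in range(0, len(paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            f = model.get_image_features(**inputs)
        feats.append(torch.nn.functional.normalize(f, dim=-1))
    return torch.cat(feats).numpy()

def undersample(paths, target_size, seed=0):
    """Return `target_size` paths, one per k-means cluster (the member nearest to its centroid)."""
    X = embed_images(paths)
    km = KMeans(n_clusters=target_size, n_init=10, random_state=seed).fit(X)
    kept = []
    for c in range(target_size):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        kept.append(paths[members[np.argmin(dists)]])
    return kept
```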
Table 3: Distribution of images per category in SIMAS