https://sqmagazine.co.uk/privacy-policy/
Email, text, and call spam remain major threats. Nearly half of all daily emails are unwanted, and users worldwide encounter growing volumes of phishing and scam content. In retail and financial services, spam erodes customer trust and inflates cybersecurity budgets. Meanwhile, call-based scams cost consumers time and mental strain...
In 2023, nearly 45.6 percent of all e-mails worldwide were identified as spam, down from almost 49 percent in 2022. Although spam remains a large part of e-mail traffic, its share has decreased significantly since 2011. In 2023, the highest volume of spam e-mails was registered in May, at approximately 50 percent of e-mail traffic worldwide.
In 2024, Russia ranked first by its share of unsolicited spam e-mails. Overall, ***** percent of global spam e-mails originated from IPs in Russia. Mainland China ranked second, with ***** percent. The United States followed, accounting for over *** percent of global unsolicited spam e-mails during the measured period.
In 2020, healthcare-related spam e-mails accounted for nearly 33 percent of total spam volume. Spam e-mails with adult content were the second-most common category, around 27 percent. Dating-related junk mail generated approximately 10 percent of spam messages in the same period.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of a CSV file containing 300 generated email spam messages. Each row in the file represents a separate email message, with its title and text. The dataset aims to facilitate the analysis and detection of spam emails. It can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.
https://academictorrents.com/nolicensespecified
Data Set Information: This corpus has been collected from free or free-for-research sources on the Internet:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 randomly chosen SMS ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description: In an era where communication is predominantly digital, SMS spam poses a significant challenge, cluttering inboxes and sometimes even posing security risks. Our "SMS Spam Detection Dataset" is tailored to empower machine learning enthusiasts, data scientists, and researchers to tackle this pervasive issue using the power of AI. This dataset is meticulously curated to provide a robust foundation for developing and benchmarking spam detection models.
Dataset Overview: The dataset comprises two columns: 'Text' and 'Label', containing the SMS content and corresponding labels ('ham' for regular messages and 'spam' for unsolicited messages), respectively. With a diverse collection of messages, this dataset serves as an ideal playground for exploring various text processing and machine learning techniques.
Potential Uses:
Spam Detection Models: Use the dataset to train binary classification models capable of distinguishing between spam and ham messages with high accuracy.
Natural Language Processing (NLP) Techniques: Experiment with different NLP methodologies, including tokenization, stemming, lemmatization, and the application of word embeddings or transformers to understand the nuances of SMS language.
Feature Engineering: Explore how different features, such as message length, punctuation usage, and keyword frequency, can impact model performance.
Model Benchmarking: Compare the effectiveness of various machine learning algorithms, from classical approaches like Naive Bayes and SVM to advanced deep learning models like LSTM and BERT.
Challenges & Opportunities: While the dataset offers a straightforward binary classification task, the real challenge lies in dealing with the nuances of natural language, including slang, abbreviations, and the evolving nature of spam tactics. Innovators in the field can explore advanced techniques like transfer learning and semi-supervised models to push the boundaries of what's possible in spam detection.
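As an illustration of the classical end of that range, the following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written with the standard library only. The four rows are hypothetical examples in the 'Text' / 'Label' layout described above and are not taken from the dataset; the function names `tokenize`, `train`, and `predict` are likewise assumptions made for this sketch.

```python
import math
import re
from collections import Counter

# Hypothetical toy rows in the 'Text' / 'Label' layout; not real dataset rows.
ROWS = [
    ("Win a free prize now, claim your cash", "spam"),
    ("Free entry to win cash, text now", "spam"),
    ("Are we still meeting for lunch today", "ham"),
    ("Can you call me when you get home", "ham"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(rows):
    """Count messages per label and token occurrences per label."""
    doc_counts = Counter()
    word_counts = {}
    for text, label in rows:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(ROWS)
print(predict("claim your free cash prize", model))
print(predict("call me after lunch", model))
```

With real data, the rows would come from the CSV file and a held-out split would be needed to measure accuracy; the toy set here only shows the mechanics.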
Preprocessed data derived from the "spam-mails" dataset, containing email messages labeled as spam or ham. Each record includes a unique identifier from the original dataset and an experiment_id indicating its assignment to a specific data split (training, validation, or test) used in this experiment. The email content has been lemmatized and cleaned to remove noise such as punctuation, special characters, and stopwords, ensuring consistent input for embedding and model training. Original data source: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
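The cleaning and split assignment described above might look roughly like the following sketch. It is not the pipeline used to build this dataset: the stopword list is a small illustrative assumption, lemmatization is omitted because it needs an external library, and the CRC32-based `assign_split` helper is a hypothetical stand-in for however the experiment_id values were actually assigned.

```python
import re
import zlib

# Small illustrative stopword list (an assumption, not the dataset's list).
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "you"}


def clean(text):
    """Lowercase, drop punctuation and special characters, remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def assign_split(record_id, train=0.8, validation=0.1):
    """Map a record identifier to a split deterministically via CRC32."""
    bucket = (zlib.crc32(str(record_id).encode("utf-8")) % 1000) / 1000
    if bucket < train:
        return "training"
    if bucket < train + validation:
        return "validation"
    return "test"


print(clean("WIN a FREE prize!!! Click here to claim $$$"))
print(assign_split(42))
```

Hashing the identifier keeps a record in the same split across runs, which is one common way to make experiments reproducible.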
How to Access:
To access this dataset, please contact Francisco Janez via email at francisco.janez@unileon.es. Access will be granted based on specific requests.
Purpose: The PerSentSE corpus was developed to study persuasive techniques in spam emails. It includes 130 emails randomly selected from the SpamArchive2122 dataset, which contains over 20,000 spam emails in English.
Methodology:
Segmentation: Emails were divided into sentences using the NLTK library.
Annotation: Eight persuasive techniques, along with a "non-persuasion" class, were identified. Two expert annotators labeled an initial subset of emails to measure inter-annotator agreement, achieving a final acceptable level (γ = 0.63).
Corpus Statistics:
Total sentences: 1,075
Persuasive sentences: 216 (20.1%)
Persuasion Distribution by Email Sections (Table 7):
Subject lines: 35.59% persuasive, with an average of 1.62 techniques.
Greeting section: 54.17% persuasive, averaging 1.46 techniques.
Email body: 82.46% persuasive, with 5.51 techniques on average.
Farewell section: 31.43% persuasive, averaging 1.45 techniques.
Co-occurrence of Techniques (Figure 2): Some persuasive techniques frequently appeared together:
Appeal to Fear/Prejudice with Loaded Language: 25 instances.
Exaggeration/Minimization with Loaded Language: 24 instances.
Appeal to Fear/Prejudice with Exaggeration/Minimization: 20 instances.
Findings: The body section of emails concentrates the highest number of persuasive elements, in contrast to earlier studies that focused on subject lines alone. This suggests that spam emails rely heavily on persuasive content in their main text.
SPAM E-mail Database
The "spam" concept is diverse: advertisements for products/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word "george" and the area code "650" are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
Attribute Information:
The last column denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.
For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
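The attribute definitions above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the code used to build the database: the function names are hypothetical, word matching is assumed to be case-insensitive, and returning zeros for a message with no capital letters is a choice made for this sketch (the stated range for the run-length attributes starts at 1).

```python
import re


def word_freq(text, word):
    """100 * (occurrences of word) / (total words); words are alphanumeric runs."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    # Case-insensitive matching is an assumption of this sketch.
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)


def char_freq(text, char):
    """100 * (occurrences of char) / (total characters in the text)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_features(text):
    """Return (average, longest, total) length of uninterrupted capital runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0  # sketch-specific choice for text without capitals
    return sum(runs) / len(runs), max(runs), sum(runs)


sample = "FREE money!!! Call NOW for free cash"
print(word_freq(sample, "free"))     # 2 of 7 words match
print(char_freq(sample, "!"))        # 3 of 36 characters match
print(capital_run_features(sample))  # runs are FREE, C, NOW
```

For the sample string, the capital runs have lengths 4, 1, and 3, so the total (8) equals the number of capital letters, as the definition of capital_run_length_total states.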
https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of one collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
English
Dataset Structure
Data Instances
[More Information... See the full description on the dataset page: https://huggingface.co/datasets/ucirvine/sms_spam.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Email Spam Classification
The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems. The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines... See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/email-spam-classification.
Spam messages accounted for over **** percent of e-mail traffic in December 2023. Russia generated the largest share of unsolicited spam e-mails in 2022, with **** percent of global spam e-mails originating from the country.
Spam worldwide
It is almost impossible to think about e-mail without considering the issue of spam, which usually includes billions of promotional e-mails marketers send daily. As of January 2023, the United States had the highest number of spam e-mails sent daily. While many e-mail users believe such content belongs in their spam folder, marketing e-mails are generally harmless if annoying to the user.
Malicious spam
Phishing e-mails remain one of the primary attack vectors for cybercriminals. On average, around ** percent of businesses worldwide experience four to six successful cyber attacks in one year. Another ** percent said they became victims of more than ** bulk phishing attacks. More than half of the companies said these phishing attacks resulted in consumer or client data breaches.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The SMS spam dataset contains a collection of text messages. The dataset includes a diverse range of spam messages, including promotional offers, fraudulent schemes, phishing attempts, and other forms of unsolicited communication. Each SMS message is represented as a string of text, and each entry in the dataset also has a link to the corresponding screenshot. The dataset's content represents real-life examples of spam messages that users encounter in their everyday communication.
Facebook removed 165 million pieces of spam in the second quarter of 2025, down from 366 million pieces in the previous quarter. The fourth quarter of 2019 saw almost three billion pieces of spam being removed from the social network. Meta Platforms states that spam is not allowed on Facebook, and defines spam as deceptive or annoying content used to drive engagement.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. is_spam=1 means spam and is_spam=0 means ham. 1040 rows of balanced data, consisting of casual conversations and scam emails in ≈10 languages, were manually collected and annotated by me, with some help from ChatGPT.
Some preprocessing algorithms
spam_assassin.js, followed by spam_assassin.py enron_spam.py
Data composition
Description
To make the text... See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/all-scam-spam.
The statistic shows the global e-mail spam rate from 2012 to 2018. In the most recently observed period, spam accounted for 55 percent of all e-mail messages, the same as in the previous year.
This dataset was created by Chirag Singh
Spam Mail Dataset
This dataset contains the predicted prices of the asset spam over the next 16 years. The data is initially calculated using a default 5 percent annual growth rate; after page load, a sliding scale lets the user adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
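A projection of this kind can be sketched as below, assuming the growth rate compounds annually, which the description does not state explicitly. The function name `project_prices` and the starting price of 1.0 are hypothetical; only the 5 percent default, the 16-year horizon, and the -100 to 100 percent bounds come from the description.

```python
def project_prices(start_price, annual_growth_pct=5.0, years=16):
    """Project a price forward, assuming annual compounding (an assumption)."""
    if not -100.0 <= annual_growth_pct <= 100.0:
        raise ValueError("growth rate must be between -100 and 100 percent")
    factor = 1.0 + annual_growth_pct / 100.0
    # One projected value per year, for years 1 through `years`.
    return [start_price * factor ** year for year in range(1, years + 1)]


# Hypothetical starting price of 1.0 with the default 5 percent rate.
prices = project_prices(1.0)
print(len(prices))
print(round(prices[0], 4), round(prices[-1], 4))
```

At the lower bound of -100 percent the projected price drops to zero in the first year, and at the upper bound of 100 percent it doubles each year.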