The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as ham (legitimate) or spam.
The files contain one message per line. Each line consists of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
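As a minimal sketch of that two-column layout, the snippet below parses a few rows with Python's standard csv module and counts the labels. The sample messages are invented for illustration, and the comma separator and `v1,v2` header row are assumptions; some distributions of the corpus may use a different separator or no header.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real file (not bundled here).
SAMPLE = '''v1,v2
ham,"Ok, see you at lunch"
spam,"WINNER! Reply now to claim your prize"
ham,Running late
'''


def read_messages(handle):
    """Return (label, text) pairs from a CSV with v1 (label) and v2 (text) columns."""
    reader = csv.DictReader(handle)
    return [(row["v1"].strip().lower(), row["v2"]) for row in reader]


messages = read_messages(io.StringIO(SAMPLE))
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To read a real file, the same function can be passed an open file handle instead of the in-memory sample; the correct encoding depends on the copy of the file you have.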
This corpus was collected from sources on the Internet that are free or free for research use:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without quoting the actual spam message received. Identifying the text of spam messages in the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link].
-> A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web Link].
-> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link].
This corpus has been used in the following academic research:
The original dataset can be found here. The creators ask that, if you find the dataset useful, you cite the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.
We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
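The per-source counts listed for the SMS Spam Collection above can be cross-checked against the stated total of 5,574 messages. The following is a small arithmetic sketch; the dictionary layout and function name are illustrative only, while the numbers are taken from the description.

```python
# Message counts per source, copied from the corpus description above.
SOURCES = {
    "Grumbletext": {"spam": 425, "ham": 0},
    "NUS SMS Corpus": {"spam": 0, "ham": 3375},
    "Caroline Tag's PhD Thesis": {"spam": 0, "ham": 450},
    "SMS Spam Corpus v.0.1 Big": {"spam": 322, "ham": 1002},
}


def totals(sources):
    """Sum spam and ham counts across all sources."""
    spam = sum(counts["spam"] for counts in sources.values())
    ham = sum(counts["ham"] for counts in sources.values())
    return spam, ham


spam_total, ham_total = totals(SOURCES)
spam_share = spam_total / (spam_total + ham_total)

print(spam_total, ham_total, spam_total + ham_total)
print(f"spam share: {spam_share:.1%}")
```

The totals show that the corpus is imbalanced toward ham, which is worth keeping in mind when choosing evaluation metrics for a classifier trained on it.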
Using the Dataset
The dataset used for training and evaluation is available here. You can load it using the datasets library:
from datasets import load_dataset
dataset = load_dataset("SparkyPilot/scam-detection-data")
print(dataset["train"][0])  # Print the first example in the training set
Links to the datasets taken from different sources are listed below:
spam.csv [https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]… See the full description on the dataset page: https://huggingface.co/datasets/SparkyPilot/scam-detection-data.
https://www.marketresearchforecast.com/privacy-policy
The SMS Firewall market, valued at $3,602.1 million in 2025, is experiencing robust growth driven by increasing concerns over SMS-based threats like spam, phishing, and malware. The rising adoption of mobile banking and e-commerce fuels demand for robust security solutions, making SMS firewalls a critical component of overall cybersecurity strategies. Key application segments include BFSI (Banking, Financial Services, and Insurance), where secure transactions are paramount, and the burgeoning entertainment and retail sectors, reliant on SMS-based communications for promotions and customer engagement. The market's segmentation also encompasses A2P (Application-to-Person) and P2A (Person-to-Application) messaging, reflecting the diverse ways businesses and individuals utilize SMS. Technological advancements, such as AI-powered threat detection and improved filtering techniques, further enhance the effectiveness of SMS firewalls and contribute to market expansion.
Geographic growth is expected to be diverse, with North America and Europe holding significant market share initially due to high technological adoption and stringent regulatory frameworks. However, rapid digitalization in Asia-Pacific and the Middle East & Africa presents substantial growth opportunities in the coming years. Competition in the market is intense, with established players like Tata Communications and Sinch vying with newer entrants for market share. This competitive landscape fosters innovation and drives down prices, making SMS firewall solutions increasingly accessible to a broader range of businesses and organizations.
The forecast period (2025-2033) anticipates continued market expansion, fuelled by evolving threats and increased regulatory scrutiny. Factors such as the increasing sophistication of malicious SMS campaigns, the rise of 5G technology (which may increase SMS vulnerabilities), and evolving privacy regulations will continue to shape market dynamics.
While the precise CAGR is unavailable, a conservative estimate considering industry growth trends and the inherent need for robust security in an increasingly digital world would place the CAGR in the range of 12-15% annually. This growth projection reflects not only the increasing demand for SMS firewalls but also the ongoing development of more sophisticated solutions capable of countering increasingly complex threats. The market is expected to see significant consolidation, with larger players acquiring smaller firms to expand their product portfolios and geographic reach.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
## Overview
Spam is a dataset for object detection tasks - it contains Text annotations for 300 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [MIT license](https://creativecommons.org/licenses/MIT).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The diverse fake-text generation practices used by spammers make spam detection challenging. Existing works use manually designed discrete textual or behavioral features, which cannot capture the complex global semantics of text and reviews. Some studies use a limited set of features while neglecting other significant ones. With a large feature set, however, selecting all features leads to overfitting and expensive computation. This research paper addresses the challenges of feature selection and of evolving spammer behavior and linguistic features, with the goal of devising an efficient model for spam detection. The primary objective was to identify the most effective subset of features and patterns for spam detection. Spammer behavior features and linguistic features often exhibit complex relationships that influence the nature of spam reviews. A unified representation of features is another challenge in spam detection. Various deep learning approaches have been proposed for spam detection and classification, but these methods specialize in extracting features and fail to capture dependencies between features effectively; comprehensive models that integrate linguistic and behavioral features to improve spam detection accuracy are lacking. The proposed spam detection framework, SD-FSL-CLSTM, fuses spammer behavior features and linguistic features to automatically detect and classify spam reviews. Fusion enables the model to learn the interactions between features during training, allowing it to capture complex relationships and make predictions based on both types of features. The SD-FSL-CLSTM framework shows promising results, obtaining a minimum accuracy of 97%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Overview
The Bangla Multilabel Cyberbully, Sexual Harassment, Threat, and Spam Detection Dataset is designed to facilitate the development of machine learning models to detect and classify various types of abusive content in Bangla social media text. This dataset contains a collection of comments annotated for multiple types of abuse, making it suitable for multilabel classification tasks. It aims to support research and development in natural language processing (NLP) to enhance online safety and moderate harmful content on Bangla language social media platforms.
Purpose
1. Train and evaluate machine learning models for detection of cyberbullying, sexual harassment, religious hate speech, threats, and spam in Bangla comments.
2. Support research in NLP and machine learning focused on Bangla, a low-resource language.
3. Aid in developing automated moderation systems for social media platforms to ensure safe and respectful communication.
Data Collection
Initially, we collected around 30,000 comments from social media platforms like Facebook and TikTok. These comments were in Bangla, English, and Banglish (Bangla written using English characters). Since our research focuses on Bangla abusive text detection, we refined the dataset through the following steps:
After these steps, we obtained a final dataset of 12,557 comments. Each comment was manually labeled against five classes: bully, sexual, religious, threat, and spam. The dataset supports multilabel annotation, meaning a comment can belong to more than one class simultaneously.
Dataset Columns
1. Gender: Indicates the gender of the person who received the bullying.
2. Profession: Indicates the profession of the person who received the bullying.
3. Comment: Contains the text of the comment in Bangla.
4. Bully: Binary label indicating whether the comment contains bullying content. (0 for no, 1 for yes)
5. Sexual: Binary label indicating whether the comment contains sexual harassment content. (0 for no, 1 for yes)
6. Religious: Binary label indicating whether the comment contains religious hate speech. (0 for no, 1 for yes)
7. Threat: Binary label indicating whether the comment contains threats. (0 for no, 1 for yes)
8. Spam: Binary label indicating whether the comment is considered spam. (0 for no, 1 for yes)
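The five binary label columns described above can be turned into one target vector per comment for multilabel training. The sketch below assumes rows have already been read into dictionaries keyed by the column names; the row values shown (including the Gender and Profession entries and the comment placeholders) are hypothetical and not taken from the dataset.

```python
# Label columns, in the order given in the column description.
LABEL_COLUMNS = ["Bully", "Sexual", "Religious", "Threat", "Spam"]

# Hypothetical rows; the comment text is a placeholder, not real data.
rows = [
    {"Gender": "female", "Profession": "actor", "Comment": "<comment 1>",
     "Bully": "1", "Sexual": "0", "Religious": "0", "Threat": "1", "Spam": "0"},
    {"Gender": "male", "Profession": "singer", "Comment": "<comment 2>",
     "Bully": "0", "Sexual": "0", "Religious": "0", "Threat": "0", "Spam": "1"},
]


def label_vector(row):
    """Return the 0/1 labels of a row as a list ordered like LABEL_COLUMNS."""
    return [int(row[name]) for name in LABEL_COLUMNS]


def active_labels(row):
    """Return the names of all classes that are set to 1 for a row."""
    return [name for name in LABEL_COLUMNS if int(row[name]) == 1]


targets = [label_vector(row) for row in rows]
print(targets)
print(active_labels(rows[0]))
```

Because a comment may carry several labels at once, each position in the vector is predicted independently rather than choosing a single class.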
Applications
1. Training and testing machine learning models for multilabel classification.
2. Research on natural language processing (NLP) and cyberbullying detection in low-resource languages like Bangla.
3. Developing automated systems for monitoring and moderating online content on social media platforms to ensure safe and respectful communication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains an image database (18,981 images) that could be used to train a deep learning model to accurately detect characters. We have successfully used it to create a model that identifies characters encoded using LeetSpeak. The original dataset can be found in the Mondragon Unibertsitatea Repository -- https://gitlab.danz.eus/datasharing/ski4spam
The training dataset consists of:
- Alphabetic letters (a-z) written using different fonts and styles (regular, cursive, bold, cursive+bold)
- Handwritten letters: English handwriting from the Chars74k dataset [2] which is available at http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/.
Dataset Card for "turkishSMS-ds"
The dataset consists of Turkish SMS spam and legitimate messages. It was used in the following study:
Uysal, A. K., Gunal, S., Ergin, S., & Gunal, E. S. (2013). The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika, 19(5), 67-72.
This is a submission for Challenge #22 by Desights User
Note: This submission is in REVIEW state and is accessible only to Challenge Reviewers, so you might get errors when you try to download this asset directly from Ocean Market.
Submission Description
Replicated from README.
How to Use This Repository
Main Files
The main submission files are in the home directory:
Discord Community Dynamics - Analysis by Bryce.html: This HTML version is the best file to use. My submission uses Highcharts for interactive charts, so this version will allow limited drilldown options.
Discord Community Dynamics - Analysis by Bryce.pdf: In case there are problems with the HTML version, I have provided this PDF version. It is not interactive and the formatting will be a bit worse.
Discord Community Dynamics - Analysis by Bryce.qmd: This Quarto document can be viewed to understand the code behind the exhibits. The code has been hidden in the other versions to remove complexity and put the focus squarely on results.
Support Files
Various support files were also used for the analysis. These are saved in the support/ folder. Due to limited time, these unfortunately won't be very user-friendly. I also moved them recently and have not refactored them, so they won't run without fixing file location and working directory issues.
Data Files
I have removed the data files to keep the submission file size small.
All the files can be built using support scripts, starting from only the contest dataset "Ocean Discord Data Challenge Dataset.csv". That said, please contact me (superchordate@gmail.com) if you'd like the full repository including the data files.
Data Sources
$OCEAN price and volume information are taken from the www.cryptocurrencychart.com API. External pretrained models used include mrm8488/bert-tiny-finetuned-sms-spam-detection and mshenoda/roberta-spam.
Author
Bryce Chamberlain superchordate@gmail.com https://www.bryce-chamberlain.com
https://www.marketresearchintellect.com/privacy-policy
The size and share of the market is categorized based on Type (On-Premise, Cloud-Based) and Application (Data Analysis and Forecasting, Fraud-Spam Detection, Intelligence and Law Enforcement, Customer Relationship Management (CRM), Others) and geographical regions (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).
https://www.marketresearchintellect.com/privacy-policy
The market size of the Text Analytics Market is categorized based on Type (On-Premise, Cloud-Based) and Application (Data Analysis & Forecasting, Fraud/Spam Detection, Intelligence & Law Enforcement, Customer Relationship Management (CRM), Other) and geographical regions (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).
This report provides insights into the market size and forecasts the value of the market, expressed in USD million, across these defined segments.
https://www.wiseguyreports.com/pages/privacy-policy
| Attribute | Value |
| --- | --- |
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2024 |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2023 | 4.65 (USD Billion) |
| MARKET SIZE 2024 | 5.19 (USD Billion) |
| MARKET SIZE 2032 | 12.5 (USD Billion) |
| SEGMENTS COVERED | Technology, Deployment Type, End User, Application, Regional |
| COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
| KEY MARKET DYNAMICS | rising regulatory compliance demands, increasing user-generated content, enhanced AI moderation technologies, growing concerns over online safety, demand for multilingual support |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Salesforce, Facebook, Verint, Microsoft, Google, Sprinklr, OpenAI, Twitter, IBM, Dynatrace, Clarifai, Cision, Sift, Hootsuite, AWS |
| MARKET FORECAST PERIOD | 2025 - 2032 |
| KEY MARKET OPPORTUNITIES | AI-driven moderation technologies, Increased demand from social media platforms, Expansion in e-commerce content moderation, Rising need for compliance solutions, Growth in multilingual moderation services |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 11.6% (2025 - 2032) |
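The reported growth rate can be checked against the market size figures in the table using the standard compound annual growth rate formula. This is only a sketch: it assumes compounding starts from the 2024 value (5.19) and runs for eight years to the 2032 value (12.5), which the report does not state explicitly, and the function names are illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding start_value at a fixed annual rate."""
    return start_value * (1 + rate) ** years


# Assumption: growth is compounded from the 2024 market size to 2032.
YEARS = 2032 - 2024

implied_rate = cagr(5.19, 12.5, YEARS)
projected_2032 = project(5.19, 0.116, YEARS)

print(f"implied CAGR: {implied_rate:.2%}")
print(f"projected 2032 size: {projected_2032:.2f} USD Billion")
```

Under that assumption, the implied rate is close to the reported 11.6% and the projection is close to the listed 2032 market size, so the table's figures are internally consistent.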