15 datasets found
  1. SMS Spam Collection Dataset

    • kaggle.com
    • opendatalab.com
    zip
    Updated Dec 2, 2016
    + more versions
    Cite
    UCI Machine Learning (2016). SMS Spam Collection Dataset [Dataset]. https://www.kaggle.com/uciml/sms-spam-collection-dataset
    Explore at:
    Available download formats: zip (215934 bytes)
    Dataset updated
    Dec 2, 2016
    Dataset authored and provided by
    UCI Machine Learning
    Description

    Context

    The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

    Content

    The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
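
As a rough illustration of the two-column layout described above, the sketch below reads label/text rows with Python's standard csv module. The sample rows are invented for illustration and are not taken from the corpus. The column names v1 and v2 follow the description, while the comma-delimited layout with a header row is an assumption about how the file is stored, and read_messages is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample in the described layout: v1 = label, v2 = raw text.
# These rows are invented for illustration, not copied from the corpus.
SAMPLE = io.StringIO(
    "v1,v2\n"
    'ham,"See you at lunch, ok?"\n'
    'spam,"You have won a prize! Reply now"\n'
    "ham,Running late\n"
)


def read_messages(handle):
    """Return a list of (label, text) pairs, skipping rows with unknown labels."""
    rows = []
    for record in csv.DictReader(handle):
        label = record["v1"].strip().lower()
        if label in {"ham", "spam"}:
            rows.append((label, record["v2"]))
    return rows


messages = read_messages(SAMPLE)
spam_count = sum(1 for label, _ in messages if label == "spam")
print(len(messages), spam_count)
```

For the real download, the same function could be given an opened file handle instead of the in-memory sample; the encoding and any extra columns would need to be checked against the actual file.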

    This corpus has been collected from free or free-for-research sources on the Internet:

    -> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Identifying the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
    -> A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link].
    -> A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web Link].
    -> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link].

    This corpus has been used in the following academic research:
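
The per-source counts listed above can be checked against the stated total of 5,574 messages. The short sketch below only restates the numbers from the description; the dictionary keys are informal labels chosen here, not names used by the dataset.

```python
# Source counts as stated in the description above.
sources = {
    "Grumbletext spam": 425,
    "NUS SMS Corpus ham": 3375,
    "Caroline Tag's PhD thesis ham": 450,
    "SMS Spam Corpus v.0.1 Big ham": 1002,
    "SMS Spam Corpus v.0.1 Big spam": 322,
}

total = sum(sources.values())
# Keys ending in "spam" mark the spam sources; everything else is ham.
spam_total = sum(n for name, n in sources.items() if name.endswith("spam"))
ham_total = total - spam_total

print(total, ham_total, spam_total)
print(f"spam share: {spam_total / total:.1%}")
```

The sum matches the 5,574 messages quoted in the Context paragraph, and spam makes up a little over 13% of the corpus, so the classes are clearly imbalanced.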

    Acknowledgements

    The original dataset can be found here. The creators ask that, if you find the dataset useful, you reference the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

    We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

    Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

    Inspiration

    • Can you use this dataset to build a prediction model that will accurately classify which texts are spam?
  2. scam-detection-data

    • huggingface.co
    Updated Mar 25, 2025
    Cite
    Sherwin Larsen Alva (2025). scam-detection-data [Dataset]. https://huggingface.co/datasets/SparkyPilot/scam-detection-data
    Explore at:
    Dataset updated
    Mar 25, 2025
    Authors
    Sherwin Larsen Alva
    Description

    Using the Dataset

    The dataset used for training and evaluation is available here. You can load it using the datasets library:

        from datasets import load_dataset

        # Load the dataset
        dataset = load_dataset("SparkyPilot/scam-detection-data")

        # Explore the dataset
        print(dataset["train"][0])  # Print the first example in the training set

    Links to the datasets taken from different sources are listed below:

    spam.csv [https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]… See the full description on the dataset page: https://huggingface.co/datasets/SparkyPilot/scam-detection-data.

  3. SMS Firewall Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 7, 2025
    Cite
    Market Research Forecast (2025). SMS Firewall Report [Dataset]. https://www.marketresearchforecast.com/reports/sms-firewall-29332
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 7, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The SMS Firewall market, valued at $3,602.1 million in 2025, is experiencing robust growth driven by increasing concerns over SMS-based threats like spam, phishing, and malware. The rising adoption of mobile banking and e-commerce fuels demand for robust security solutions, making SMS firewalls a critical component of overall cybersecurity strategies. Key application segments include BFSI (Banking, Financial Services, and Insurance), where secure transactions are paramount, and the burgeoning entertainment and retail sectors, reliant on SMS-based communications for promotions and customer engagement. The market's segmentation also encompasses A2P (Application-to-Person) and P2A (Person-to-Application) messaging, reflecting the diverse ways businesses and individuals utilize SMS. Technological advancements, such as AI-powered threat detection and improved filtering techniques, further enhance the effectiveness of SMS firewalls and contribute to market expansion.

    Geographic growth is expected to be diverse, with North America and Europe holding significant market share initially due to high technological adoption and stringent regulatory frameworks. However, rapid digitalization in Asia-Pacific and the Middle East & Africa presents substantial growth opportunities in the coming years. Competition in the market is intense, with established players like Tata Communications and Sinch vying with newer entrants for market share. This competitive landscape fosters innovation and drives down prices, making SMS firewall solutions increasingly accessible to a broader range of businesses and organizations.

    The forecast period (2025-2033) anticipates continued market expansion, fuelled by evolving threats and increased regulatory scrutiny. Factors such as the increasing sophistication of malicious SMS campaigns, the rise of 5G technology (which may increase SMS vulnerabilities), and evolving privacy regulations will continue to shape market dynamics.

    While the precise CAGR is unavailable, a conservative estimate considering industry growth trends and the inherent need for robust security in an increasingly digital world would place the CAGR in the range of 12-15% annually. This growth projection reflects not only the increasing demand for SMS firewalls but also the ongoing development of more sophisticated solutions capable of countering increasingly complex threats. The market is expected to see significant consolidation, with larger players acquiring smaller firms to expand their product portfolios and geographic reach.
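
To make the quoted growth range concrete, the sketch below compounds the stated 2025 value forward at 12% and 15% per year over the 2025-2033 forecast period. It assumes a constant annual rate and eight compounding years, which the description does not spell out; project is a hypothetical helper name, and the result is only an arithmetic illustration, not a figure from the report.

```python
def project(value, cagr, years):
    """Compound a starting value forward at a constant annual growth rate."""
    return value * (1.0 + cagr) ** years


BASE_2025_MUSD = 3602.1  # 2025 market value stated in the description (million USD)
YEARS = 2033 - 2025      # assumed number of compounding years in the forecast period

low = project(BASE_2025_MUSD, 0.12, YEARS)
high = project(BASE_2025_MUSD, 0.15, YEARS)
print(f"2033 estimate: {low:,.0f} to {high:,.0f} million USD")
```

Under these assumptions the market would roughly double or triple by 2033, which shows how sensitive the endpoint is to a three-point difference in the assumed rate.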

  4. Spam Dataset

    • universe.roboflow.com
    zip
    Updated Oct 18, 2024
    + more versions
    Cite
    spam (2024). Spam Dataset [Dataset]. https://universe.roboflow.com/spam-jwkhh/spam-qttjo/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 18, 2024
    Dataset authored and provided by
    spam
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    Spam

    ## Overview
    
    Spam is a dataset for object detection tasks - it contains Text annotations for 300 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [MIT license](https://creativecommons.org/licenses/MIT).
    
  5. Spammer behavior used in literature.

    • plos.figshare.com
    xls
    Updated Feb 6, 2025
    + more versions
    Cite
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed (2025). Spammer behavior used in literature. [Dataset]. http://doi.org/10.1371/journal.pone.0313628.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The diverse fake-text generation practices of spammers make spam detection challenging. Existing works use manually designed discrete textual or behavioral features, which cannot capture the complex global semantics of text and reviews. Some studies use a limited set of features while neglecting other significant ones. With a large feature set, however, selecting all features leads to model overfitting and expensive computation. This research paper addresses challenges in feature selection and in evolving spammer behavior and linguistic features, with the goal of devising an efficient model for spam detection. Its primary objective was to identify the most effective subset of features and patterns for spam detection. Spammer behavior features and linguistic features often exhibit complex relationships that influence the nature of spam reviews. A unified representation of features is another challenge in spam detection. Various deep learning approaches have been proposed for spam detection and classification, but these methods specialize in extracting features and fail to capture dependencies between features effectively; comprehensive models that integrate linguistic and behavioral features to improve spam detection accuracy are lacking. The proposed spam detection framework, SD-FSL-CLSTM, fuses spammer behavior features and linguistic features to automatically detect and classify spam reviews. Fusion enables the model to learn the interactions between the features during training, allowing it to capture complex relationships and make predictions based on both types of features. The SD-FSL-CLSTM framework shows promising results, obtaining a minimum accuracy of 97%.

  6. Bangla Multilabel Cyberbully, Sexual Harrasment, Threat and Spam Detection...

    • data.mendeley.com
    Updated Jul 16, 2024
    + more versions
    Cite
    Saieef Sunny (2024). Bangla Multilabel Cyberbully, Sexual Harrasment, Threat and Spam Detection Dataset [Dataset]. http://doi.org/10.17632/sz5558wrd4.3
    Explore at:
    Dataset updated
    Jul 16, 2024
    Authors
    Saieef Sunny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    The Bangla Multilabel Cyberbully, Sexual Harassment, Threat, and Spam Detection Dataset is designed to facilitate the development of machine learning models to detect and classify various types of abusive content in Bangla social media text. This dataset contains a collection of comments annotated for multiple types of abuse, making it suitable for multilabel classification tasks. It aims to support research and development in natural language processing (NLP) to enhance online safety and moderate harmful content on Bangla-language social media platforms.

    Purpose

    1. Train and evaluate machine learning models for detection of cyberbullying, sexual harassment, religious hate speech, threats, and spam in Bangla comments.
    2. Support research in NLP and machine learning focused on Bangla, a low-resource language.
    3. Aid in developing automated moderation systems for social media platforms to ensure safe and respectful communication.

    Data Collection

    Initially, we collected around 30,000 comments from social media platforms like Facebook and TikTok. These comments were in Bangla, English, and Banglish (Bangla written using English characters). Since our research focuses on Bangla abusive text detection, we refined the dataset through the following steps:

    1. We filtered out all comments written in English to focus on the Bangla text.
    2. To ensure data quality, we eliminated duplicate entries and rows with missing or null values.
    3. We removed any remaining English characters and both Bangla and English numerical values to ensure the analysis was based solely on Bangla text.

    After these steps, we obtained a final dataset of 12,557 comments. Each comment was manually labeled against five classes: bully, sexual, religious, threat, and spam. The dataset uses multi-label annotation, meaning a comment can simultaneously belong to more than one class.

    Dataset Columns

    1. Gender: Indicates the gender of the person who received the bullying.
    2. Profession: Indicates the profession of the person who received the bullying.
    3. Comment: Contains the text of the comment in Bangla.
    4. Bully: Binary label indicating whether the comment contains bullying content. (0 for no, 1 for yes)
    5. Sexual: Binary label indicating whether the comment contains sexual harassment content. (0 for no, 1 for yes)
    6. Religious: Binary label indicating whether the comment contains religious hate speech. (0 for no, 1 for yes)
    7. Threat: Binary label indicating whether the comment contains threats. (0 for no, 1 for yes)
    8. Spam: Binary label indicating whether the comment is considered spam. (0 for no, 1 for yes)

    Applications

    1. Training and testing machine learning models for multilabel classification.
    2. Research on natural language processing (NLP) and cyberbullying detection in low-resource languages like Bangla.
    3. Developing automated systems for monitoring and moderating online content on social media platforms to ensure safe and respectful communication.
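
The column layout described for this dataset, with one binary column per class, could be read along the following lines. This is a minimal sketch: the sample rows, including the Gender and Profession values and the placeholder comment text, are invented for illustration, the comma-separated file layout is an assumption, and active_labels is a hypothetical helper name.

```python
import csv
import io

LABELS = ["Bully", "Sexual", "Religious", "Threat", "Spam"]

# Invented placeholder rows in the described column layout (not real data).
SAMPLE = io.StringIO(
    "Gender,Profession,Comment,Bully,Sexual,Religious,Threat,Spam\n"
    "Female,Singer,<comment 1>,1,0,0,1,0\n"
    "Male,Actor,<comment 2>,0,0,0,0,1\n"
    "Male,Politician,<comment 3>,0,0,0,0,0\n"
)


def active_labels(record):
    """Return the names of all labels set to 1 for one CSV record."""
    return [name for name in LABELS if record[name].strip() == "1"]


rows = list(csv.DictReader(SAMPLE))
per_row = [active_labels(r) for r in rows]
# A comment with more than one active label is what makes the task multilabel.
multi = sum(1 for labels in per_row if len(labels) > 1)
print(per_row, multi)
```

Keeping the labels as independent 0/1 columns means a comment can carry any combination of classes, including none, which a single categorical column could not express.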

  7. Data from: Image dataset to train a deep learning model to decode Leetspeak...

    • zenodo.org
    • research.science.eus
    • +2 more
    zip
    Updated Mar 22, 2022
    Cite
    Iñaki Velez de Mendizabal; Xabier Vidriales; Vitor Basto Fernandes; Enaitz Ezpeleta; José Ramón Méndez; Urko Zurutuza (2022). Image dataset to train a deep learning model to decode Leetspeak obfuscated characters [Dataset]. http://doi.org/10.5281/zenodo.6373423
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Iñaki Velez de Mendizabal; Xabier Vidriales; Vitor Basto Fernandes; Enaitz Ezpeleta; José Ramón Méndez; Urko Zurutuza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains an image database (18,981 images) that could be used to train a deep learning model to accurately detect characters. We have successfully used it to create a model that identifies characters encoded using LeetSpeak. The original dataset can be found in the Mondragon Unibertsitatea Repository -- https://gitlab.danz.eus/datasharing/ski4spam

    The training dataset consists of:

    - Alphabetic letters (a-z) written using different fonts and styles (regular, cursive, bold, cursive+bold)

    - Handwritten letters: English handwriting from the Chars74k dataset [2] which is available at http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/.

  8. Feature selection using PCA.

    • plos.figshare.com
    xls
    Updated Feb 6, 2025
    + more versions
    Cite
    Feature selection using PCA. [Dataset]. https://plos.figshare.com/articles/dataset/Feature_selection_using_PCA_/28362299
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identical to the description of entry 5 (Spammer behavior used in literature.) above.

  9. turkishSMS-ds

    • huggingface.co
    Updated Apr 6, 2023
    Cite
    Alper Kürşat Uysal (2024). turkishSMS-ds [Dataset]. https://huggingface.co/datasets/akuysal/turkishSMS-ds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2023
    Authors
    Alper Kürşat Uysal
    Description

    Dataset Card for "turkishSMS-ds"

    The dataset was utilized in the following study. It consists of Turkish SMS spam and legitimate data.

    Uysal, A. K., Gunal, S., Ergin, S., & Gunal, E. S. (2013). The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika, 19(5), 67-72.

    More Information needed

  10. Desights: Discord Community Dynamics - Analysis by Bryce

    • market.oceanprotocol.com
    Updated Mar 12, 2024
    + more versions
    Cite
    Desights User (2024). Desights: Discord Community Dynamics - Analysis by Bryce [Dataset]. https://market.oceanprotocol.com/asset/did:op:b0471a0985ff1dba5e4384a87705beb53afa3925310536eec0b86ac8fdde78f8
    Explore at:
    Dataset updated
    Mar 12, 2024
    Dataset authored and provided by
    Desights User
    Description

    This is a submission for Challenge #22 by Desights User

    Click here for Challenge Details

    Note: This submission is in REVIEW state and is only accessible by Challenge Reviewers. So you might get errors when you try to download this asset directly from Ocean Market.

    Submission Description

    Replicated from README.

    How to Use This Repository

    Main Files

    The main submission files are in the home directory:

    Discord Community Dynamics - Analysis by Bryce.html: This HTML version is the best file to use. My submission uses Highcharts for interactive charts, so this version will allow limited drilldown options.

    Discord Community Dynamics - Analysis by Bryce.pdf: In case there are problems with the HTML version, I have provided this PDF version. It is not interactive and the formatting will be a bit worse.

    Discord Community Dynamics - Analysis by Bryce.qmd: This Quarto document can be viewed to understand the code behind the exhibits. The code has been hidden in the other versions to remove complexity and put the focus squarely on results.

    Support Files

    Various support files were also used to do analysis. These are saved in the support/ folder. Due to limited time, these won't be super user-friendly unfortunately. I also moved them recently and have not refactored so they won't run without fixing file location and working directory issues.

    Data Files

    I have removed the data files to keep the submission file size small.

    All the files can be built using support scripts, starting from only the contest dataset "Ocean Discord Data Challenge Dataset.csv". That said, please contact me (superchordate@gmail.com) if you'd like the full repository including the data files.

    Data Sources

    $OCEAN price and volume information are taken from the www.cryptocurrencychart.com API. External pretrained models used include mrm8488/bert-tiny-finetuned-sms-spam-detection and mshenoda/roberta-spam.

    Author

    Bryce Chamberlain superchordate@gmail.com https://www.bryce-chamberlain.com

  11. Feature selection using XGB.

    • plos.figshare.com
    xls
    Updated Feb 6, 2025
    + more versions
    Cite
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed (2025). Feature selection using XGB. [Dataset]. http://doi.org/10.1371/journal.pone.0313628.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identical to the description of entry 5 (Spammer behavior used in literature.) above.

  12. Global Text Analytics (Mining) Software Market Size, Trends and Projections

    • marketresearchintellect.com
    Updated Mar 11, 2025
    + more versions
    Cite
    Market Research Intellect (2025). Global Text Analytics (Mining) Software Market Size, Trends and Projections [Dataset]. https://www.marketresearchintellect.com/product/text-analytics-mining-software-market/
    Explore at:
    Dataset updated
    Mar 11, 2025
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    The size and share of the market is categorized based on Type (On-Premise, Cloud-Based) and Application (Data Analysis and Forecasting, Fraud-Spam Detection, Intelligence and Law Enforcement, Customer Relationship Management (CRM), Others) and geographical regions (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).

  13. Evaluation matrices of the proposed approach.

    • plos.figshare.com
    xls
    Updated Feb 6, 2025
    Cite
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed (2025). Evaluation matrices of the proposed approach. [Dataset]. http://doi.org/10.1371/journal.pone.0313628.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Amna Iqbal; Muhammad Younas; Muhammad Kashif Hanif; Muhammad Murad; Rabia Saleem; Muhammad Aater Javed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identical to the description of entry 5 (Spammer behavior used in literature.) above.

  14. Text Analytics Market Size, Share, Trends, Scope And Forecast

    • marketresearchintellect.com
    Updated Nov 24, 2021
    + more versions
    Cite
    Market Research Intellect® | Market Analysis and Research Reports (2021). Text Analytics Market Size, Share, Trends, Scope And Forecast [Dataset]. https://www.marketresearchintellect.com/product/global-text-analytics-market-size-forecast/
    Explore at:
    Dataset updated
    Nov 24, 2021
    Dataset authored and provided by
    Market Research Intellect® | Market Analysis and Research Reports
    License

    https://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    The market size of the Text Analytics Market is categorized based on Type (On-Premise, Cloud-Based) and Application (Data Analysis & Forecasting, Fraud/Spam Detection, Intelligence & Law Enforcement, Customer Relationship Management (CRM), Other) and geographical regions (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).

    This report provides insights into the market size and forecasts the value of the market, expressed in USD million, across these defined segments.

  15. Global Text Content Moderation Solution Market Research Report: By...

    • wiseguyreports.com
    Updated Mar 21, 2025
    Cite
    Wiseguy Research Consultants Pvt Ltd (2025). Global Text Content Moderation Solution Market Research Report: By Technology (Machine Learning, Natural Language Processing, Artificial Intelligence, Rule-Based Systems), By Deployment Type (Cloud-Based, On-Premise, Hybrid), By End User (Enterprises, Social Media Platforms, E-commerce Platforms, Gaming Platforms), By Application (Content Moderation, Spam Detection, Sentiment Analysis, User-Generated Content Monitoring) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/cn/reports/text-content-moderation-solution-market
    Explore at:
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 4.65 (USD Billion)
    MARKET SIZE 2024: 5.19 (USD Billion)
    MARKET SIZE 2032: 12.5 (USD Billion)
    SEGMENTS COVERED: Technology, Deployment Type, End User, Application, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: rising regulatory compliance demands, increasing user-generated content, enhanced AI moderation technologies, growing concerns over online safety, demand for multilingual support
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Salesforce, Facebook, Verint, Microsoft, Google, Sprinklr, OpenAI, Twitter, IBM, Dynatrace, Clarifai, Cision, Sift, Hootsuite, AWS
    MARKET FORECAST PERIOD: 2025 - 2032
    KEY MARKET OPPORTUNITIES: AI-driven moderation technologies, Increased demand from social media platforms, Expansion in e-commerce content moderation, Rising need for compliance solutions, Growth in multilingual moderation services
    COMPOUND ANNUAL GROWTH RATE (CAGR): 11.6% (2025 - 2032)
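
    The stated growth rate can be checked against the two market-size figures in the listing. The sketch below is only a sanity check: the function name `cagr` is ours, and it assumes growth is compounded over the eight years from the 2024 figure to the 2032 figure, which the listing does not spell out.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the listing: USD 5.19 billion (2024) and USD 12.5 billion (2032).
# Assumption: compounding runs over the 8 years between those two figures.
rate = cagr(5.19, 12.5, 2032 - 2024)
print(f"{rate:.1%}")
```

    Under that assumption the result rounds to the 11.6% quoted in the listing.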

UCI Machine Learning (2016). SMS Spam Collection Dataset [Dataset]. https://www.kaggle.com/uciml/sms-spam-collection-dataset

SMS Spam Collection Dataset

Collection of SMS messages tagged as spam or legitimate

57 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (215934 bytes)
Dataset updated
Dec 2, 2016
Dataset authored and provided by
UCI Machine Learning
Description

Context

The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

Content

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
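
A minimal sketch of reading that two-column layout, assuming the file is a CSV with a `v1,v2` header row. The description does not state the delimiter or encoding, so both are assumptions here, and the two sample rows are invented for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the described layout; not real messages from the corpus.
SAMPLE = 'v1,v2\nham,"See you at lunch, ok?"\nspam,Claim your free prize now\n'


def read_messages(handle):
    """Return (label, text) pairs from a CSV with v1 (label) and v2 (raw text) columns."""
    rows = []
    for row in csv.DictReader(handle):
        label = row["v1"].strip().lower()
        if label not in {"ham", "spam"}:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((label, row["v2"]))
    return rows


messages = read_messages(io.StringIO(SAMPLE))
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To read the downloaded file instead, pass an open file handle in place of the `io.StringIO` object, with whatever encoding the file actually uses.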

This corpus has been collected from sources on the Internet that are free or free for research:

-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Identifying the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link].
-> A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web Link].
-> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link].

This corpus has been used in academic research.
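
The per-source counts above can be added up to confirm that they match the 5,574 messages stated in the Context section. This is a small arithmetic check; the dictionary keys are just labels chosen here for the four sources.

```python
# Message counts per source, as given in the description above.
sources = {
    "Grumbletext": {"spam": 425, "ham": 0},
    "NUS SMS Corpus subset": {"spam": 0, "ham": 3375},
    "Caroline Tag's PhD Thesis": {"spam": 0, "ham": 450},
    "SMS Spam Corpus v.0.1 Big": {"spam": 322, "ham": 1002},
}

spam_total = sum(counts["spam"] for counts in sources.values())
ham_total = sum(counts["ham"] for counts in sources.values())
total = spam_total + ham_total

print(spam_total, ham_total, total)            # 747 4827 5574
print(f"spam share: {spam_total / total:.1%}")  # spam share: 13.4%
```

The classes are imbalanced (roughly one spam message for every six or seven ham messages), which is worth keeping in mind when judging a classifier by accuracy alone.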

Acknowledgements

The original dataset can be found here. The creators ask that, if you find the dataset useful, you reference the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

Inspiration

  • Can you use this dataset to build a prediction model that will accurately classify which texts are spam?
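
One possible starting point for that question is a multinomial naive Bayes classifier with add-one smoothing. The sketch below is not the method from the paper cited above; the class name, tokenizer, and four training messages are all hypothetical and serve only to show the shape of the approach.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, pairs):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.doc_counts = Counter()
        for label, text in pairs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for token in tokenize(text):
                if token in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training data in (label, text) form; not taken from the corpus.
TRAIN = [
    ("ham", "are we still meeting for lunch today"),
    ("ham", "call me when you get home"),
    ("spam", "win a free prize now"),
    ("spam", "free entry claim your prize today"),
]

model = NaiveBayes().fit(TRAIN)
print(model.predict("claim your free prize"))
print(model.predict("see you at lunch"))
```

With the real dataset, the (label, text) pairs would come from the v1 and v2 columns, and part of the data should be held out to measure how well the model generalizes.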