16 datasets found
  1. Malicious URLs Dataset Dataset

    • paperswithcode.com
    Cite
    Malicious URLs Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/malicious-urls-dataset
    Description

    Context: Malicious URLs, or malicious websites, are a very serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of malicious URLs so that a machine learning-based model can be developed to identify malicious URLs and stop them before they infect computer systems or spread through the internet.

    Content: We have collected a dataset of 651,191 URLs, of which 428,103 are benign or safe URLs, 96,457 are defacement URLs, 94,111 are phishing URLs, and 32,520 are malware URLs. Figure 2 depicts their distribution as percentages. Curating the dataset is one of the most crucial tasks in a machine learning project; we have curated this dataset from five different sources.

    To collect benign, phishing, malware, and defacement URLs, we used the URL dataset (ISCX-URL-2016). To increase the number of phishing and malware URLs, we used the Malware domain black list dataset. We increased the number of benign URLs using the faizan git repo. Finally, we added more phishing URLs using the Phishtank dataset and the PhishStorm dataset. Because the dataset comes from different sources, we first collected the URLs from each source into a separate data frame and then merged them, retaining only the URLs and their class type.
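    The class counts quoted in this description can be turned into the percentage distribution it mentions. The snippet below is only a small arithmetic sketch: the counts are copied from the description, while the variable names are illustrative and not part of the dataset.

```python
# Class counts as quoted in the dataset description.
counts = {
    "benign": 428103,
    "defacement": 96457,
    "phishing": 94111,
    "malware": 32520,
}

total = sum(counts.values())

# Share of each class as a percentage of all URLs, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<11}{counts[label]:>8}  {pct:6.2f}%")
print(f"{'total':<11}{total:>8}")
```

    The four counts sum to the stated total of 651,191, and roughly two thirds of the URLs are benign, so a classifier trained on this data will likely need some handling of class imbalance.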

  2. Facebook Spam Dataset

    • kaggle.com
    Updated Apr 11, 2021
    Cite
    Khaja Hussain SK (2021). Facebook Spam Dataset [Dataset]. https://www.kaggle.com/khajahussainsk/facebook-spam-dataset/activity
    Dataset updated
    Apr 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Khaja Hussain SK
    Description

    Context: A collection of Facebook spam/legit profile and content-based data. It can be used for classification tasks.

    Content: The dataset can be used for building machine learning models. The data was collected from public profiles using the Facebook API and the Facebook Graph API. There are 500 legit profiles and 100 spam profiles. The features, together with a Label (0 = legit, 1 = spam), are:
    1. Number of friends
    2. Number of followings
    3. Number of Community
    4. The age of the user account (in days)
    5. Total number of posts shared
    6. Total number of URLs shared
    7. Total number of photos/videos shared
    8. Fraction of the posts containing URLs
    9. Fraction of the posts containing photos/videos
    10. Average number of comments per post
    11. Average number of likes per post
    12. Average number of tags in a post (rate of tagging)
    13. Average number of hashtags present in a post

    Inspiration: The dataset helps the community understand how these features can distinguish legitimate Facebook users from spam users.

  3. Google Ranked URLs Dataset Dataset

    • paperswithcode.com
    Updated Oct 21, 2024
    Cite
    Fardin Rastakhiz; Mahdi Eftekhari; Sahar Vahdati (2024). Google Ranked URLs Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/google-ranked-urls-dataset
    Dataset updated
    Oct 21, 2024
    Authors
    Fardin Rastakhiz; Mahdi Eftekhari; Sahar Vahdati
    Description

    This dataset was curated for Search Engine Optimization (SEO) analysis tasks, including categorization and spam detection. It covers 12 diverse topics: basketball, books, cats, gardening, history, movies, music, recipes, sports, technology, travel, and weather. Some topics have hierarchical relationships, such as sports and basketball, while others are closely related (e.g., movies and music) or unrelated (e.g., basketball and gardening), with varying degrees of overlap among them. For each topic, approximately 300 search queries were generated using large language models (LLMs) like GPT, Llama, and Claude. The top 10 URLs from the Google Search Console’s search engine results page (SERP) were retrieved for each query.

  4. Indonesian Email Spam

    • kaggle.com
    Updated Jan 15, 2024
    Cite
    gevabriel (2024). Indonesian Email Spam [Dataset]. https://www.kaggle.com/datasets/gevabriel/indonesian-email-spam/suggestions
    Dataset updated
    Jan 15, 2024
    Dataset provided by
    Kaggle
    Authors
    gevabriel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was forked from https://www.kaggle.com/datasets/mfaisalqureshi/spam-email/ and translated into Indonesian by me. It contains 2,620 messages: 1,362 spam messages and 1,258 non-spam (ham) messages [52%:48%].

  5. Phishing websites Data

    • kaggle.com
    Updated Aug 31, 2020
    Cite
    Aman Nagariya (2020). Phishing websites Data [Dataset]. https://www.kaggle.com/aman9d/phishing-data/code
    Dataset updated
    Aug 31, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Nagariya
    Description

    Domain: The URL itself.
    Ranking: Page ranking.
    isIp: Whether the link contains an IP address.
    valid: Fetched from Google's whois API; indicates the current status of the URL's registration.
    activeDuration: Also from the whois API; the time elapsed from registration until now.
    urlLen: The length of the URL.
    is@: 1 if the link contains an '@' character.
    isredirect: 1 if multiple dashes appear together; a link with double dashes may be a redirect.
    haveDash: Whether the domain name contains any dashes.
    domainLen: The length of the domain name alone.
    noOfSubdomain: The number of subdomains present in the URL.
    Labels: 0 -> legitimate website, 1 -> phishing/spam link.
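    Several of the fields above are purely lexical and can be derived from the URL string alone. The following is a minimal sketch of how that might look, not the dataset author's extraction code. The function name `url_features` is hypothetical, the subdomain count assumes that everything before the last two dot-separated labels is a subdomain, and `isredirect` follows the description literally (two dashes in a row). Ranking, valid, and activeDuration need external lookups and are left out.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute the lexical fields from the description for one URL."""
    # urlsplit only fills in the host when a scheme is present,
    # so a default scheme is added for bare domains (an assumption).
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # isIp: does the host parse as an IPv4 or IPv6 address?
    try:
        ipaddress.ip_address(host)
        is_ip = 1
    except ValueError:
        is_ip = 0

    # Assumption: labels before the last two (e.g. "example.com") are subdomains.
    labels = host.split(".") if host and not is_ip else []

    return {
        "isIp": is_ip,
        "urlLen": len(url),
        "is@": int("@" in url),
        "isredirect": int("--" in url),
        "haveDash": int("-" in host),
        "domainLen": len(host),
        "noOfSubdomain": max(len(labels) - 2, 0),
    }


print(url_features("http://secure-login.paypal.example.com/a--b"))
```

    Counting subdomains this way miscounts hosts under multi-part public suffixes such as `co.uk`; a public-suffix list would be needed to handle those correctly.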

  6. URL Shortener Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 22, 2025
    Cite
    Archive Market Research (2025). URL Shortener Software Report [Dataset]. https://www.archivemarketresearch.com/reports/url-shortener-software-43511
    Available download formats: pdf, ppt, doc
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global URL Shortener Software market size was valued at USD XXX million in 2025 and is projected to reach USD XXX million by 2033, exhibiting a CAGR of XX% during the forecast period from 2025 to 2033. The market growth is primarily driven by the increasing need to track and manage links effectively. With the rise of social media and email marketing, businesses require tools to shorten long and complex URLs while maintaining their functionality and trackability. The cloud-based deployment model is gaining popularity due to its scalability, cost-effectiveness, and ease of access.

    Major trends shaping the market include the adoption of AI-powered features to enhance link analysis and optimization. AI-driven URL shorteners can automatically tag and categorize links, identify spam or malicious URLs, and provide advanced analytics to improve campaign performance. Additionally, the integration of URL shortening capabilities within social media platforms and content management systems is expected to further drive market growth. Key players in the market include Hootsuite, Twitter, Bitly, and Rebrandly, among others. The market is expected to witness increased competition as new entrants emerge offering innovative features and competitive pricing.

    URL shortener software has emerged as a crucial tool in the digital age, enabling users to condense lengthy website addresses into manageable and shareable formats. This report provides in-depth insights into this software, analyzing market dynamics, key trends, and industry leaders.

  7. domains@spam.com - Reverse Whois Lookup

    • whoisdatacenter.com
    csv
    Updated Feb 2, 2018
    Cite
    AllHeart Web Inc (2018). domains@spam.com - Reverse Whois Lookup [Dataset]. https://whoisdatacenter.com/email/domains@spam.com/
    Available download formats: csv
    Dataset updated
    Feb 2, 2018
    Dataset provided by
    AllHeart Web
    Authors
    AllHeart Web Inc
    License

    https://whoisdatacenter.com/terms-of-use/

    Time period covered
    Mar 15, 1985 - Jul 14, 2025
    Description

    Explore historical ownership and registration records by performing a reverse Whois lookup for the email address domains@spam.com.

  8. මාධ්‍යවිකි:Spam-blacklist (MediaWiki:Spam-blacklist)

    • wiki-data.si-lk.nina.az
    Updated Jun 27, 2024
    Cite
    (2024). මාධ්‍යවිකි:Spam-blacklist [Dataset]. https://www.wiki-data.si-lk.nina.az/%E0%B6%B8%E0%B7%8F%E0%B6%B0%E0%B7%8A%E2%80%8D%E0%B6%BA%E0%B7%80%E0%B7%92%E0%B6%9A%E0%B7%92:Spam-blacklist.html
    Dataset updated
    Jun 27, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    External URLs matching this list will be blocked when added to a page. This list affects only this wiki; refer also to the

  9. Average results by country

    • getresponse.com
    Updated Apr 5, 2017
    Cite
    GetResponse (2017). Average results by country [Dataset]. https://www.getresponse.com/resources/reports/email-marketing-benchmarks
    Dataset updated
    Apr 5, 2017
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What are the average email marketing results in different countries? Here’s what we’ve found.

  10. Average results by industry

    • getresponse.com
    Updated Apr 5, 2017
    Cite
    GetResponse (2017). Average results by industry [Dataset]. https://www.getresponse.com/resources/reports/email-marketing-benchmarks
    Dataset updated
    Apr 5, 2017
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here, we’ve gathered email marketing benchmarks by industry. You can see how your average email open, click-through, click-to-open, unsubscribe, and spam complaint rates compare against other companies in your industry.

  11. Number of autoresponders in a cycle

    • getresponse.com
    Updated Apr 5, 2017
    Cite
    GetResponse (2017). Number of autoresponders in a cycle [Dataset]. https://www.getresponse.com/resources/reports/email-marketing-benchmarks
    Dataset updated
    Apr 5, 2017
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How many emails should you put into your autoresponder cycle? We’ve analyzed how the average engagement metrics change depending on the number of emails our customers used in their autoresponder cycles.

  12. Enron Fraud Email Dataset

    • kaggle.com
    Updated Dec 28, 2023
    Cite
    Advaith S Rao (2023). Enron Fraud Email Dataset [Dataset]. https://www.kaggle.com/datasets/advaithsrao/enron-fraud-email-dataset/versions/1
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Advaith S Rao
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.

    In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.

    Though the Enron Email Dataset contains over 500K emails, one of its problems is the limited availability of labeled fraud emails. Label annotation is done so that a broad range of fraud emails can be detected accurately. Since fraud emails fall into several types, such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, multiple heuristics have to be used to label all types of fraudulent emails effectively.

    To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.

    To perform fraud annotation on the Enron dataset and to provide more fraud examples for modeling, two more fraud data sources have been used:
    Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
    Social Engineering Dataset: http://aclweb.org/aclwiki

    Label Annotation

    To label the Enron email dataset, two signals are used to filter suspicious emails and label them into fraud and non-fraud classes:
    1. Automated ML labeling
    2. Email signals

    Automated ML Labeling

    The following heuristics are used to annotate labels for the Enron email data using the other two data sources:

    Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.

    Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.

    The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
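    To make the embedding step concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It is not the annotators' code: the function name `tfidf` is hypothetical, tokenization is a plain whitespace split, and the weighting is the textbook form (relative term frequency times log(N / document frequency)), which is an assumption. Libraries such as scikit-learn use smoothed variants by default. The SVM classification step is not shown.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


vectors = tfidf(["win money now", "meeting now"])
print(vectors[0])
```

    A term that appears in every document, such as "now" in this toy input, gets weight zero, so only terms that separate documents from one another contribute to the vector handed to the classifier.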

    If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.

    Email Signals

    Email signal-based heuristics are used to filter and target suspicious emails specifically for fraud labeling. The signals used were:

    Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.

    Suspicious Folders: The Enron data is dumped into several folders for every employee, such as inbox, deleted_items, junk, and calendar. A set of folders with a higher chance of containing fraud emails, such as Deleted Items and Junk, is treated as suspicious.

    Sender Type: The sender type was categorized as ‘Internal’ or ‘External’ based on the sender's email address.

    Low Communication: A threshold of 4 emails, based on the table below, was used to define Low Communication. A user qualifies as a Low-Comm sender if their number of emails is below this threshold. Mails sent from Low-Comm senders have been assigned a high probability of being fraud.

    Contains Replies and Forwards: If an email contains forwards or replies, it was assigned a low probability of being a fraud email.
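    The five signals above could be computed per email roughly as follows. This is a sketch under stated assumptions rather than the annotation pipeline itself: the `Email` record, the `email_signals` function, the `enron.com` internal domain, and the subject-prefix test for replies and forwards are all hypothetical. Only the threshold of 4 and the two folder names come from the description. How the signals are combined into a final label is not specified in the description, so the sketch stops at the individual flags.

```python
from dataclasses import dataclass

LOW_COMM_THRESHOLD = 4                          # threshold quoted in the description
SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}  # folders named in the description
INTERNAL_DOMAIN = "enron.com"                   # assumption, not from the description


@dataclass
class Email:
    mailbox_owner: str       # address of the employee whose mailbox holds the mail
    sender: str              # sender address
    folder: str              # folder the mail was found in
    subject: str
    sender_email_count: int  # total emails seen from this sender


def email_signals(mail: Email, persons_of_interest: set[str]) -> dict[str, bool]:
    """Evaluate each signal; persons_of_interest holds lowercase addresses."""
    subject = mail.subject.lower().lstrip()
    return {
        "person_of_interest": mail.mailbox_owner.lower() in persons_of_interest,
        "suspicious_folder": mail.folder.lower() in SUSPICIOUS_FOLDERS,
        "external_sender": not mail.sender.lower().endswith("@" + INTERNAL_DOMAIN),
        "low_comm_sender": mail.sender_email_count < LOW_COMM_THRESHOLD,
        # Assumption: replies and forwards are detected from the subject prefix.
        "reply_or_forward": subject.startswith(("re:", "fw:", "fwd:")),
    }


mail = Email("owner@enron.com", "promo@example.org", "Junk", "Claim your prize", 2)
print(email_signals(mail, {"owner@enron.com"}))
```

    Keeping each signal as a separate boolean leaves the weighting open, which matches the description: some signals raise the fraud probability (low communication) while others lower it (replies and forwards).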

    Manual Inspection

    To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.

    Dataset Breakdown

    Fraud    Non-Fraud
    2327     445090

    Citations

    Enron Dataset. Title: Enron Email Dataset. Authors: Leslie Kaelbling, William W. Cohen. Year: 2015. Publisher: MIT, CMU. URL: https://www.cs.cmu.edu/~enron/

    Phishing Email Detection Dataset. Title: Phishing Email Detection. Author: Subhadeep Chakraborty. Year: 2023. Publisher: Kaggle. DOI: 10.34740/KAGGLE/DSV/6090437. URL: https://www.kaggle.com/dsv/6090437

    CLAIR Fraud Email Collection. Title: CLAIR collection of fraud email. Author: Radev, D. Year: 2008. URL: http://aclweb.org/aclwiki

  13. Average results by industry (Promedio de los resultados por sector)

    • getresponse.com
    Updated Apr 2, 2024
    Cite
    GetResponse (2024). Promedio de los resultados por sector [Dataset]. https://www.getresponse.com/es/recursos/reports/benchmark-de-email-marketing
    Dataset updated
    Apr 2, 2024
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here, we have compiled email marketing benchmarks by industry. You will see how your open, CTR, CTOR, unsubscribe, and spam complaint rates compare with those of other companies in your market.

  14. Number of autoresponders in a cycle (Número de autoresponders em um ciclo)

    • getresponse.com
    Updated Dec 21, 2023
    Cite
    GetResponse (2023). Número de autoresponders em um ciclo [Dataset]. https://www.getresponse.com/pt/resources/reports/benchmark-de-email-marketing
    Dataset updated
    Dec 21, 2023
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How many emails should you put in an automated sequence? We investigated how engagement metrics change depending on the number of messages our customers used in their autoresponder cycles.

  15. Average results by industry (Durchschnittliche Ergebnisse nach Branche)

    • getresponse.com
    Updated Jun 21, 2023
    Cite
    GetResponse (2023). Durchschnittliche Ergebnisse nach Branche [Dataset]. https://www.getresponse.com/de/resources/reports/email-marketing-benchmarks
    Dataset updated
    Jun 21, 2023
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we have compiled email marketing benchmarks by industry. You can see how your average email open, click-through, click-to-open, unsubscribe, and spam complaint rates compare with those of other companies in your industry.

  16. Number of autoresponders in a cycle (Número de autoresponders en un ciclo)

    • getresponse.com
    Updated Apr 2, 2024
    Cite
    GetResponse (2024). Número de autoresponders en un ciclo [Dataset]. https://www.getresponse.com/es/recursos/reports/benchmark-de-email-marketing
    Dataset updated
    Apr 2, 2024
    Dataset authored and provided by
    GetResponse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How many emails should you include in an automated sequence? We investigated how engagement results change depending on the number of messages our customers put in their autoresponder cycles.
