16 datasets found

Malicious URLs Dataset Dataset
paperswithcode.com
Cite
Malicious URLs Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/malicious-urls-dataset
Description
Context: Malicious URLs, or malicious websites, are a serious threat to cybersecurity. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We collected this dataset to provide a large number of examples of malicious URLs, so that a machine-learning model can be developed to identify them and stop them before they infect a computer system or spread through the internet.

Content: We collected a dataset of 651,191 URLs, of which 428,103 are benign or safe URLs, 96,457 are defacement URLs, 94,111 are phishing URLs, and 32,520 are malware URLs. Figure 2 depicts their distribution in terms of percentage. Curating the dataset is one of the most crucial tasks in a machine learning project. We curated this dataset from five different sources.

For collecting benign, phishing, malware, and defacement URLs, we used the URL dataset (ISCX-URL-2016). To increase the number of phishing and malware URLs, we used the Malware domain black list dataset. We increased the number of benign URLs using the faizan git repo. Finally, we added more phishing URLs using the Phishtank dataset and the PhishStorm dataset. Because the dataset comes from different sources, we first collected the URLs from each source into a separate data frame and then merged them, retaining only the URLs and their class type.
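The merge step and the percentage distribution can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the sample rows, the `url`/`type`/`source` field names, and the helper names `merge_sources` and `class_percentages` are all assumptions. Only the four class counts come from the description.

```python
from collections import Counter

# Hypothetical rows standing in for two of the five sources; the real
# column names and contents are not given in the description.
iscx_rows = [
    {"url": "http://example.com/", "type": "benign", "source": "iscx"},
    {"url": "http://example.org/index.php?x=1", "type": "defacement", "source": "iscx"},
]
phishtank_rows = [
    {"url": "http://login.example.net/verify", "type": "phishing", "source": "phishtank"},
]


def merge_sources(*sources):
    """Concatenate all sources, keeping only the URL and its class type."""
    return [(row["url"], row["type"]) for rows in sources for row in rows]


def class_percentages(counts):
    """Convert absolute class counts into percentages of the total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Class counts as reported in the description.
REPORTED_COUNTS = {
    "benign": 428103,
    "defacement": 96457,
    "phishing": 94111,
    "malware": 32520,
}

merged = merge_sources(iscx_rows, phishtank_rows)
sample_counts = Counter(label for _, label in merged)
percentages = class_percentages(REPORTED_COUNTS)

for label, pct in percentages.items():
    print(f"{label:<11}{pct:6.2f}%")
```

The reported counts sum to the stated total of 651,191, and benign URLs make up roughly two thirds of the data, so a classifier trained on it will likely need some handling of class imbalance.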
Facebook Spam Dataset
kaggle.com
Updated Apr 11, 2021
Cite
Khaja Hussain SK (2021). Facebook Spam Dataset [Dataset]. https://www.kaggle.com/khajahussainsk/facebook-spam-dataset/activity
Dataset updated
Apr 11, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Khaja Hussain SK
Description
Context: A collection of Facebook spam and legit profile and content-based data. It can be used for classification tasks.

Content: The dataset can be used for building machine learning models. The data was collected from public profiles using the Facebook API and the Facebook Graph API. There are 500 legit profiles and 100 spam profiles. The features are listed below, along with the Label (0 = legit, 1 = spam).
1. Number of friends
2. Number of followings
3. Number of Community
4. The age of the user account (in days)
5. Total number of posts shared
6. Total number of URLs shared
7. Total number of photos/videos shared
8. Fraction of the posts containing URLs
9. Fraction of the posts containing photos/videos
10. Average number of comments per post
11. Average number of likes per post
12. Average number of tags in a post (Rate of tagging)
13. Average number of hashtags present in a post

Inspiration: The dataset helps the community understand how features can help distinguish legit Facebook users from spam users.
Google Ranked URLs Dataset Dataset
paperswithcode.com
Updated Oct 21, 2024
Cite
Fardin Rastakhiz; Mahdi Eftekhari; Sahar Vahdati (2024). Google Ranked URLs Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/google-ranked-urls-dataset
Dataset updated
Oct 21, 2024
Authors
Fardin Rastakhiz; Mahdi Eftekhari; Sahar Vahdati
Description
This dataset was curated for Search Engine Optimization (SEO) analysis tasks, including categorization and spam detection. It covers 12 diverse topics: basketball, books, cats, gardening, history, movies, music, recipes, sports, technology, travel, and weather. Some topics have hierarchical relationships, such as sports and basketball, while others are closely related (e.g., movies and music) or unrelated (e.g., basketball and gardening), with varying degrees of overlap among them. For each topic, approximately 300 search queries were generated using large language models (LLMs) like GPT, Llama, and Claude. The top 10 URLs from the Google Search Console’s search engine results page (SERP) were retrieved for each query.
Indonesian Email Spam
kaggle.com
Updated Jan 15, 2024
Cite
gevabriel (2024). Indonesian Email Spam [Dataset]. https://www.kaggle.com/datasets/gevabriel/indonesian-email-spam/suggestions
Dataset updated
Jan 15, 2024
Dataset provided by
Kaggle
Authors
gevabriel
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset was forked from https://www.kaggle.com/datasets/mfaisalqureshi/spam-email/ and translated into Indonesian by me. It contains 2,620 messages: 1,362 spam messages and 1,258 non-spam (ham) messages [52%:48%].
Phishing websites Data
kaggle.com
Updated Aug 31, 2020
Cite
Aman Nagariya (2020). Phishing websites Data [Dataset]. https://www.kaggle.com/aman9d/phishing-data/code
Dataset updated
Aug 31, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aman Nagariya
Description
Domain: The URL itself.
Ranking: Page ranking.
isIp: Whether the web link contains an IP address.
valid: Fetched from Google's whois API; indicates the current status of the URL's registration.
activeDuration: Also from the whois API; the time elapsed from registration until now.
urlLen: The length of the URL.
is@: 1 if the link contains an '@' character.
isredirect: 1 if multiple dashes appear together; if the link has double dashes, there is a chance that it is a redirect.
haveDash: Whether there are any dashes in the domain name.
domainLen: The length of just the domain name.
noOfSubdomain: The number of subdomains present in the URL.
Labels: 0 -> Legitimate website, 1 -> Phishing link / spam link
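The purely lexical columns above (those that need no whois or ranking lookup) could be derived from a raw URL roughly as follows. This is a sketch under stated assumptions, not the dataset author's extraction code: the function name `url_features` is hypothetical, and the subdomain count simply treats the last two host labels as the registered domain, which is wrong for suffixes such as `co.uk`.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url):
    """Compute the lexical features described above for a single URL."""
    # urlparse only fills in the host when a scheme separator is present,
    # so prepend "//" to bare inputs like "example.com/path".
    parsed = urlparse(url if "://" in url else "//" + url)
    host = parsed.hostname or ""

    # isIp: the host part is a literal IPv4/IPv6 address.
    try:
        ipaddress.ip_address(host)
        is_ip = 1
    except ValueError:
        is_ip = 0

    # Assumption: the registered domain is the last two labels of the host.
    labels = [part for part in host.split(".") if part]
    subdomains = 0 if is_ip else max(len(labels) - 2, 0)

    return {
        "isIp": is_ip,
        "urlLen": len(url),
        "is@": 1 if "@" in url else 0,
        "isredirect": 1 if "--" in url else 0,  # "double dashes", as described
        "haveDash": 1 if "-" in host else 0,
        "domainLen": len(host),
        "noOfSubdomain": subdomains,
    }


print(url_features("http://secure-login.mail.example.com/a--b@x"))
```

The `Ranking`, `valid`, and `activeDuration` columns are left out because they depend on external services that the description names but does not specify.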
URL Shortener Software Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 22, 2025
Cite
Archive Market Research (2025). URL Shortener Software Report [Dataset]. https://www.archivemarketresearch.com/reports/url-shortener-software-43511
Dataset updated
Feb 22, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global URL Shortener Software market size was valued at USD XXX million in 2025 and is projected to reach USD XXX million by 2033, exhibiting a CAGR of XX% during the forecast period from 2025 to 2033. The market growth is primarily driven by the increasing need to track and manage links effectively. With the rise of social media and email marketing, businesses require tools to shorten long and complex URLs while maintaining their functionality and trackability. The cloud-based deployment model is gaining popularity due to its scalability, cost-effectiveness, and ease of access.

Major trends shaping the market include the adoption of AI-powered features to enhance link analysis and optimization. AI-driven URL shorteners can automatically tag and categorize links, identify spam or malicious URLs, and provide advanced analytics to improve campaign performance. Additionally, the integration of URL shortening capabilities within social media platforms and content management systems is expected to further drive market growth. Key players in the market include Hootsuite, Twitter, Bitly, and Rebrandly, among others. The market is expected to witness increased competition as new entrants emerge offering innovative features and competitive pricing.

URL shortener software has emerged as a crucial tool in the digital age, enabling users to condense lengthy website addresses into manageable and shareable formats. This report provides in-depth insights into this software, analyzing market dynamics, key trends, and industry leaders.
domains@spam.com - Reverse Whois Lookup
whoisdatacenter.com
csv
Updated Feb 2, 2018
Cite
AllHeart Web Inc (2018). domains@spam.com - Reverse Whois Lookup [Dataset]. https://whoisdatacenter.com/email/domains@spam.com/
Dataset updated
Feb 2, 2018
Dataset provided by
AllHeart Web
Authors
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Jul 14, 2025
Description
Explore historical ownership and registration records by performing a reverse Whois lookup for the email address domains@spam.com.
මාධ්‍යවිකි:Spam-blacklist
wiki-data.si-lk.nina.az
Updated Jun 27, 2024
Cite
(2024). මාධ්‍යවිකි:Spam-blacklist [Dataset]. https://www.wiki-data.si-lk.nina.az/%E0%B6%B8%E0%B7%8F%E0%B6%B0%E0%B7%8A%E2%80%8D%E0%B6%BA%E0%B7%80%E0%B7%92%E0%B6%9A%E0%B7%92:Spam-blacklist.html
Dataset updated
Jun 27, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
External URLs matching this list will be blocked when added to a page. This list affects only this wiki; refer also to the
Average results by country
getresponse.com
Updated Apr 5, 2017
Cite
GetResponse (2017). Average results by country [Dataset]. https://www.getresponse.com/resources/reports/email-marketing-benchmarks
Dataset updated
Apr 5, 2017
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
What are the average email marketing results in different countries? Here’s what we’ve found.
Average results by industry
getresponse.com
Updated Apr 5, 2017
Cite
GetResponse (2017). Average results by industry [Dataset]. https://www.getresponse.com/resources/reports/email-marketing-benchmarks
Dataset updated
Apr 5, 2017
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here, we’ve gathered email marketing benchmarks by industry. You can see how your average email open, click-through, click-to-open, unsubscribe, and spam complaint rates compare against other companies in your industry.
Number of autoresponders in a cycle
getresponse.com
Updated Apr 5, 2017
Cite
GetResponse (2017). Number of autoresponders in a cycle [Dataset]. https://www.getresponse.com/resources/reports/email-marketing-benchmarks
Dataset updated
Apr 5, 2017
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How many emails should you put into your autoresponder cycle? We’ve analyzed how the average engagement metrics change depending on the number of emails our customers used in their autoresponder cycles.
Enron Fraud Email Dataset
kaggle.com
Updated Dec 28, 2023
Cite
Advaith S Rao (2023). Enron Fraud Email Dataset [Dataset]. https://www.kaggle.com/datasets/advaithsrao/enron-fraud-email-dataset/versions/1
Dataset updated
Dec 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Advaith S Rao
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.

In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.

Though the Enron Email Dataset contains over 500K emails, one of its problems is the limited availability of labeled fraud emails. Label annotation is done so that a broad umbrella of fraud emails can be detected accurately. Since fraud emails fall into several types, such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, multiple heuristics have to be used to label all types of fraudulent emails effectively.

To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.

To perform fraud annotation on the Enron dataset and to provide more fraud examples for modeling, two more fraud data sources have been used:

Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
Social Engineering Dataset: http://aclweb.org/aclwiki

Label Annotation

To label the Enron email dataset, two signals are used to filter suspicious emails and label them into fraud and non-fraud classes:

Automated ML labeling
Email Signals

Automated ML Labeling

The following heuristics are used to annotate labels for Enron email data using the other two data sources:

Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.

Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.

The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
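To make the two ingredients named above concrete, here is a standard-library sketch of a TF-IDF embedding and the Gaussian (RBF) kernel that such an SVM would evaluate between two embedded emails. It does not train an SVM and is not the annotators' implementation: whitespace tokenization, the smoothed IDF formula, the `gamma` value, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2) on sparse vectors."""
    terms = set(u) | set(v)
    sq_dist = sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms)
    return math.exp(-gamma * sq_dist)


# Invented example texts, not taken from any of the datasets.
docs = [
    "verify your account password now",
    "please verify your password",
    "meeting agenda for monday",
]
vecs = tfidf_vectors(docs)
similar = rbf_kernel(vecs[0], vecs[1])
dissimilar = rbf_kernel(vecs[0], vecs[2])
print(similar > dissimilar)
```

The kernel value is 1 for identical vectors and decays toward 0 as the TF-IDF vectors move apart, which is the similarity measure the SVM decision function is built on.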

If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.

Email Signals

Email Signal-based heuristics are used to filter and target suspicious emails specifically for fraud labeling. The signals used were:

Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.

Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. Some folders, such as Deleted Items and Junk, have a higher chance of containing fraud emails.

Sender Type: The sender type was categorized as ‘Internal’ or ‘External’ based on the sender's email address.

Low Communication: A threshold of 4 emails based on the table below was used to define Low Communication. A user qualifies as a Low-Comm sender if their number of emails is below this threshold. Mails sent from low-comm senders have been assigned a high probability of being fraud.

Contains Replies and Forwards: If an email contains forwards or replies, it was assigned a low probability of being a fraud email.
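One way to picture how the model predictions and the email signals might combine is the sketch below. The description does not give the actual rule, so everything numeric here is hypothetical except the low-communication threshold of 4: the equal signal weights, the `min_score` cut-off, the `enron.com` internal domain, the choice to count external senders as more suspicious, and all class and function names are assumptions.

```python
from dataclasses import dataclass

SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}
LOW_COMM_THRESHOLD = 4  # threshold stated in the description


@dataclass
class Email:
    sender: str
    folder: str
    sender_email_count: int     # emails seen from this sender
    has_reply_or_forward: bool
    mailbox_is_poi: bool        # mailbox belongs to a person of interest


def sender_type(sender, internal_domain="enron.com"):
    """Categorize the sender as 'Internal' or 'External' by address domain."""
    return "Internal" if sender.lower().endswith("@" + internal_domain) else "External"


def signal_score(email):
    """Add one point per suspicious signal, subtract one for replies/forwards."""
    score = 0
    if email.mailbox_is_poi:
        score += 1
    if email.folder.lower() in SUSPICIOUS_FOLDERS:
        score += 1
    if sender_type(email.sender) == "External":  # assumption: external is riskier
        score += 1
    if email.sender_email_count < LOW_COMM_THRESHOLD:
        score += 1
    if email.has_reply_or_forward:
        score -= 1
    return score


def label_fraud(email, phishing_pred, social_eng_pred, min_score=2):
    """Label as fraud only if a model flags the email and the signals agree."""
    if not (phishing_pred or social_eng_pred):
        return False
    return signal_score(email) >= min_score


spam = Email("promo@offers.example", "junk", 1, False, False)
print(label_fraud(spam, phishing_pred=True, social_eng_pred=False))
```

Gating on the model prediction first mirrors the order given in the Automated ML Labeling section: the signals are only consulted for emails that at least one annotator model already flagged.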

Manual Inspection

To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.

Dataset Breakdown

Fraud Non-Fraud
2327 445090

Citations

Enron Dataset
Title: Enron Email Dataset
URL: https://www.cs.cmu.edu/~enron/
Publisher: MIT, CMU
Author: Leslie Kaelbling, William W. Cohen
Year: 2015

Phishing Email Detection Dataset
Title: Phishing Email Detection
URL: https://www.kaggle.com/dsv/6090437
DOI: 10.34740/KAGGLE/DSV/6090437
Publisher: Kaggle
Author: Subhadeep Chakraborty
Year: 2023

CLAIR Fraud Email Collection
Title: CLAIR collection of fraud email
URL: http://aclweb.org/aclwiki
Author: Radev, D.
Year: 2008
Promedio de los resultados por sector
getresponse.com
Updated Apr 2, 2024
Cite
GetResponse (2024). Promedio de los resultados por sector [Dataset]. https://www.getresponse.com/es/recursos/reports/benchmark-de-email-marketing
Dataset updated
Apr 2, 2024
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here, we've compiled email marketing benchmarks by sector. You'll see how your open rates, CTR, CTOR, unsubscribes, and spam complaints compare with those of other companies in your market.
Número de autoresponders em um ciclo
getresponse.com
Updated Dec 21, 2023
Cite
GetResponse (2023). Número de autoresponders em um ciclo [Dataset]. https://www.getresponse.com/pt/resources/reports/benchmark-de-email-marketing
Dataset updated
Dec 21, 2023
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How many emails should you put in an automated sequence? We investigated how engagement metrics change depending on the number of messages our customers used in their autoresponder cycles.
Durchschnittliche Ergebnisse nach Branche
getresponse.com
Updated Jun 21, 2023
Cite
GetResponse (2023). Durchschnittliche Ergebnisse nach Branche [Dataset]. https://www.getresponse.com/de/resources/reports/email-marketing-benchmarks
Dataset updated
Jun 21, 2023
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here we have compiled email marketing benchmarks by industry. You can see how your average email open, click-through, click-to-open, unsubscribe, and spam complaint rates compare with those of other companies in your industry.
Número de autoresponders en un ciclo
getresponse.com
Updated Apr 2, 2024
Cite
GetResponse (2024). Número de autoresponders en un ciclo [Dataset]. https://www.getresponse.com/es/recursos/reports/benchmark-de-email-marketing
Dataset updated
Apr 2, 2024
Dataset authored and provided by
GetResponse
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How many emails should you include in an automated sequence? We investigated how engagement results change depending on the number of messages our customers put in their autoresponder cycles.