Context Malicious URLs or malicious website is a very serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of Malicious URLs so that a machine learning-based model can be developed to identify malicious urls so that we can stop them in advance before infecting computer system or spreading through inteinternet.
Content we have collected a huge dataset of 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. Figure 2 depicts their distribution in terms of percentage. As we know one of the most crucial tasks is to curate the dataset for a machine learning project. We have curated this dataset from five different sources.
For collecting benign, phishing, malware and defacement URLs we have used URL dataset (ISCX-URL-2016) For increasing phishing and malware URLs, we have used Malware domain black list dataset. We have increased benign URLs using faizan git repo At last, we have increased more number of phishing URLs using Phishtank dataset and PhishStorm dataset As we have told you that dataset is collected from different sources. So firstly, we have collected the URLs from different sources into a separate data frame and finally merge them to retain only URLs and their class type.
Context Collection of Facebook spam-legit profile and content-based data. It can be used for classification tasks.
Content The dataset can be used for building machine learning models. To collect the dataset, Facebook API and Facebook Graph API are used and the data is collected from public profiles. There are 500 legit profiles and 100 spam profiles. The list of features is as follows with Label (0-legit, 1-spam). 1. Number of friends 2. Number of followings 3. Number of Community 4. The age of the user account (in days) 5. Total number of posts shared 6. Total number of URLs shared 7. Total number of photos/videos shared 8. Fraction of the posts containing URLs 9. Fraction of the posts containing photos/videos 10. Average number of comments per post 11. Average number of likes per post 12. Average number of tags in a post (Rate of tagging) 13. Average number of hashtags present in a post
Inspiration Dataset helps the community to understand how features can help to differ Facebook legit users from spam users.
This dataset was curated for Search Engine Optimization (SEO) analysis tasks, including categorization and spam detection. It covers 12 diverse topics: basketball, books, cats, gardening, history, movies, music, recipes, sports, technology, travel, and weather. Some topics have hierarchical relationships, such as sports and basketball, while others are closely related (e.g., movies and music) or unrelated (e.g., basketball and gardening), with varying degrees of overlap among them. For each topic, approximately 300 search queries were generated using large language models (LLMs) like GPT, Llama, and Claude. The top 10 URLs from the Google Search Console’s search engine results page (SERP) were retrieved for each query.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was forked from: https://www.kaggle.com/datasets/mfaisalqureshi/spam-email/ and translated to Indonesian Language by me. This dataset consists of 2620 data, which consists of 1362 spam messages and 1258 non-spam messages (ham) [52%:48%].
Domain: The URL itself. Ranking: Page Ranking isIp: Is there an IP address in the weblink valid: This data is fetched from google's whois API that tells us more about the current status of the URL's registration. activeDuration: Also from whois API. Gives the duration of the time since the registration up until now. urlLen: It is simply the length of the URL is@: If the link has a '@' character then it's value = 1 isredirect: If the link has double dashes, there is a chance that it is a redirect. 1-> multiple dashes present together. haveDash: If there are any dashes in the domain name. domainLen: The length of just the domain name. noOfSubdomain: The number of subdomains preset in the URL. Labels: 0 -> Legitimate website , 1 -> Phishing Link/ Spam Link
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Paragraph 1: The global URL Shortener Software market size was valued at USD XXX million in 2025 and is projected to reach USD XXX million by 2033, exhibiting a CAGR of XX% during the forecast period from 2025 to 2033. The market growth is primarily driven by the increasing need to track and manage links effectively. With the rise of social media and email marketing, businesses require tools to shorten long and complex URLs while maintaining their functionality and trackability. The cloud-based deployment model is gaining popularity due to its scalability, cost-effectiveness, and ease of access. Paragraph 2: Major trends shaping the market include the adoption of AI-powered features to enhance link analysis and optimization. AI-driven URL shorteners can automatically tag and categorize links, identify spam or malicious URLs, and provide advanced analytics to improve campaign performance. Additionally, the integration of URL shortening capabilities within social media platforms and content management systems is expected to further drive market growth. Key players in the market include Hootsuite, Twitter, Bitly, and Rebrandly, among others. The market is expected to witness increased competition as new entrants emerge offering innovative features and competitive pricing. URL shortener software has emerged as a crucial tool in the digital age, enabling users to condense lengthy website addresses into manageable and shareable formats. This report provides in-depth insights into this software, analyzing market dynamics, key trends, and industry leaders.
https://whoisdatacenter.com/terms-of-use/https://whoisdatacenter.com/terms-of-use/
Explore historical ownership and registration records by performing a reverse Whois lookup for the email address domains@spam.com..
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
External URLs matching this list will be blocked when added to a page This list affects only this wiki refer also to the
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
What are the average email marketing results in different countries? Here’s what we’ve found.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here, we’ve gathered email marketing benchmarks by industry. You can see how your average email open, click-through, click-to-open, unsubscribe, and spam complaint rates compare against other companies in your industry.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How many emails should you put into your autoresponder cycle? We’ve analyzed how the average engagement metrics change depending on the number of emails our customers used in their autoresp onder cycles.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one of the problems with the dataset is the availability of labeled frauds in the dataset. Label annotation is done to detect an umbrella of fraud emails accurately. Since, fraud emails fall into several types such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, there have to be multiple heuristics used to label all types of fraudulent emails effectively.
To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.
To perform fraud annotation on the Enron dataset as well as provide more fraud examples for modeling, two more fraud data sources have been used, Phishing Email Dataset: https://www.kaggle.com/dsv/6090437 Social Engineering Dataset: http://aclweb.org/aclwiki
To label the Enron email dataset two signals are used to filter suspicious emails and label them into fraud and non-fraud classes. Automated ML labeling Email Signals
The following heuristics are used to annotate labels for Enron email data using the other two data sources,
Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.
Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.
The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.
Email Signal-based heuristics are used to filter and target suspicious emails for fraud labeling specifically. The signals used were,
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.
Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. A set of folders with a higher chance of containing fraud emails, such as Deleted Items and Junk.
Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.
Low Communication: A threshold of 4 emails based on the table below was used to define Low Communication. A user qualifies as a Low-Comm sender if their emails are below this threshold. Mails sent from low-comm senders have been assigned with a high probability of being a fraud.
Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.
Fraud | Non-Fraud |
---|---|
2327 | 445090 |
Enron Dataset Title: Enron Email Dataset URL: https://www.cs.cmu.edu/~enron/ Publisher: MIT, CMU Author: Leslie Kaelbling, William W. Cohen Year: 2015
Phishing Email Detection Dataset Title: Phishing Email Detection URL: https://www.kaggle.com/dsv/6090437 DOI: 10.34740/KAGGLE/DSV/6090437 Publisher: Kaggle Author: Subhadeep Chakraborty Year: 2023
CLAIR Fraud Email Collection Title: CLAIR collection of fraud email URL: http://aclweb.org/aclwiki Author: Radev, D. Year: 2008
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aquí, hemos recopilado los benchmarks de email marketing por sector. Verás cómo tus tasas de apertura, CTR, CTOR, suscripciones canceladas y quejas de spam se comparan con las de otras empresas en tu mercado.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantos e-mails você deveria colocar em uma sequência automática? Investigamos como as métricas de engajamento mudam dependendo do número de mensagens que os nossos clientes usaram nos ciclos de autoresponder.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hier haben wir Benchmarks für das E-Mail-Marketing nach Branchen zusammengestellt. Du kannst sehen, wie deine durchschnittlichen E-Mail-Öffnungs-, Click-Through-, Click-to-Open-, Abmelde- und Spam-Beschwerderaten im Vergleich zu anderen Unternehmen in deiner Branche aussehen.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
¿Cuántos emails deberías incluir en una secuencia automática? Investigamos cómo los resultados de engagement cambian según el número de mensajes que nuestros clientes pusieron en sus ciclos de autoresponder.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Context Malicious URLs or malicious website is a very serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We have collected this dataset to include a large number of examples of Malicious URLs so that a machine learning-based model can be developed to identify malicious urls so that we can stop them in advance before infecting computer system or spreading through inteinternet.
Content we have collected a huge dataset of 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. Figure 2 depicts their distribution in terms of percentage. As we know one of the most crucial tasks is to curate the dataset for a machine learning project. We have curated this dataset from five different sources.
For collecting benign, phishing, malware and defacement URLs we have used URL dataset (ISCX-URL-2016) For increasing phishing and malware URLs, we have used Malware domain black list dataset. We have increased benign URLs using faizan git repo At last, we have increased more number of phishing URLs using Phishtank dataset and PhishStorm dataset As we have told you that dataset is collected from different sources. So firstly, we have collected the URLs from different sources into a separate data frame and finally merge them to retain only URLs and their class type.