Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Web has long become a major platform for online criminal activities. URLs are used as the main vehicle in this domain. To counter this issues security community focused its efforts on developing techniques for mostly blacklisting of malicious URLs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have curated 7 repositories.The Ling and Enron datasets possess just two features: ‘Subject’ and ‘Body’. The other datasets consists of six features, namely ‘Sender’, ‘Receiver’, ‘Date’, ‘Subject’, ‘Body’, and ‘Urls’.Please cite this dataset:A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Curated datasets and feature analysis for phishing email detection with machine learning,” in 3rd IEEE International Conference on Computing and Machine Intelligence (ICMI), 2024, pp. 1–7 (to appear).or @inproceedings{champa2024curated, title={Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning}, author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F}, booktitle={3rd IEEE International Conference on Computing and Machine Intelligence (ICMI)}, pages = {1--7 (to appear)}, year={2024}}
https://whoisdatacenter.com/terms-of-use/https://whoisdatacenter.com/terms-of-use/
Explore historical ownership and registration records by performing a reverse Whois lookup for the email address domains@spam.com..
Domain: The URL itself. Ranking: Page Ranking isIp: Is there an IP address in the weblink valid: This data is fetched from google's whois API that tells us more about the current status of the URL's registration. activeDuration: Also from whois API. Gives the duration of the time since the registration up until now. urlLen: It is simply the length of the URL is@: If the link has a '@' character then it's value = 1 isredirect: If the link has double dashes, there is a chance that it is a redirect. 1-> multiple dashes present together. haveDash: If there are any dashes in the domain name. domainLen: The length of just the domain name. noOfSubdomain: The number of subdomains preset in the URL. Labels: 0 -> Legitimate website , 1 -> Phishing Link/ Spam Link
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Paragraph 1: The global URL Shortener Software market size was valued at USD XXX million in 2025 and is projected to reach USD XXX million by 2033, exhibiting a CAGR of XX% during the forecast period from 2025 to 2033. The market growth is primarily driven by the increasing need to track and manage links effectively. With the rise of social media and email marketing, businesses require tools to shorten long and complex URLs while maintaining their functionality and trackability. The cloud-based deployment model is gaining popularity due to its scalability, cost-effectiveness, and ease of access. Paragraph 2: Major trends shaping the market include the adoption of AI-powered features to enhance link analysis and optimization. AI-driven URL shorteners can automatically tag and categorize links, identify spam or malicious URLs, and provide advanced analytics to improve campaign performance. Additionally, the integration of URL shortening capabilities within social media platforms and content management systems is expected to further drive market growth. Key players in the market include Hootsuite, Twitter, Bitly, and Rebrandly, among others. The market is expected to witness increased competition as new entrants emerge offering innovative features and competitive pricing. URL shortener software has emerged as a crucial tool in the digital age, enabling users to condense lengthy website addresses into manageable and shareable formats. This report provides in-depth insights into this software, analyzing market dynamics, key trends, and industry leaders.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had several integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one of the problems with the dataset is the availability of labeled frauds in the dataset. Label annotation is done to detect an umbrella of fraud emails accurately. Since, fraud emails fall into several types such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, there have to be multiple heuristics used to label all types of fraudulent emails effectively.
To tackle this problem, heuristics have been used to label the Enron data corpus using email signals, and automated labeling has been performed using simple ML models on other smaller email datasets available online. These fraud annotation techniques are discussed in detail below.
To perform fraud annotation on the Enron dataset as well as provide more fraud examples for modeling, two more fraud data sources have been used, Phishing Email Dataset: https://www.kaggle.com/dsv/6090437 Social Engineering Dataset: http://aclweb.org/aclwiki
To label the Enron email dataset two signals are used to filter suspicious emails and label them into fraud and non-fraud classes. Automated ML labeling Email Signals
The following heuristics are used to annotate labels for Enron email data using the other two data sources,
Phishing Model Annotation: A high-precision SVM model trained on the Phishing mails dataset, which is used to annotate the Phishing Label on the Enron Dataset.
Social Engineering Model Annotation: A high-precision SVM model trained on the Social Engineering mails dataset, which is used to annotate the Social Engineering Label on the Enron Dataset.
The two ML Annotator models use Term Frequency Inverse Document Frequency (TF-IDF) to embed the input text and make use of SVM models with Gaussian Kernel.
If either of the models predicted that an email was a fraud, the mail metadata was checked for several email signals. If these heuristics meet the requirements of a high-probability fraud email, we label it as a fraud email.
Email Signal-based heuristics are used to filter and target suspicious emails for fraud labeling specifically. The signals used were,
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes have a higher chance of containing quality fraud emails.
Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. A set of folders with a higher chance of containing fraud emails, such as Deleted Items and Junk.
Sender Type: The sender type was categorized as ‘Internal’ and ‘External’ based on their email address.
Low Communication: A threshold of 4 emails based on the table below was used to define Low Communication. A user qualifies as a Low-Comm sender if their emails are below this threshold. Mails sent from low-comm senders have been assigned with a high probability of being a fraud.
Contains Replies and Forwards: If an email contains forwards or replies, a low probability was assigned for it to be a fraud email.
To ensure high-quality labels, the mismatch examples from ML Annotation have been manually inspected for Enron dataset relabeling.
Fraud | Non-Fraud |
---|---|
2327 | 445090 |
Enron Dataset Title: Enron Email Dataset URL: https://www.cs.cmu.edu/~enron/ Publisher: MIT, CMU Author: Leslie Kaelbling, William W. Cohen Year: 2015
Phishing Email Detection Dataset Title: Phishing Email Detection URL: https://www.kaggle.com/dsv/6090437 DOI: 10.34740/KAGGLE/DSV/6090437 Publisher: Kaggle Author: Subhadeep Chakraborty Year: 2023
CLAIR Fraud Email Collection Title: CLAIR collection of fraud email URL: http://aclweb.org/aclwiki Author: Radev, D. Year: 2008
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Web has long become a major platform for online criminal activities. URLs are used as the main vehicle in this domain. To counter this issues security community focused its efforts on developing techniques for mostly blacklisting of malicious URLs.