14 datasets found
  1. Object-Centric Event Log (OCEL) of the Enron Email Dataset

    • zenodo.org
    bin
    Updated May 26, 2025
    Cite
    Alessandro Berti (2025). Object-Centric Event Log (OCEL) of the Enron Email Dataset [Dataset]. http://doi.org/10.5281/zenodo.15516869
    Explore at:
    Available download formats: bin
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessandro Berti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides an object-centric event log (OCEL) representation of the publicly available Enron email corpus. The OCEL format allows for a richer analysis of interconnected processes and objects, making it particularly suitable for advanced process mining techniques, communication pattern analysis, and social network exploration.

    The event logs were generated from a pre-processed CSV version of the Enron emails using a custom Python script leveraging the PM4Py library. The script parses individual emails to extract key information, including:

    • Timestamps: Derived from the 'Date' field of emails, parsed into timezone-aware datetime objects.
    • Activities: Inferred from email subject prefixes (e.g., "Re:" becomes "Response", "Fw:" becomes "Forwarding", "Invitation:" becomes "Invitation"). Emails without recognized prefixes are assigned a "Default" activity.
    • Objects: Two primary object types are identified:
      • EMAILADDRESS: Extracted from 'From', 'To', and 'Cc' fields.
      • MESSAGEID: Extracted from 'Message-ID', 'In-Reply-To', and 'References' fields, prefixed with "MID_" in the OCEL to ensure unique object identifiers across types.
    • Attributes: Event attributes include the original cleaned subject and content of the email.
    • Relationships: Events (emails) are linked to EMAILADDRESS objects with qualifiers 'FROM', 'TO', or 'CC'. Events are linked to MESSAGEID objects with qualifiers 'MESSAGEID' (for the email's own ID), 'INREPLYTO', or 'REFERENCES' to trace conversational threads.
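    The extraction rules listed above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the author's PM4Py script: the names `infer_activity` and `email_to_event`, the returned dictionary layout, and the sample message are hypothetical, and only the three prefix mappings named in the description are included.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Prefix-to-activity mapping taken from the description; any other
# prefixes the original script may recognize are not known here.
PREFIX_TO_ACTIVITY = {
    "re:": "Response",
    "fw:": "Forwarding",
    "invitation:": "Invitation",
}


def infer_activity(subject: str) -> str:
    """Map a subject line to an activity name, falling back to 'Default'."""
    lowered = subject.strip().lower()
    for prefix, activity in PREFIX_TO_ACTIVITY.items():
        if lowered.startswith(prefix):
            return activity
    return "Default"


def email_to_event(raw: str) -> dict:
    """Turn one raw email into an event with qualified object relations."""
    msg = message_from_string(raw)
    relations = []  # (object id, object type, qualifier)
    for header, qualifier in (("From", "FROM"), ("To", "TO"), ("Cc", "CC")):
        for _name, addr in getaddresses(msg.get_all(header, [])):
            if addr:
                relations.append((addr, "EMAILADDRESS", qualifier))
    for header, qualifier in (
        ("Message-ID", "MESSAGEID"),
        ("In-Reply-To", "INREPLYTO"),
        ("References", "REFERENCES"),
    ):
        for value in msg.get_all(header, []):
            for mid in value.split():
                # The "MID_" prefix keeps message IDs distinct from addresses.
                relations.append(("MID_" + mid.strip("<>"), "MESSAGEID", qualifier))
    return {
        "timestamp": parsedate_to_datetime(msg["Date"]),
        "activity": infer_activity(msg.get("Subject", "")),
        "relations": relations,
    }


# Hypothetical sample message, not taken from the corpus.
SAMPLE = (
    "Message-ID: <2.example@thyme>\n"
    "In-Reply-To: <1.example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Re: meeting\n"
    "\n"
    "See you there.\n"
)
event = email_to_event(SAMPLE)
```

    Keeping relations as (object id, object type, qualifier) triples mirrors the qualified event-to-object links described above; how the real script stores them before export is not stated in the description.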

    To accommodate various analytical needs and computational resources, the dataset is provided in three distinct checkpoints:

    1. Top 10,000 Emails: An OCEL generated from the first 10,000 emails processed.
    2. Top 100,000 Emails: An OCEL generated from the first 100,000 emails processed.
    3. All Emails: An OCEL generated from all emails processed by the script from the input emails.csv file.

    Each checkpoint is available in the .jsonocel format (OCEL 2.0 standard), ready for use with PM4Py and other OCEL-compatible process mining tools. This dataset can be valuable for researchers and practitioners seeking to apply object-centric process discovery, conformance checking, and enhancement techniques to a large, real-world communication log.

    Keywords: Object-Centric Event Log, OCEL, Process Mining, Enron Dataset, Email Analysis, Communication Networks, Social Network Analysis, PM4Py

  2. enron_curated_labeled

    • huggingface.co
    Cite
    Mark Davey, enron_curated_labeled [Dataset]. https://huggingface.co/datasets/kariatouk/enron_curated_labeled
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Mark Davey
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Enron Email Dataset

      Description
    

    The Enron Email Dataset is a collection of emails from the Enron Corporation, which was one of the largest energy companies in the United States. This dataset is widely used for various natural language processing (NLP) tasks, such as email classification, sentiment analysis, and named entity recognition.

      Processed Dataset
    

    This file could not be uploaded due to its size (500MB). The processed dataset, named enron_processed.zip, is a… See the full description on the dataset page: https://huggingface.co/datasets/kariatouk/enron_curated_labeled.

  3. Aeslc (Email Subject Generation Task)

    • kaggle.com
    zip
    Updated Dec 1, 2022
    Cite
    The Devastator (2022). Aeslc (Email Subject Generation Task) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-enron-employees-secrets-exploring-the
    Explore at:
    Available download formats: zip (4991904 bytes)
    Dataset updated
    Dec 1, 2022
    Authors
    The Devastator
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Aeslc (Email Subject Generation Task)

    A collection of email messages of employees in the Enron Corporation.

    By Huggingface Hub [source]

    About this dataset

    The AESLC (Annotated Enron Subject Line Corpus) dataset offers a view into the working lives of Enron employees through emails sent between 1999 and 2004. The anonymized emails give insight into daily professional activities, interactions, and relationships among Enron employees, and they are a useful resource for anyone studying corporate communication. With features such as email bodies and subject lines, the dataset supports research in linguistics, sentiment analysis, and data mining, including messages exchanged before the scandal that caused the company's downfall.

    How to use the dataset

    This dataset contains anonymized emails sent by Enron employees between 1999 and 2004. Working with it can give deeper insight into the professional activities and relationships of former Enron employees.

    In this guide, we'll provide a walkthrough on how to use this dataset and make meaningful discoveries from it. Let's get started!

    Research Ideas

    • Analyzing the connections between Enron employees by tracking their email communications over time to uncover trends and correlations.
    • Examining the emails for keywords or topics as a way to classify each email in order to gain better understanding of what Enron employees were discussing and what activities they were engaging in.
    • Using sentiment analysis techniques on the emails to gain insight into the emotional state of Enron employees at different points in time or during particular events, such as when allegations against Enron emerged.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name  | Description                                                   |
    |:-------------|:--------------------------------------------------------------|
    | email_body   | The body of the email sent by Enron employees. (Text)         |
    | subject_line | The subject line of the email sent by Enron employees. (Text) |

    File: train.csv

    | Column name  | Description                                                   |
    |:-------------|:--------------------------------------------------------------|
    | email_body   | The body of the email sent by Enron employees. (Text)         |
    | subject_line | The subject line of the email sent by Enron employees. (Text) |

    File: test.csv

    | Column name  | Description                                                   |
    |:-------------|:--------------------------------------------------------------|
    | email_body   | The body of the email sent by Enron employees. (Text)         |
    | subject_line | The subject line of the email sent by Enron employees. (Text) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  4. Enron Email Time-Series Network

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    csv
    Updated Jan 24, 2020
    Cite
    Volodymyr Miz; Benjamin Ricaud; Pierre Vandergheynst (2020). Enron Email Time-Series Network [Dataset]. http://doi.org/10.5281/zenodo.1342353
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Volodymyr Miz; Benjamin Ricaud; Pierre Vandergheynst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We use the Enron email dataset to build a network of email addresses. It contains 614586 emails sent over the period from 6 January 1998 until 4 February 2004. During the pre-processing, we remove the periods of low activity and keep the emails from 1 January 1999 until 31 July 2002 which is 1448 days of email records in total. Also, we remove email addresses that sent fewer than three emails over that period. In total, the Enron email network contains 6 600 nodes and 50 897 edges.

    To build a graph G = (V, E), we use email addresses as nodes V. Every node v_i has an attribute which is a time-varying signal that corresponds to the number of emails sent from this address during a day. We draw an edge e_ij between two nodes i and j if there is at least one email exchange between the corresponding addresses.

    Column 'Count' in 'edges.csv' file is the number of 'From'->'To' email exchanges between the two addresses. This column can be used as an edge weight.

    The file 'nodes.csv' contains a dictionary that is a compressed representation of time-series. The format of the dictionary is Day->The Number Of Emails Sent By the Address During That Day. The total number of days is 1448.

    'id-email.csv' is a file containing the actual email addresses.
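
    The construction described above (a per-address daily signal plus 'From'->'To' counts per address pair) can be sketched as follows. This is only an illustration on made-up records, not the authors' pipeline: the `build_network` helper, the zero-based day index, and the choice to key edges by ordered (sender, recipient) pairs are assumptions, and the actual column layout of the CSV files is not reproduced here.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (sender, recipient, day) records; not taken from the dataset.
emails = [
    ("a@enron.com", "b@enron.com", date(1999, 1, 1)),
    ("a@enron.com", "b@enron.com", date(1999, 1, 1)),
    ("b@enron.com", "a@enron.com", date(1999, 1, 2)),
    ("a@enron.com", "c@enron.com", date(1999, 1, 3)),
]
START = date(1999, 1, 1)  # first day kept after pre-processing


def build_network(records, start):
    """Return (signals, edge_counts) for a list of email records."""
    # address -> {day index -> emails sent that day}; silent days are
    # omitted, which is what makes the dictionary a compressed time series.
    signals = defaultdict(Counter)
    # (sender, recipient) -> number of emails, usable as an edge weight.
    edge_counts = Counter()
    for sender, recipient, day in records:
        signals[sender][(day - start).days] += 1
        edge_counts[(sender, recipient)] += 1
    return signals, edge_counts


signals, edge_counts = build_network(emails, START)
```

    Storing only the non-zero days keeps the per-node signal small when most addresses are inactive on most days, which matches the "compressed representation" wording for 'nodes.csv'.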

  5. Enron Emails Dataset for Language Modeling

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    hrterhrter (2024). Enron Emails Dataset for Language Modeling [Dataset]. https://www.kaggle.com/datasets/programmerrdai/enron-emails-dataset-for-language-modeling/code
    Explore at:
    Available download formats: zip (737415481 bytes)
    Dataset updated
    May 21, 2024
    Authors
    hrterhrter
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Enron Emails dataset, formatted for language modeling applications, is a comprehensive collection of email communications from Enron Corporation. This dataset is invaluable for researchers and developers working on natural language processing (NLP), machine learning, and artificial intelligence projects.

    Key Features:

    • Source: The emails are sourced from the Enron Corporation email archive.
    • Format: The dataset is provided in a language modeling-friendly format, making it easy to use for training and testing NLP models.
    • Content: Each email contains metadata such as Message-ID, Date, From, To, Subject, Mime-Version, Content-Type, Content-Transfer-Encoding, X-From, X-To, X-cc, X-bcc, X-Folder, X-Origin, and X-FileName.
    • Usage: Ideal for tasks such as text generation, sentiment analysis, topic modeling, and more.

    Example Record:

    ```
    Message-ID: 18782981.1075855378110.JavaMail.evans@thyme
    Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
    From: phillip.allen@enron.com
    To: tim.belden@enron.com
    Subject:
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Phillip K Allen
    X-To: Tim Belden
    ```

  6. Phishing Email Dataset

    • kaggle.com
    zip
    Updated May 24, 2024
    Cite
    Naser Abdullah Alam (2024). Phishing Email Dataset [Dataset]. https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset
    Explore at:
    Available download formats: zip (80864554 bytes)
    Dataset updated
    May 24, 2024
    Authors
    Naser Abdullah Alam
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    PHISHING EMAIL DATASET

    This dataset was compiled by researchers to study phishing email tactics. It combines emails from a variety of sources to create a comprehensive resource for analysis.

    Initial Datasets:

    • Enron and Ling Datasets: These datasets focus on the core content of phishing emails, containing subject lines, email body text, and labels indicating whether the email is spam (phishing) or legitimate.

    • CEAS, Nazario, Nigerian Fraud, and SpamAssassin Datasets: These datasets provide broader context for the emails, including sender information, recipient information, date, and labels for spam/legitimate classification.

    Final Dataset:

    The final dataset combines the information from the initial datasets into a single resource for analysis. This dataset contains:

    • Approximately 82,500 emails
    • 42,891 spam emails
    • 39,595 legitimate emails

    This dataset allows researchers to study the content of phishing emails and the context in which they are sent to improve detection methods.

    Please cite the following two articles if you are using this dataset:

    • Al-Subaiey, A., Al-Thani, M., Alam, N. A., Antora, K. F., Khandakar, A., & Zaman, S. A. U. (2024, May 19). Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. ArXiv.org. https://arxiv.org/abs/2405.11619
  7. enron

    • huggingface.co
    Updated Jan 2, 2003
    Cite
    KP (2003). enron [Dataset]. https://huggingface.co/datasets/SuccessfulCrab/enron
    Explore at:
    Dataset updated
    Jan 2, 2003
    Authors
    KP
    Description

    Enron Email Dataset — Entity Extraction

    This dataset contains the results of running the Entity Extractor service over the complete Enron email corpus. The Entity Extractor is a Python-based text analysis tool that identifies and extracts structured data from unstructured text using a combination of regex pattern matching and deep learning NER models.

      Dataset Overview

    | Property   | Value                                 |
    |:-----------|:--------------------------------------|
    | Source     | Enron Email Corpus                    |
    | Processing | Entity Extractor (Azure ML Endpoint)… |

    See the full description on the dataset page: https://huggingface.co/datasets/SuccessfulCrab/enron.

  8. email-blog

    • kaggle.com
    zip
    Updated Oct 9, 2021
    Cite
    Mike Schmidt-Avemac (2021). email-blog [Dataset]. https://www.kaggle.com/datasets/mikeschmidtavemac/emailblog
    Explore at:
    Available download formats: zip (15977463 bytes)
    Dataset updated
    Oct 9, 2021
    Authors
    Mike Schmidt-Avemac
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Context

    Supervised classification dataset produced as part of a blog series on classifying corporate email for morale and professional alignment. Series covers raw data extraction, analysis, unsupervised topic discovery and supervised model development.

    The blog posts are available at:

    Part 1. Raw email processing. https://www.avemacconsulting.com/2021/08/24/email-insights-from-data-science-techniques-part-1/
    Part 2. Data analysis. https://www.avemacconsulting.com/2021/08/27/email-insights-from-data-science-part-2/
    Part 3. Unsupervised topic classification (creates this dataset). https://www.avemacconsulting.com/2021/09/23/email-insights-from-data-science-part-3/
    Part 4. Supervised modeling (uses this dataset). https://www.avemacconsulting.com/2021/10/12/email-insights-from-data-science-part-4/

    Note: This data is part of a blog series, so it is not fully vetted. In particular, the unsupervised topic extraction step should be tuned further for accuracy.

    Content

    Original email content taken from the public Enron email repository located at https://www.cs.cmu.edu/~enron/.

    Dataset contains email body text, various supporting features (email addresses, date/time, etc.) plus multiple classification labels.

    Three (3) labels were generated for sentiment, each with three (3) classes (positive / negative / neutral or unknown). Three (3) labels were also created for alignment (business/personal), each with two (2) classes (fun/work).

    Acknowledgements

    Uses sentiment lexicon from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

    Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

    Uses VADER from https://www.nltk.org/api/nltk.sentiment.html?highlight=vader#module-nltk.sentiment.vader

    Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

    Uses AFINN from http://corpustext.com/reference/sentiment_afinn.html

    Finn Årup Nielsen. "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs." Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, volume 718 of CEUR Workshop Proceedings, pp. 93-98. May 2011.

  9. Two Dynamic Attributed Networks: Enron & Jazz LastFM

    • data.niaid.nih.gov
    Updated Oct 1, 2024
    Cite
    Orman, Günce Keziban; Labatut, Vincent; Plantevit, Marc; Boulicaut, Jean-François (2024). Two Dynamic Attributed Networks: Enron & Jazz LastFM [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6815611
    Explore at:
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Galatasaray University
    Authors
    Orman, Günce Keziban; Labatut, Vincent; Plantevit, Marc; Boulicaut, Jean-François
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description. This repository contains two dynamic and attributed social networks extracted from the well-known Enron email dataset, and from the LastFM online music platform. We used both networks in the following papers:

    [1] G. K. Orman, V. Labatut, M. Plantevit, and J.-F. Boulicaut, “A Method for Characterizing Communities in Dynamic Attributed Complex Networks,” in IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 2014, pp. 481–484. ⟨hal-01011913⟩ DOI: 10.1109/ASONAM.2014.6921629

    [2] G. K. Orman, V. Labatut, M. Plantevit, and J.-F. Boulicaut, “Interpreting communities based on the evolution of a dynamic attributed network,” Social Network Analysis and Mining, vol. 5, p. 20, 2015. ⟨hal-01163778⟩ DOI: 10.1007/s13278-015-0262-4

    Citation. If you use these data, please cite the paper [1].

    @InProceedings{Orman2014,
      author = {Orman, Günce Keziban and Labatut, Vincent and Plantevit, Marc and Boulicaut, Jean-François},
      title = {A Method for Characterizing Communities in Dynamic Attributed Complex Networks},
      booktitle = {IEEE/ACM International Conference on Advances in Social Network Analysis and Mining},
      year = {2014},
      pages = {481-484},
      address = {Beijing, CN},
      publisher = {IEEE Publishing},
      doi = {10.1109/ASONAM.2014.6921629},
    }

    Enron dataset. Enron is a well-known dataset in network science and text mining. It has been widely studied in academia. In network science, several different static networks appear in the literature. However, up to now, no dynamic network has been published, even though the email conversations have timestamps.

    We processed the original dataset to extract a dynamic network. There are 158 nodes representing Enron employees between 1997 and 2002. All the addresses in the From and To fields of each email are considered, resulting in a network of 28,802 nodes representing distinct email addresses. A time span of one month is chosen for the time slices, generating 46 time slices. Two nodes are connected if the corresponding persons emailed each other during the given time slice. We did not make any distinction between sender and receiver, and thus produced an undirected dynamic network.
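
    The monthly slicing described above can be illustrated with a short sketch. It is not the authors' code: the `monthly_slices` helper and the sample records are hypothetical, and it assumes that a single email in either direction during a month is enough to connect two addresses in that slice.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (sender, recipient, timestamp) records.
messages = [
    ("ann", "bob", datetime(2001, 5, 3)),
    ("bob", "ann", datetime(2001, 5, 20)),
    ("ann", "cat", datetime(2001, 6, 1)),
]


def monthly_slices(records):
    """Group undirected edges by calendar month: (year, month) -> set of pairs."""
    slices = defaultdict(set)
    for sender, recipient, ts in records:
        if sender != recipient:
            # frozenset drops direction, so ann->bob and bob->ann collapse.
            slices[(ts.year, ts.month)].add(frozenset((sender, recipient)))
    return dict(slices)


slices = monthly_slices(messages)
```

    Each key is one time slice, so the sequence of edge sets ordered by key is the undirected dynamic network.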

    LastFM dataset. LastFM is a music website that allows its members to register and listen to music online. It is also a social network platform, because its members can declare friendship relationships. In LastFM, members can join a predefined group related to their music tastes, and participate in music-related events such as concerts. Using the LastFM API, one can retrieve the artists and tracks a user has listened to, with the exact timestamps. Moreover, it is also possible to get some information regarding the music-related events the users joined, including the exact timestamps.

    We extracted a network by focusing on the members of the Jazz group, which is supposed to include users appreciating this type of music. We took advantage of the LastFM API to retrieve the members of this group and the existing friendship connections between them. In the end, our network contains 1,702 nodes representing the Jazz users. The friendship relationships between them are static, though, in the sense that the LastFM API does not give access to any temporal information regarding their beginning or end. So, we decided to take advantage of some additional information to get a dynamic structure. We put a link between two nodes if two conditions were simultaneously true: 1) both considered users listened to at least one common artist during a specific period of time, and 2) they are friends on the LastFM platform. For the mentioned period of time, we decided to use 3 months with a 1-month overlap, after having analyzed the dynamics of the platform. In other words, we extracted a dynamic network in which each time slice represents three months of LastFM usage for our 1,702 users of interest. There is a one-month overlap between two consecutive time slices.
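
    A three-month window with a one-month overlap means consecutive slices start two months apart. The sketch below shows that windowing together with the two link conditions (friendship and at least one common artist within the window). The helper names, the month-index encoding, and the sample data are hypothetical; how the authors handled a trailing partial window is not stated, and this sketch simply drops it.

```python
from itertools import combinations

WINDOW = 3   # months per time slice
OVERLAP = 1  # months shared by consecutive slices
STEP = WINDOW - OVERLAP


def windows(n_months):
    """Yield (first_month, end_month_exclusive) index pairs."""
    start = 0
    while start + WINDOW <= n_months:
        yield start, start + WINDOW
        start += STEP


def slice_edges(listens, friends, start, end):
    """Edges for one slice.

    listens: iterable of (user, artist, month_index)
    friends: set of frozenset({user_a, user_b}) friendship pairs
    """
    artists = {}
    for user, artist, month in listens:
        if start <= month < end:
            artists.setdefault(user, set()).add(artist)
    return {
        frozenset((u, v))
        for u, v in combinations(sorted(artists), 2)
        if frozenset((u, v)) in friends and artists[u] & artists[v]
    }


# Hypothetical sample data.
listens = [
    ("u1", "Miles Davis", 0),
    ("u2", "Miles Davis", 2),
    ("u3", "Miles Davis", 1),
]
friends = {frozenset(("u1", "u2"))}
first_slice = slice_edges(listens, friends, 0, 3)
```

    Here u1 and u3 share an artist but are not friends, so only the u1-u2 link appears in the first slice.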

  10. Curated Dataset - Phishing Email

    • figshare.com
    bin
    Updated May 2, 2024
    Cite
    Anonymous20230623 Anonymous (2024). Curated Dataset - Phishing Email [Dataset]. http://doi.org/10.6084/m9.figshare.24899952.v2
    Explore at:
    Available download formats: bin
    Dataset updated
    May 2, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anonymous20230623 Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have curated 7 repositories. The Ling and Enron datasets possess just two features: ‘Subject’ and ‘Body’. The other datasets consist of six features, namely ‘Sender’, ‘Receiver’, ‘Date’, ‘Subject’, ‘Body’, and ‘Urls’.

    Please cite this dataset:

    A. I. Champa, M. F. Rabbi, and M. F. Zibran, “Curated datasets and feature analysis for phishing email detection with machine learning,” in 3rd IEEE International Conference on Computing and Machine Intelligence (ICMI), 2024, pp. 1–7 (to appear).

    or

    @inproceedings{champa2024curated,
      title={Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning},
      author={Champa, Arifa I and Rabbi, Md Fazle and Zibran, Minhaz F},
      booktitle={3rd IEEE International Conference on Computing and Machine Intelligence (ICMI)},
      pages = {1--7 (to appear)},
      year={2024}
    }

  11. Email Networks (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Email Networks (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-email
    Explore at:
    Available download formats: zip (4271412 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EU email communication network

    Dataset information

    The network was generated using email data from a large European research
    institution. For a period from October 2003 to May 2005 (18 months) we have
    anonymized information about all incoming and outgoing email of the research
    institution. For each sent or received email message we know the time, the
    sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution.
    Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either
    non-existing, mistyped or spam.

    Given a set of email messages, each node corresponds to an email address. We
    create a directed edge between nodes i and j, if i sent at least one message to j.

    Dataset statistics

    Nodes 265214
    Edges 420045
    Nodes in largest WCC 224832 (0.848)
    Edges in largest WCC 395270 (0.941)
    Nodes in largest SCC 34203 (0.129)
    Edges in largest SCC 151930 (0.362)
    Average clustering coefficient 0.3093
    Number of triangles 267313
    Fraction of closed triangles 0.004106
    Diameter (longest shortest path) 13
    90-percentile effective diameter 4.5
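
    The parenthesized values in the statistics above are fractions of the whole graph; for example, 224832 / 265214 ≈ 0.848 for the nodes in the largest WCC. The sketch below shows, on a toy message list, how such an edge set and the sizes of the largest weakly and strongly connected components could be computed. It is a plain standard-library illustration, not SNAP's implementation: the function names and the sample data are hypothetical, and the SCC routine uses Kosaraju's algorithm as a stand-in for whatever SNAP uses internally.

```python
from collections import defaultdict


def directed_edges(messages):
    """One directed edge (i, j) if i sent at least one message to j."""
    return {(s, r) for s, r in messages if s != r}


def largest_wcc(nodes, edges):
    """Size of the largest weakly connected component (direction ignored)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = defaultdict(int)
    for n in nodes:
        sizes[find(n)] += 1
    return max(sizes.values())


def largest_scc(nodes, edges):
    """Size of the largest strongly connected component (iterative Kosaraju)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    best, seen = 0, set()
    for root in reversed(order):
        if root in seen:
            continue
        seen.add(root)
        size, stack = 0, [root]
        while stack:
            node = stack.pop()
            size += 1
            for nxt in rev[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


# Hypothetical (sender, recipient) pairs.
messages = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "c"), ("a", "b"), ("e", "f")]
edges = directed_edges(messages)
nodes = sorted({n for edge in edges for n in edge})
wcc_fraction = largest_wcc(nodes, edges) / len(nodes)
scc_fraction = largest_scc(nodes, edges) / len(nodes)
```

    In the toy graph only a and b can reach each other, so the largest SCC is much smaller than the largest WCC, which is the same pattern as the 0.129 versus 0.848 node fractions reported above.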

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM
    TKDD), 1(1), 2007.

    Files
    File Description
    email-EuAll.txt.gz Email network of a large European Research Institution

    Enron email network

    Dataset information

    Enron email communication network covers all the email communication within a
    dataset of around half million emails. This data was originally made public,
    and posted to the web, by the Federal Energy Regulatory Commission during its
    investigation. Nodes of the network are email addresses and if an address i
    sent at least one email to address j, the graph contains an undirected edge from i to j. Note that non-Enron email addresses act as sinks and sources in the
    network as we only observe their communication with the Enron email addresses.

    The Enron email data was originally released by William Cohen at CMU.

    Dataset statistics
    Nodes 36692
    Edges 367662
    Nodes in largest WCC 33696 (0.918)
    Edges in largest WCC 361622 (0.984)
    Nodes in largest...

  12. Enron Clean dataset

    • kaggle.com
    zip
    Updated Jun 15, 2020
    Cite
    Aman Kumar (2020). Enron Clean dataset [Dataset]. https://www.kaggle.com/amank56/enron-clean-dataset
    Explore at:
    Available download formats: zip (223581699 bytes)
    Dataset updated
    Jun 15, 2020
    Authors
    Aman Kumar
    Description

    Context

    The idea started with building a smart model for a finance company to ensure that all employees comply with company policy. We were targeting email, chats, and SMS, with email as the starting point. As a proof of concept (POC), I worked on this dataset to find suspicious activities for audit. I was able to locate some of them, but user behavior has changed considerably in the years since these emails were written.

    Further reading/information:

    You can check out my GitHub page for the wrangling and cleaning process for this Enron dataset, along with some analysis: https://github.com/amank56/Enron

    Content

    There are 6 files with clean emails along with additional important attributes.

    Acknowledgements

    I would like to thank all those who were directly or indirectly involved in this task, from obtaining the baseline data to the final outcome.

  13. Communication Graphs

    • kaggle.com
    zip
    Updated Nov 15, 2021
    Cite
    Subhajit Sahu (2021). Communication Graphs [Dataset]. https://www.kaggle.com/wolfram77/graphs-communication
    Explore at:
    Available download formats: zip (66715371 bytes)
    Dataset updated
    Nov 15, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    email-EuAll: EU email communication network

    The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.

    Given a set of email messages, each node corresponds to an email address. We create a directed edge between nodes i and j, if i sent at least one message to j.

    email-Enron: Enron email network

    Enron email communication network covers all the email communication within a dataset of around half million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and if an address i sent at least one email to address j, the graph contains an undirected edge from i to j. Note that non-Enron email addresses act as sinks and sources in the network as we only observe their communication with the Enron email addresses.

    The Enron email data was originally released by William Cohen at CMU.

    wiki-Talk: Wikipedia Talk network

    Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page, that she and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of Wikipedia page edit history (from January 3 2008) we extracted all user talk page changes and created a network.

    The network contains all the users and discussions from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j means that user i edited a talk page of user j at least once.

    comm-f2f-Resistance: Dynamic Face-to-Face Interaction Networks

    The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5 to 8 participants and lasts 45 to 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.

    The networks are weighted, directed and temporal. Each node represents a participant. Every 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. We also provide a binary version in which an edge from u to v indicates that participant u looks at participant v (or the laptop).

    Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general-purpose STL (Standard Template Library)-like library developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#email

  14. Spam Email Classification Dataset

    • kaggle.com
    zip
    Updated Nov 6, 2023
    Cite
    Puru Singhvi (2023). Spam Email Classification Dataset [Dataset]. https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset
    Available download formats: zip (45062184 bytes)
    Dataset updated
    Nov 6, 2023
    Authors
    Puru Singhvi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Introduction

    This is a CSV file containing 83,446 email records, each labelled as either spam or not-spam. It was formed by combining the 2007 TREC Public Spam Corpus and the Enron-Spam Dataset.

    Columns

    1. label
      • '1' indicates that the email is classified as spam.
      • '0' denotes that the email is legitimate (ham).
    2. text
      • This column contains the actual content of the email messages.
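
    As a rough illustration of the two-column layout, the sketch below counts spam and ham records using only the Python standard library. It assumes the header row names the columns exactly `label` and `text` and that labels are stored as the strings '1' and '0'; the function name `count_labels` and the in-memory sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file) -> Counter:
    """Count spam ('1') and ham ('0') rows in a CSV with 'label' and 'text' columns."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Assumption: labels are the strings '1' (spam) and '0' (ham).
        counts["spam" if row["label"].strip() == "1" else "ham"] += 1
    return counts


# Hypothetical sample rows standing in for the real dataset file.
sample = io.StringIO(
    "label,text\n"
    '1,"win a prize now"\n'
    '0,"meeting moved to 10"\n'
    '0,"see attached report"\n'
)

label_counts = count_labels(sample)
```

    To run this on the downloaded file, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) instead of the `io.StringIO` sample.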

    Sources

    1. 2007 TREC Public Spam Corpus
    2. Enron-Spam Dataset

    Code for combining and processing the two datasets: https://github.com/PuruSinghvi/Spam-Email-Classifier/blob/main/Combining%20Datasets.ipynb

    Spam Email Classifier

    A spam email classifier has been trained and built using this dataset.
    It can be found here: https://github.com/PuruSinghvi/Spam-Email-Classifier


Alessandro Berti; Alessandro Berti (2025). Object-Centric Event Log (OCEL) of the Enron Email Dataset [Dataset]. http://doi.org/10.5281/zenodo.15516869

Object-Centric Event Log (OCEL) of the Enron Email Dataset

Available download formats: bin
Dataset updated
May 26, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alessandro Berti; Alessandro Berti
License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

This dataset provides an object-centric event log (OCEL) representation of the publicly available Enron email corpus. The OCEL format allows for a richer analysis of interconnected processes and objects, making it particularly suitable for advanced process mining techniques, communication pattern analysis, and social network exploration.

The event logs were generated from a pre-processed CSV version of the Enron emails using a custom Python script leveraging the PM4Py library. The script parses individual emails to extract key information, including:

  • Timestamps: Derived from the 'Date' field of emails, parsed into timezone-aware datetime objects.
  • Activities: Inferred from email subject prefixes (e.g., "Re:" becomes "Response", "Fw:" becomes "Forwarding", "Invitation:" becomes "Invitation"). Emails without recognized prefixes are assigned a "Default" activity.
  • Objects: Two primary object types are identified:
    • EMAILADDRESS: Extracted from 'From', 'To', and 'Cc' fields.
    • MESSAGEID: Extracted from 'Message-ID', 'In-Reply-To', and 'References' fields, prefixed with "MID_" in the OCEL to ensure unique object identifiers across types.
  • Attributes: Event attributes include the original cleaned subject and content of the email.
  • Relationships: Events (emails) are linked to EMAILADDRESS objects with qualifiers 'FROM', 'TO', or 'CC'. Events are linked to MESSAGEID objects with qualifiers 'MESSAGEID' (for the email's own ID), 'INREPLYTO', or 'REFERENCES' to trace conversational threads.
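
The extraction rules listed above can be sketched as follows. This is not the author's script and does not use PM4Py; it is a minimal plain-Python illustration. The helper names (`infer_activity`, `message_object_id`, `event_relationships`) are hypothetical, the prefix match is assumed to be case-insensitive, and each email is assumed to be a dict mapping header names to lists of strings.

```python
# Subject prefixes and the activity each one maps to, as described above.
PREFIX_TO_ACTIVITY = {
    "re:": "Response",
    "fw:": "Forwarding",
    "invitation:": "Invitation",
}


def infer_activity(subject: str) -> str:
    """Map a subject line to an activity; unrecognized prefixes give 'Default'."""
    normalized = subject.strip().lower()  # assumption: matching ignores case
    for prefix, activity in PREFIX_TO_ACTIVITY.items():
        if normalized.startswith(prefix):
            return activity
    return "Default"


def message_object_id(message_id: str) -> str:
    """Prefix a message ID with 'MID_' so it cannot collide with an email address."""
    return "MID_" + message_id.strip()


def event_relationships(email: dict) -> list:
    """Return (object_id, qualifier) pairs linking one email event to its objects."""
    relationships = []
    for field, qualifier in (("From", "FROM"), ("To", "TO"), ("Cc", "CC")):
        for address in email.get(field, []):
            relationships.append((address, qualifier))
    for field, qualifier in (
        ("Message-ID", "MESSAGEID"),
        ("In-Reply-To", "INREPLYTO"),
        ("References", "REFERENCES"),
    ):
        for message_id in email.get(field, []):
            relationships.append((message_object_id(message_id), qualifier))
    return relationships


# Hypothetical parsed email used only to exercise the helpers.
sample_email = {
    "Subject": ["Re: budget"],
    "From": ["a@example.org"],
    "To": ["b@example.org"],
    "Message-ID": ["<2@example.org>"],
    "In-Reply-To": ["<1@example.org>"],
}
activity = infer_activity(sample_email["Subject"][0])
relationships = event_relationships(sample_email)
```

The 'INREPLYTO' and 'REFERENCES' qualifiers point at the MESSAGEID objects of earlier emails, which is what allows a conversation thread to be followed from one event to the next.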

To accommodate various analytical needs and computational resources, the dataset is provided in three distinct checkpoints:

  1. Top 10,000 Emails: An OCEL generated from the first 10,000 emails processed.
  2. Top 100,000 Emails: An OCEL generated from the first 100,000 emails processed.
  3. All Emails: An OCEL generated from all emails processed by the script from the input emails.csv file.

Each checkpoint is available in the .jsonocel format (OCEL 2.0 standard), ready for use with PM4Py and other OCEL-compatible process mining tools. This dataset can be valuable for researchers and practitioners seeking to apply object-centric process discovery, conformance checking, and enhancement techniques to a large, real-world communication log.

Keywords: Object-Centric Event Log, OCEL, Process Mining, Enron Dataset, Email Analysis, Communication Networks, Social Network Analysis, PM4Py
