9 datasets found
  1. Communication Graphs

    • kaggle.com
    zip
    Updated Nov 15, 2021
    Cite
    Subhajit Sahu (2021). Communication Graphs [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-communication/discussion
    Explore at:
    zip (66715371 bytes)
    Dataset updated
    Nov 15, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    email-EuAll: EU email communication network

    The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.

    Given a set of email messages, each node corresponds to an email address. We create a directed edge between nodes i and j, if i sent at least one message to j.
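    The edge rule above can be sketched in a few lines of Python (toy sender/recipient pairs for illustration, not the actual dataset):

```python
# Toy illustration of the edge rule: one directed edge (i, j) per ordered
# pair of addresses, no matter how many messages i sent to j.
messages = [("alice", "bob"), ("alice", "bob"), ("bob", "carol")]

edges = {(sender, recipient) for sender, recipient in messages}
nodes = {address for edge in edges for address in edge}

print(len(nodes), len(edges))  # 3 nodes, 2 directed edges
```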

    email-Enron: Enron email network

    Enron email communication network covers all the email communication within a dataset of around half million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and if an address i sent at least one email to address j, the graph contains an undirected edge from i to j. Note that non-Enron email addresses act as sinks and sources in the network as we only observe their communication with the Enron email addresses.

    The Enron email data was originally released by William Cohen at CMU.

    wiki-Talk: Wikipedia Talk network

    Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page, that she and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of Wikipedia page edit history (from January 3 2008) we extracted all user talk page changes and created a network.

    The network contains all the users and discussion from the inception of Wikipedia till January 2008. Nodes in the network represent Wikipedia users and a directed edge from node i to node j represents that user i at least once edited a talk page of user j.

    comm-f2f-Resistance: Dynamic Face-to-Face Interaction Networks

    The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45--60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.

    The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).

    Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general purpose STL (Standard Template Library)-like library developed at the Jozef Stefan Institute. SNAP and GLib are actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#email

  2. Data from: A Hybrid Matheuristic for the Spread of Influence on Social Networks - Complementary Data

    • data.mendeley.com
    Updated Nov 11, 2024
    Cite
    Felipe Pereira (2024). A Hybrid Matheuristic for the Spread of Influence on Social Networks - Complementary Data [Dataset]. http://doi.org/10.17632/f4kyk7vkst.1
    Explore at:
    Dataset updated
    Nov 11, 2024
    Authors
    Felipe Pereira
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset contains complementary data to the paper "A Hybrid Matheuristic for the Spread of Influence on Social Networks" [1], which proposes a matheuristic for combinatorial optimization problems involving the spread of information in social networks.

    For the computational experiments discussed in that paper, we provide:

    • Two sets of instances, originally obtained from [2-6];
    • The solutions attained by exact and heuristic methods;
    • The collected results;
    • The matheuristic source code.

    The directories "benchmark_*/instances/" contain files that describe the sets of instances; each instance is associated with a graph.

    The directories "benchmark_*/solutions_*/" contain files describing feasible solutions for the corresponding sets of instances.

    The first line of each file contains the number of vertices in the target set; each of the following lines describes one vertex of the target set, and the last line contains an integer that represents the target set cost.

    The directory "hmf_source_code/" contains an implementation of the matheuristic framework proposed in [1], namely, HMF.

    This work was supported by grants from Santander Bank, the Brazilian National Council for Scientific and Technological Development (CNPq), the São Paulo Research Foundation (FAPESP), the Fund for Support to Teaching, Research and Outreach Activities (FAEPEX), and the Coordination for the Improvement of Higher Education Personnel (CAPES), all in Brazil.

    Caveat: The opinions, hypotheses and conclusions or recommendations expressed in this material are the sole responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, FAPESP, FAEPEX, or CAPES.

    References

    [1] F. C. Pereira, P. J. de Rezende, and T. Yunes. A Hybrid Matheuristic for the Spread of Influence on Social Networks. 2024. Submitted.

    [2] S. Raghavan and R. Zhang. A branch-and-cut approach for the weighted target set selection problem on social networks. 2024. https://doi.org/10.1287/ijoo.2019.0012

    [3] J. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. 2024. https://snap.stanford.edu/data

    [4] R. A. Rossi and N. K. Ahmed. The Network Data Repository with Interactive Graph Analytics and Visualization. 2022. https://networkrepository.com

    [5] J. Kunegis. KONECT – The Koblenz Network Collection. 2013. http://dl.acm.org/citation.cfm?id=2488173

    [6] O. Lesser, L. Tenenboim-Chekina, L. Rokach, and Y. Elovici. Intruder or Welcome Friend: Inferring Group Membership in Online Social Networks. 2013. https://doi.org/10.1007/978-3-642-37210-0_40

  3. A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs - Complementary Data

    • data.mendeley.com
    Updated Nov 11, 2024
    Cite
    Felipe Pereira (2024). A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs - Complementary Data [Dataset]. http://doi.org/10.17632/c95hp3m4mz.2
    Explore at:
    Dataset updated
    Nov 11, 2024
    Authors
    Felipe Pereira
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset contains complementary data to the paper "A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs" [1], which proposes an exact algorithm for the Graph Burning Problem, an NP-hard optimization problem that models a form of contagion diffusion on social networks.

    Concerning the computational experiments discussed in that paper, we make available:

    • Four sets of instances;
    • The optimal (or best known) solutions obtained;
    • The source code;
    • An Appendix with additional details about the results.

    The "delta" input sets include graphs that are real-world networks [1,2], while the "grid" input set contains graphs that are square grids.

    The directories "delta_10K_instances", "delta_100K_instances", "delta_4M_instances" and "grid_instances" contain files that describe the sets of instances.

    The directories "delta_10K_solutions", "delta_100K_solutions", "delta_4M_solutions" and "grid_solutions" contain files that describe the optimal (or best known) solutions for the corresponding sets of instances.

    The first line of each file contains the number of vertices in the burning sequence; each of the following lines describes one vertex of the sequence.

    The directory "source_code" contains the implementation of the exact algorithm proposed in [1], namely, PRYM.

    Lastly, the file "appendix.pdf" presents additional details on the results reported in the paper.

    This work was supported by grants from Santander Bank, the Brazilian National Council for Scientific and Technological Development (CNPq), the São Paulo Research Foundation (FAPESP), and the Fund for Support to Teaching, Research and Outreach Activities (FAEPEX), all in Brazil.

    Caveat: The opinions, hypotheses and conclusions or recommendations expressed in this material are the sole responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, FAPESP or FAEPEX.

    References

    [1] F. C. Pereira, P. J. de Rezende, T. Yunes and L. F. B. Morato. A Row Generation Algorithm for Finding Optimal Burning Sequences of Large Graphs. Submitted. 2024.

    [2] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. 2024. https://snap.stanford.edu/data

    [3] Ryan A. Rossi and Nesreen K. Ahmed. The Network Data Repository with Interactive Graph Analytics and Visualization. In: AAAI, 2022. https://networkrepository.com

  4. Email Networks (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Email Networks (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-email
    Explore at:
    zip (4271412 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EU email communication network

    Dataset information

    The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.

    Given a set of email messages, each node corresponds to an email address. We create a directed edge between nodes i and j, if i sent at least one message to j.
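    A minimal sketch of reading such a gzipped SNAP edge list, assuming the usual SNAP layout ('#'-prefixed comment lines, one whitespace-separated node pair per line); the in-memory byte string below stands in for the real file:

```python
import gzip

# Stand-in for email-EuAll.txt.gz: comment header plus tab-separated pairs.
sample = gzip.compress(b"# FromNodeId\tToNodeId\n0\t1\n0\t2\n1\t2\n")

edges = [tuple(map(int, line.split()))
         for line in gzip.decompress(sample).decode().splitlines()
         if not line.startswith("#")]

print(edges)  # [(0, 1), (0, 2), (1, 2)]
```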

    Dataset statistics

    Nodes 265214
    Edges 420045
    Nodes in largest WCC 224832 (0.848)
    Edges in largest WCC 395270 (0.941)
    Nodes in largest SCC 34203 (0.129)
    Edges in largest SCC 151930 (0.362)
    Average clustering coefficient 0.3093
    Number of triangles 267313
    Fraction of closed triangles 0.004106
    Diameter (longest shortest path) 13
    90-percentile effective diameter 4.5

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007.

    Files
    File Description
    email-EuAll.txt.gz Email network of a large European Research Institution

    Enron email network

    Dataset information

    Enron email communication network covers all the email communication within a dataset of around half million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and if an address i sent at least one email to address j, the graph contains an undirected edge from i to j. Note that non-Enron email addresses act as sinks and sources in the network as we only observe their communication with the Enron email addresses.

    The Enron email data was originally released by William Cohen at CMU.

    Dataset statistics
    Nodes 36692
    Edges 367662
    Nodes in largest WCC 33696 (0.918)
    Edges in largest WCC 361622 (0.984)
    Nodes in largest...

  5. Citation Graphs

    • kaggle.com
    zip
    Updated Nov 13, 2021
    Cite
    Subhajit Sahu (2021). Citation Graphs [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-citation/data
    Explore at:
    zip (111812120 bytes)
    Dataset updated
    Nov 13, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the e-print arXiv and covers all the citations within a dataset of 34,546 papers with 421,578 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-PH section.

    The data was originally released as a part of 2003 KDD Cup.
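    With edges oriented i -> j for "paper i cites paper j", a paper's citation count is simply its in-degree, e.g.:

```python
from collections import Counter

# Toy citation edges (i, j): paper i cites paper j.
cites = [(1, 2), (1, 3), (2, 3)]

# In-degree = number of times each paper is cited.
in_degree = Counter(cited for _, cited in cites)

print(in_degree[3])  # paper 3 is cited twice
```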

    Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.

    The data was originally released as a part of 2003 KDD Cup.

    U.S. patent dataset is maintained by the National Bureau of Economic Research. The data set spans 37 years (January 1, 1963 to December 30, 1999), and includes all the utility patents granted during that period, totaling 3,923,922 patents. The citation graph includes all citations made by patents granted between 1975 and 1999, totaling 16,522,438 citations. For the patents dataset there are 1,803,511 nodes for which we have no information about their citations (we only have the in-links).

    The data was originally released by NBER.

    Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general purpose STL (Standard Template Library)-like library developed at the Jozef Stefan Institute. SNAP and GLib are actively developed and used in numerous academic and industrial projects.

    https://snap.stanford.edu/data/index.html

  6. FCP dataset for forecasting temperature, PV, price, and load

    • irr.singaporetech.edu.sg
    zip
    Updated Aug 1, 2025
    Cite
    Hanwen Zhang; Wei Zhang (2025). FCP dataset for forecasting temperature, PV, price, and load [Dataset]. http://doi.org/10.25447/sit.29755640.v1
    Explore at:
    zip
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Singapore Institute of Technology
    Authors
    Hanwen Zhang; Wei Zhang
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Singapore aims to transform into a green and sustainable city by 2030. One of the key actions is to phase out Internal Combustion Engine (ICE) vehicles and significantly expand electric vehicle (EV) adoption. An EV is powered by electricity generated from natural gas and renewables, so its average carbon emission is only half that of an ICE vehicle powered by petrol and diesel. By 2030, Singapore will cease new registration of ICE cars, and eventually all vehicles will run on clean energy by 2040. With the massive expansion of EVs in the foreseeable future, EV charger installation must match the trend: at least 60,000 EV chargers will be deployed by 2030, roughly five EVs per charger. The ambition is considerable, and the EV and charger network system is expected to become enormous soon.

    The rapid expansion of the system comes with requirements for advanced management and operation. However, today's EV chargers cannot satisfy these requirements well. The key issue is that EV chargers are not smart. Largely due to cost considerations, an EV charger is more of an electricity transforming and delivery unit than a computation-driven intelligent module, and computational resources are often missing or minimal in existing chargers. Besides computing resources, EV chargers also lack sufficient connectivity: data transmission is mostly cable-based, and only relatively advanced chargers support wireless connectivity like 4G and Wi-Fi. Lack of intelligence and data scarcity might be acceptable for early-stage, small-scale deployment, but for a large-scale system the potential consequences include poor management, inferior scheduling, economic loss, weakened reliability, and so on.

    In this project, we propose to empower EV chargers with 5G capabilities for connectivity and computing and to bring smartness and intelligence into them. 5G is fast, so high-resolution EV charger data can be accessed in real time with minimal delay. 5G supports high concurrency, so a large number of EV chargers can utilize the connectivity without being forced into sequential access to avoid conflicts and long delays. 5G has great bandwidth, so abundant information from EV chargers and associated facilities like battery energy storage systems (BESS) and solar panels can be transmitted. 5G is also ultra-reliable with low latency, which makes it suitable for mission-critical functionality and time-sensitive control. Overall, 5G connectivity addresses the key challenge of data scarcity in current chargers and facilitates data-driven system monitoring and intelligent management. Beyond connectivity, 5G also features edge computing, with edge servers integrated into 5G networks, so data can be processed and analyzed on edge servers whose computing resources enable insight and knowledge extraction for intelligent EV charging management.

    To achieve the overall goal of a 5G-powered intelligent EV charging system, our research has the following key objectives:

    • To design and develop 5G-based data processing and analytics systems and interfaces for data acquisition, transmission, storage, management, and analytics.
    • To design and develop data-driven algorithms for accurate and reliable charging supply-demand forecasting and cost-optimal scheduling with large-volume and high-resolution data.
    • To implement a prototype and demonstrate the system's effectiveness utilizing the facilities of SIT's Future Communications Translation Lab (FCTLab) and our EV sector industry partner.

    Upon successful demonstration, our industry partner plans to commercialize the solutions and deploy them in the company's EV charging system for widespread adoption. Our research tackles the urgent challenges of lacking connectivity, and hence data-driven intelligence, in the current EV charging management industry. We leverage the 5G capabilities of connectivity and computation to promote data availability and analytics. We believe our research is promising, with strong support from both academia and industry, and has significant impact on upgrading the EV and mobility industry, with great potential for economic and sustainability benefits.

    Data sources:

    1. Weather dataset: https://www.visualcrossing.com/
    2. PV dataset: https://purl.stanford.edu/fb002mq9407
    3. Price dataset: https://www.nems.emcsg.com/nems-prices
    4. EV charging demand dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NFPQLW
    5. EV charging demand dataset: https://data.cityofpaloalto.org/dataviews/257812/electric-vehiclecharging-station-usage-july-2011-dec-2020/

  7. OGBN-Proteins (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-Proteins (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbn-proteins
    Explore at:
    zip (677947148 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OGBN-Proteins

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins

    Usage in Python

    import os.path as osp

    import datatable as dt  # dt.fread is used below for fast CSV reading
    import pandas as pd
    import torch
    import torch_geometric.transforms as T
    from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
    from ogb.nodeproppred import PygNodePropPredDataset

    class PygOgbnProteins(PygNodePropPredDataset):
      def __init__(self, meta_csv = None):
        root, name, transform = '/kaggle/input', 'ogbn-proteins', T.ToSparseTensor()
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, meta_dict = meta_dict)

      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        if osp.isfile(osp.join(path, 'split_dict.pt')):
          return torch.load(osp.join(path, 'split_dict.pt'))
        if self.is_hetero:
          train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
          for nodetype in train_idx_dict:
            train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
            valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
            test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
          # Return only after every node type has been converted.
          return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
          train_idx = dt.fread(osp.join(path, 'train.csv'), header = None).to_numpy().T[0]
          train_idx = torch.from_numpy(train_idx).to(torch.long)
          valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = None).to_numpy().T[0]
          valid_idx = torch.from_numpy(valid_idx).to(torch.long)
          test_idx = dt.fread(osp.join(path, 'test.csv'), header = None).to_numpy().T[0]
          test_idx = torch.from_numpy(test_idx).to(torch.long)
          return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}

    dataset = PygOgbnProteins()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0]  # PyG Graph object


    Description

    Graph: The ogbn-proteins dataset is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression or homology [1,2]. All edges come with 8-dimensional features, where each dimension represents the strength of a single association type and takes values between 0 and 1 (the larger the value is, the stronger the association is). The proteins come from 8 species.

    Prediction task: The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. The performance is measured by the average of ROC-AUC scores across the 112 tasks.
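    The metric is the unweighted mean of the 112 per-task ROC-AUC scores. A stdlib-only sketch using the rank-sum identity for AUC (two toy tasks standing in for the 112):

```python
def roc_auc(y_true, scores):
    # Probability that a random positive outranks a random negative
    # (ties count half) -- the Mann-Whitney formulation of ROC-AUC.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two toy label/score columns standing in for the 112 tasks.
tasks = [([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
         ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7])]
mean_auc = sum(roc_auc(y, s) for y, s in tasks) / len(tasks)

print(mean_auc)  # 0.875
```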

    Dataset splitting: The authors split the protein nodes into training/validation/test sets according to the species which the proteins come from. This enables the evaluation of the generalization performance of the model across different species.

    Note: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.

    Summary

    Package ogb>=1.1.1
    #Nodes 132,534
    #Edges 39,561,252
    Split Type Species
    Task Type Multi-label binary classification
    Metric ROC-AUC

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.

    [2] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.

    [3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchm...

  8. Data from: Clinical Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset
    Explore at:
    zip (16220 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Mohamadreza Momeni
    Description

    The purest type of electronic clinical data is that obtained at the point of care at a medical facility, hospital, clinic or practice. Often referred to as the electronic medical record (EMR), the EMR is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

    Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

    About Dataset:

    333 scholarly articles cite this dataset.

    Unique identifier: DOI

    Dataset updated: 2023

    Authors: Haoyang Mi

    This dataset contains two cohorts:

    1- Clinical Data_Discovery_Cohort. Columns: Patient ID, Specimen date, Dead or Alive, Date of Death, Date of last Follow, Sex, Race, Stage, Event, Time

    2- Clinical_Data_Validation_Cohort. Columns: Patient ID, Survival time (days), Event, Tumor size, Grade, Stage, Age, Sex, Cigarette Pack per year, Type, Adjuvant, Batch, EGFR, KRAS

    Feel free to share your thoughts and analysis in a notebook for these datasets; you can create some interesting and valuable ML projects for this case. Thanks for your attention.

  9. ScisummNet Corpus

    • kaggle.com
    zip
    Updated Sep 3, 2021
    Cite
    Jawakar (2021). ScisummNet Corpus [Dataset]. https://www.kaggle.com/datasets/jawakar/scisummnet-corpus/discussion
    Explore at:
    Available download formats: zip (9655883 bytes)
    Dataset updated
    Sep 3, 2021
    Authors
    Jawakar
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    This large corpus can be used to train scientific paper summarization models that utilize citations, facilitating research in supervised methods.

    Previous datasets for scientific document summarization were small, containing only a few dozen articles each. This dataset includes 1,000 examples, which is much larger than prior work.

    Content

    I acquired this dataset from here in XML format. The CL-Scisumm project developed the first large-scale, human-annotated Scisumm dataset, ScisummNet. It provides over 1,000 papers in the ACL anthology network with their citation networks (e.g. citation sentences, citation counts) and their comprehensive, manual summaries.

    The text column has every token of the research paper, and the summary column consists of summaries of the scientific paper.
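    As an illustration of how the text and summary columns pair up, here is a hedged sketch with a mocked-up row and a naive lead-n extractive baseline. The data values are invented; only the column names come from the description above.

```python
import pandas as pd

# Hypothetical sketch: one mocked row shaped like the corpus.
# The "text" and "summary" column names are from the dataset description.
corpus = pd.DataFrame(
    {
        "text": ["First sentence. Second sentence. Third sentence. Fourth."],
        "summary": ["A manual summary."],
    }
)

def lead_n(document: str, n: int = 2) -> str:
    """Naive extractive baseline: keep the first n sentences."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return ". ".join(sentences[:n]) + "."

# Compare against the reference summaries with ROUGE in a real evaluation.
corpus["baseline"] = corpus["text"].apply(lead_n)
print(corpus["baseline"][0])  # First sentence. Second sentence.
```

    A lead-n baseline like this is the usual sanity check before training a neural summarizer on the corpus.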

    Acknowledgements

    This dataset was made possible by the CL-Scisumm shared task, which has been organized since 2014 for papers in the computational linguistics and NLP domain.

    Inspiration

    Train state-of-the-art models on this dataset and try to outperform the baseline model proposed with ScisummNet.


Subhajit Sahu (2021). Communication Graphs [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-communication/discussion

Communication Graphs

Communication networks from the Stanford Network Analysis Platform (SNAP)

Explore at:
Available download formats: zip (66715371 bytes)
Dataset updated
Nov 15, 2021
Authors
Subhajit Sahu
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

email-EuAll: EU email communication network

The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months), we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender, and the recipient. Overall we have 3,038,531 emails among 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses, those belonging to the research institution. Furthermore, 34,203 email addresses both sent and received email within the span of our dataset. All other email addresses are either nonexistent, mistyped, or spam.

Given a set of email messages, each node corresponds to an email address. We create a directed edge from node i to node j if i sent at least one message to j.
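The edge rule above can be sketched in a few lines: collapse repeated (sender, recipient) message pairs into a set of distinct directed edges. The addresses below are illustrative, not real entries from the dataset.

```python
from collections import defaultdict

# One directed edge i -> j whenever i sent at least one message to j.
# Message tuples are illustrative stand-ins for the anonymized email log.
messages = [
    ("alice@inst.eu", "bob@inst.eu"),
    ("alice@inst.eu", "bob@inst.eu"),   # duplicate messages collapse to one edge
    ("bob@inst.eu", "carol@inst.eu"),
]

graph = defaultdict(set)  # adjacency: sender -> set of recipients
for sender, recipient in messages:
    graph[sender].add(recipient)

edge_count = sum(len(out) for out in graph.values())
print(edge_count)  # 2 distinct directed edges
```

Using a set per node makes the "at least one message" rule automatic: multiplicity is discarded, direction is kept.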

email-Enron: Enron email network

The Enron email communication network covers all the email communication within a dataset of around half a million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses, and if an address i sent at least one email to address j, the graph contains an undirected edge between i and j. Note that non-Enron email addresses act as sinks and sources in the network, since we only observe their communication with Enron email addresses.

The Enron email data was originally released by William Cohen at CMU.

wiki-Talk: Wikipedia Talk network

Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page that they and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of Wikipedia page edit history (from January 3, 2008), we extracted all user talk page changes and created a network.

The network contains all the users and discussions from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j indicates that user i edited a talk page of user j at least once.

comm-f2f-Resistance: Dynamic Face-to-Face Interaction Networks

The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior with the DeceptionRank algorithm.

The networks are weighted, directed, and temporal. Each node represents a participant. Every 1/3 of a second, a directed edge from node u to node v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we provide a binary version in which an edge from u to v indicates that participant u looks at participant v (or the laptop).
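A minimal sketch of working with this temporal, weighted format, assuming one record per frame per ordered participant pair (the field layout and values here are illustrative, not the dataset's actual schema): aggregate the per-frame weights into a static weighted graph by averaging over time.

```python
from collections import defaultdict

# Illustrative temporal records: one per 1/3-second frame per ordered pair.
# (frame_index, source, target, weight = P(source looks at target))
records = [
    (0, "u1", "u2", 0.8),
    (0, "u1", "u3", 0.1),
    (1, "u1", "u2", 0.6),
]

# Collapse the temporal edges into a static weighted digraph by
# averaging each ordered pair's weight across frames.
totals = defaultdict(float)
counts = defaultdict(int)
for _, u, v, w in records:
    totals[(u, v)] += w
    counts[(u, v)] += 1

avg_weight = {pair: totals[pair] / counts[pair] for pair in totals}
print(avg_weight[("u1", "u2")])  # 0.7
```

Thresholding `avg_weight` would recover something like the binary version of the networks described above.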

Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for the analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on the nodes and/or edges of the network.

The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges, and attributes in a graph or network can be changed dynamically during the computation.
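SNAP distributes its datasets as plain-text edge lists: '#'-prefixed comment lines followed by one source-target node pair per line. As a small sketch, out-degrees (a basic structural property) can be computed from such a file with the standard library alone; the rows below stand in for a downloaded file.

```python
from collections import Counter
import io

# In-memory sample in SNAP's edge-list format; rows are illustrative,
# not real edges from email-EuAll. Swap in open("email-EuAll.txt") for
# a real download.
sample = io.StringIO(
    "# Directed graph: email-EuAll sample (illustrative rows only)\n"
    "0\t1\n"
    "0\t2\n"
    "1\t2\n"
)

out_degree = Counter()
for line in sample:
    if line.startswith("#"):
        continue  # skip SNAP header/comment lines
    src, dst = line.split()
    out_degree[src] += 1

print(out_degree.most_common(1))  # node '0' has out-degree 2
```

Streaming the file line by line keeps memory proportional to the number of nodes rather than edges, which matters at SNAP's dataset sizes.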

SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general-purpose STL (Standard Template Library)-like library, GLib, developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

http://snap.stanford.edu/data/index.html#email
