100+ datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    + more versions
    Cite
    Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contain actual data; they are provided for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • The common identifier dataset_30 indicates that both files belong to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    UniProt ID         | string | protein identification
    label              | string | protein label (type of node)
    properties         | string | a dictionary containing properties related to the protein

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    Relationship ID    | string | relationship identification
    Source ID          | string | identification of the source protein in the relationship
    Target ID          | string | identification of the target protein in the relationship
    label              | string | relationship label (type of relationship)
    properties         | string | a dictionary containing properties related to the relationship
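
    A minimal sketch of loading one such CSV pair into a graph object (assuming plain comma-separated files and the exact column names listed above; pandas and networkx are illustrative choices, not part of the dataset):

    ```python
    import pandas as pd
    import networkx as nx

    # Load one node/edge pair; the dataset_30 pair should yield 30 nodes and 47 edges.
    nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
    edges = pd.read_csv("dataset_30_edges_interactions.csv")

    G = nx.Graph()
    for _, row in nodes.iterrows():
        G.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])
    for _, row in edges.iterrows():
        G.add_edge(row["Source ID"], row["Target ID"],
                   id=row["Relationship ID"], label=row["label"],
                   properties=row["properties"])

    print(G.number_of_nodes(), G.number_of_edges())
    ```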

    Metadata

    Graph          | Number of Nodes | Number of Edges | Sparse graph
    dataset_30*    | 30              | 47              | Y
    dataset_60*    | 60              | 181             | Y
    dataset_120*   | 120             | 689             | Y
    dataset_240*   | 240             | 2819            | Y
    dataset_300*   | 300             | 4658            | Y
    dataset_600*   | 600             | 18004           | Y
    dataset_1200*  | 1200            | 71785           | Y
    dataset_2400*  | 2400            | 288600          | Y
    dataset_3000*  | 3000            | 449727          | Y
    dataset_6000*  | 6000            | 1799413         | Y
    dataset_12000* | 12000           | 7199863         | Y
    dataset_24000* | 24000           | 28792361        | Y
    dataset_30000* | 30000           | 44991744        | Y

    This repository includes two additional tiny graph datasets for experimentation before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    ID                 | string | node identification
    label              | string | node label (type of node)
    properties         | string | a dictionary containing properties related to the node

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    ID                 | string | relationship identification
    source             | string | identification of the source node in the relationship
    target             | string | identification of the target node in the relationship
    label              | string | relationship label (type of relationship)
    properties         | string | a dictionary containing properties related to the relationship

    Metadata (tiny graphs)

    Graph           | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy*  | 3               | 6               | N
    dataset_dummy2* | 3               | 6               | N
  2. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Jun 4, 2024
    + more versions
    Cite
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315
    Explore at:
    Available download formats: application/gzip, bin, txt
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha
    Time period covered
    May 29, 2024
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    • Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.
    • Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set covers 76 distinct target companies, each of which has 5.3 competitors annotated on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.
    • Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.
    • Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customers' review, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies similar to the seed companies can then be searched for in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. Models such as ChatGPT can be employed to answer questions related to similar-company discovery and quantification in a Q&A format.
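
    As an illustration of that retrieval step, the sketch below ranks companies by cosine similarity to a seed company's embedding (the toy embedding matrix is a placeholder; CompanyKG distributes its precomputed node embeddings in its own files, so loading will differ):

    ```python
    import numpy as np

    def top_k_similar(embeddings: np.ndarray, seed_idx: int, k: int = 5):
        """Rank all rows of `embeddings` by cosine similarity to row `seed_idx`."""
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / np.clip(norms, 1e-12, None)   # normalize rows
        sims = unit @ unit[seed_idx]                      # cosine similarities
        order = np.argsort(-sims)
        return [(int(i), float(sims[i])) for i in order if i != seed_idx][:k]

    # Toy example: 4 companies in a 3-dimensional embedding space.
    emb = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.9, 0.1]])
    print(top_k_similar(emb, seed_idx=0, k=2))  # nearest neighbours of company 0
    ```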

    However, a graph is still the most natural choice for representing and learning diverse company relations, due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns that facilitate similar-company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial:
    https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  3. Data from: Negative Sampling for Learning Knowledge Graph Embeddings

    • heidata.uni-heidelberg.de
    zip
    Updated Sep 12, 2019
    + more versions
    Cite
    Bhushan Kotnis (2019). Negative Sampling for Learning Knowledge Graph Embeddings [Dataset]. http://doi.org/10.11588/DATA/YYULL2
    Explore at:
    Available download formats: zip (19883)
    Dataset updated
    Sep 12, 2019
    Dataset provided by
    heiDATA
    Authors
    Bhushan Kotnis
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/YYULL2

    Description

    Reimplementation of four KG factorization methods and six negative sampling methods.

    Abstract: Knowledge graphs are large, useful, but incomplete knowledge repositories. They encode knowledge through entities and relations which define each other through the connective structure of the graph. This has inspired methods for the joint embedding of entities and relations in continuous low-dimensional vector spaces, which can be used to induce new edges in the graph, i.e., link prediction in knowledge graphs. Learning these representations relies on contrasting positive instances with negative ones. Knowledge graphs include only positive relation instances, leaving the door open for a variety of methods for selecting negative examples. In this paper we present an empirical study on the impact of negative sampling on the learned embeddings, assessed through the task of link prediction. We use state-of-the-art knowledge graph embeddings (RESCAL, TransE, DistMult and ComplEx) and evaluate on benchmark datasets (FB15k and WN18). We compare well-known methods for negative sampling and additionally propose embedding-based sampling methods. We note a marked difference in the impact of these sampling methods on the two datasets, with the "traditional" corrupting-positives method leading to the best results on WN18, while embedding-based methods benefit the task on FB15k.
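
    For reference, the "traditional" corrupting-positives baseline mentioned in the abstract replaces the head or the tail of a true triple with a random entity. A minimal sketch (entity counts and IDs are illustrative; real implementations typically also filter out corruptions that happen to be true triples):

    ```python
    import random

    def corrupt_positive(triple, num_entities, rng=random):
        """Replace the head or the tail of (head, relation, tail) with a random entity."""
        h, r, t = triple
        if rng.random() < 0.5:
            h = rng.randrange(num_entities)   # corrupt the head
        else:
            t = rng.randrange(num_entities)   # corrupt the tail
        return (h, r, t)

    positive = (0, 5, 7)                      # integer-encoded (head, relation, tail)
    negative = corrupt_positive(positive, num_entities=14951)  # FB15k has ~15k entities
    print(negative)
    ```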

  4. Parameters of the sample used in the random planar graphs test.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    José Antonio Martín H. (2023). Parameters of the sample used in the random planar graphs test. [Dataset]. http://doi.org/10.1371/journal.pone.0053437.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    José Antonio Martín H.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Uniformly random planar graph instances from 100 to 1000 vertices, incremented by 100, with 1000 graphs generated for each number of vertices, i.e., 10000 graphs in total.

  5. The global Graph Analytics market size is USD 2522 million in 2024 and will...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Cite
    Cognitive Market Research, The global Graph Analytics market size is USD 2522 million in 2024 and will expand at a compound annual growth rate (CAGR) of 34.0% from 2024 to 2031. [Dataset]. https://www.cognitivemarketresearch.com/graph-analytics-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global Graph Analytics market size will be USD 2522 million in 2024 and will expand at a compound annual growth rate (CAGR) of 34.0% from 2024 to 2031.

    Market Dynamics of Graph Analytics Market

    Key Drivers for Graph Analytics Market

    Increasing Recognition of the Advantages of Graph Databases- One of the main reasons for the Graph Analytics market is the increasing recognition of the advantages of graph databases. Unlike traditional relational databases, graph databases excel at handling complex relationships and interconnected data, making them ideal for use cases such as fraud detection, recommendation engines, and social network analysis. Businesses are leveraging these capabilities to uncover insights and patterns that were previously difficult to detect. The rise of big data and the need for real-time analytics are further driving the adoption of graph databases, as they offer enhanced performance and scalability for large-scale data sets. Additionally, advancements in artificial intelligence and machine learning are amplifying the value of graph databases, enabling more sophisticated data modeling and predictive analytics.
    Growing Uptake of Big Data Tools to Drive the Graph Analytics Market's Expansion in the Years Ahead.

    Key Restraints for Graph Analytics Market

    Limited Awareness and Understanding pose a serious threat to the Graph Analytics industry.
    The market also faces significant difficulties related to data security and privacy.

    Introduction of the Graph Analytics Market

    The Graph Analytics Market is rapidly expanding, driven by the growing need for advanced data analysis techniques in various sectors. Graph analytics leverages graph structures to represent and analyze relationships and dependencies, providing deeper insights than traditional data analysis methods. Key factors propelling this market include the rise of big data, the increasing adoption of artificial intelligence and machine learning, and the demand for real-time data processing. Industries such as finance, healthcare, telecommunications, and retail are major contributors, utilizing graph analytics for fraud detection, personalized recommendations, network optimization, and more. Leading vendors are continually innovating to offer scalable, efficient solutions, incorporating advanced features like graph databases and visualization tools.

  6. SCG Dataset

    • paperswithcode.com
    Updated Nov 12, 2024
    + more versions
    Cite
    SCG Dataset [Dataset]. https://paperswithcode.com/dataset/scg
    Explore at:
    Dataset updated
    Nov 12, 2024
    Description

    Abstract: Graph Neural Networks (GNNs) have recently gained traction in transportation, bioinformatics, language and image processing, but research on their application to supply chain management remains limited. Supply chains are inherently graph-like, making them ideal for GNN methodologies, which can optimize and solve complex problems. The barriers include a lack of proper conceptual foundations, familiarity with graph applications in SCM, and real-world benchmark datasets for GNN-based supply chain research. To address this, we discuss and connect supply chains with graph structures for effective GNN application, providing detailed formulations, examples, mathematical definitions, and task guidelines. Additionally, we present a multi-perspective real-world benchmark dataset from a leading FMCG company in Bangladesh, focusing on supply chain planning. We discuss various supply chain tasks using GNNs and benchmark several state-of-the-art models on homogeneous and heterogeneous graphs across six supply chain analytics tasks. Our analysis shows that GNN-based models consistently outperform statistical ML and other deep learning models by around 10-30% in regression, 10-30% in classification and detection tasks, and 15-40% in anomaly detection tasks on designated metrics. With this work, we lay the groundwork for solving supply chain problems using GNNs, supported by conceptual discussions, methodological insights, and a comprehensive dataset.

  7. Color-connectivity Dataset

    • paperswithcode.com
    Updated Jul 14, 2021
    + more versions
    Cite
    Color-connectivity Dataset [Dataset]. https://paperswithcode.com/dataset/color-connectivity
    Explore at:
    Dataset updated
    Jul 14, 2021
    Authors
    Ladislav Rampášek; Guy Wolf
    Description

    Synthetic graph classification datasets with the task of recognizing the connectivity of same-colored nodes in 4 graphs of varying topology.

    The four Color-connectivity datasets were created by taking a graph and randomly coloring half of its nodes one color, e.g., red, and the other nodes blue, such that the red nodes either form a single connected island or two disjoint islands. The binary classification task is then to distinguish between these two cases. The node colorings were sampled by running two red-coloring random walks starting from two random nodes. For the underlying graph topology we used: 1) a 16x16 2D grid, 2) a 32x32 2D grid, 3) the Euroroad road network (Šubelj et al. 2011), and 4) the Minnesota road network. We sampled a balanced set of 15,000 coloring examples for each graph, except for the Minnesota network, for which we generated 6,000 examples due to memory constraints. The Color-connectivity task requires a combination of local and long-range graph information processing, to which most existing message-passing Graph Neural Networks (GNNs) do not scale. These datasets can serve as a common-sense validation for new and more powerful GNN methods. These testbed datasets can still be improved, as the node features are minimal (only a binary color) and recognition of particular topological patterns (e.g., rings or other subgraphs) is not needed to solve the task.
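
    A rough sketch of this generation procedure on the smallest topology (networkx for the grid and the connectivity check; the walk scheme follows the description above, though the authors' exact stopping rule is not specified):

    ```python
    import random
    import networkx as nx

    def sample_coloring(G, rng=random):
        """Color half the nodes red via two random walks; label = 1 if red is one island."""
        nodes = list(G.nodes)
        target = len(nodes) // 2
        red, walkers = set(), [rng.choice(nodes), rng.choice(nodes)]
        while len(red) < target:
            for i, w in enumerate(walkers):
                red.add(w)
                walkers[i] = rng.choice(list(G.neighbors(w)))  # step the walk
                if len(red) >= target:
                    break
        return red, int(nx.is_connected(G.subgraph(red)))

    G = nx.grid_2d_graph(16, 16)   # the 16x16 grid topology
    red, label = sample_coloring(G)
    print(len(red), label)
    ```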

  8. Graph topological features extracted from expression profiles of...

    • zenodo.org
    • explore.openaire.eu
    bin, tsv
    Updated Jan 24, 2020
    Cite
    Léon-Charles Tranchevent; Francisco Azuaje; Jagath C. Rajapakse (2020). Graph topological features extracted from expression profiles of neuroblastoma patients [Dataset]. http://doi.org/10.5281/zenodo.3357674
    Explore at:
    Available download formats: tsv, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Léon-Charles Tranchevent; Francisco Azuaje; Jagath C. Rajapakse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains the data described in the paper titled "A deep neural network approach to predicting clinical outcomes of neuroblastoma patients." by Tranchevent, Azuaje and Rajapakse. More precisely, this dataset contains the topological features extracted from graphs built from publicly available expression data (see details below). This dataset does not contain the original expression data, which are available elsewhere. We thank the scientists who did generate and share these data (please see below the relevant links and publications).

    Content

    File names start with the name of the publicly available dataset they are built on (among "Fischer", "Maris" and "Versteeg"). This name is followed by a tag representing whether they contain raw data ("raw", which means, in this case, the raw topological features) or TF formatted data ("TF", which stands for TensorFlow). This tag is then followed by a unique identifier representing a unique configuration. The configuration file "Global_configuration.tsv" contains details about these configurations such as which topological features are present and which clinical outcome is considered.

    The code associated with the same manuscript, which uses these data, is at https://gitlab.com/biomodlih/SingalunDeep. The procedure by which the raw data are transformed into the TensorFlow-ready data is described in the paper.

    File format

    All files are TSV files that correspond to matrices with samples as rows and features as columns (or clinical data as columns for clinical data files). The data files contain various sets of topological features that were extracted from the sample graphs (or Patient Similarity Networks - PSN). The clinical files contain relevant clinical outcomes.

    The raw data files only contain the topological data. For instance, the file "Fischer_raw_2d0000_data_tsv" contains 24 values for each sample, corresponding to the 12 centralities computed for both the microarray (Fischer-M) and RNA-seq (Fischer-R) datasets. The TensorFlow-ready files do not contain the sample identifiers in the first column. However, they contain two extra columns at the end. The first extra column holds the sample weights (for the classifiers, because we very often have a dominant class). The second extra column holds the binary class labels, based on the clinical outcome of interest.
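
    Based on that layout, a TF-ready file could be split into features, weights, and labels roughly as follows (the filename is hypothetical and the absence of a header row is an assumption; check the global configuration file first):

    ```python
    import pandas as pd

    # TF-ready layout per the description: feature columns first, then a
    # sample-weight column, then a binary class-label column; no identifier column.
    df = pd.read_csv("Fischer_TF_2d0000_train.tsv", sep="\t", header=None)

    features = df.iloc[:, :-2]   # topological features
    weights = df.iloc[:, -2]     # per-sample weights (dominant-class correction)
    labels = df.iloc[:, -1]      # binary clinical-outcome labels
    print(features.shape, labels.value_counts().to_dict())
    ```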

    Dataset details

    The Fischer dataset is used to train, evaluate and validate the models, so the dataset is split into train / eval / valid files, which contain respectively 249, 125 and 124 rows (samples) of the original 498 samples. In contrast, the other two datasets (Maris and Versteeg) are smaller and are only used for validation (and therefore have no training or evaluation file).

    The Fischer dataset also has more data files because various configurations were tested (see manuscript). In contrast, the validation using the Maris and Versteeg datasets is only done for a single configuration, and there are therefore fewer files.

    For Fischer, a few configurations are listed in the global configuration file but there is no corresponding raw data. This is because these items are derived from concatenations of the original raw data (see global configuration file and manuscript for details).

    References

    This dataset is associated with Tranchevent L., Azuaje F., Rajapakse J.C., A deep neural network approach to predicting clinical outcomes of neuroblastoma patients.

    If you use these data in your research, please do not forget to also cite the researchers who have generated the original expression datasets.

    Fischer dataset:

    • Zhang W. et al., Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16(1) (2015). doi:10.1186/s13059-015-0694-1
    • Wang C. et al., The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat. Biotechnol. 32(9), 926–932. doi:10.1038/nbt.3001

    Versteeg dataset:

    • Molenaar J.J. et al., Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483(7391), 589–593. doi:10.1038/nature10910

    Maris dataset:

    • Wang Q. et al., Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res. 66(12), 6050–6062. doi:10.1158/0008-5472.CAN-05-4618
  9. Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 29, 2023
    Cite
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies: it has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real world, but they also pose nontrivial challenges in research on embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

    Dataset Details

    The dataset consists of the four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available (a small loading sketch follows the list):

    • Subject matter triples file
      • fb+/-CVT+/-REV: one folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt. Subject matter triples are the triples belonging to subject matter domains, i.e., domains describing real-world facts.
        • Example of a row in train.txt, valid.txt, and test.txt:
          • 2, 192, 0
        • Example of a row in entity2id.txt:
          • /g/112yfy2xr, 2
        • Example of a row in relation2id.txt:
          • /music/album/release_type, 192
        • Explanation
          • "/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and the object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to these objects.
    • Type system file
      • freebase_endtypes: Each row maps an edge type to its required subject type and object type.
        • Example
          • 92, 47178872, 90
        • Explanation
          • "92" and "90" are the type id of the subject and object which has the relationship id "47178872".
    • Metadata files
      • object_types: Each row maps the MID of a Freebase object to a type it belongs to.
        • Example
          • /g/11b41c22g, /type/object/type, /people/person
        • Explanation
          • The entity with MID "/g/11b41c22g" has a type "/people/person"
      • object_names: Each row maps the MID of a Freebase object to its textual label.
        • Example
          • /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
        • Explanation
          • The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
      • object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
        • Example
          • /m/05v3y9r, /type/object/id, "/music/live_album/concert"
        • Explanation
          • The entity with MID "/m/05v3y9r" can be interpreted by human as a music concert live album.
      • domains_id_label: Each row maps the MID of a Freebase domain to its label.
        • Example
          • /m/05v4pmy, geology, 77
        • Explanation
          • The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
      • types_id_label: Each row maps the MID of a Freebase type to its label.
        • Example
          • /m/01xljxh, /government/political_party, 147
        • Explanation
          • The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
      • entities_id_label: Each row maps the MID of a Freebase entity to its label.
        • Example
          • /g/11b78qtr5m, Viroliano Tries Jazz, 2234
        • Explanation
          • The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
      • properties_id_label: Each row maps the MID of a Freebase property to its label.
        • Example
          • /m/010h8tp2, /comedy/comedy_group/members, 47178867
        • Explanation
          • The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has label "/comedy/comedy_group/members" and id "47178867" in our dataset.
      • uri_original2simplified and uri_simplified2original: The mappings between the original URIs and the simplified URIs, and vice versa, respectively.
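
    A small sketch of joining the ID files back into readable triples, following the comma-separated layouts shown in the examples above (the variant folder name is illustrative, and the actual field delimiter should be verified against the files):

    ```python
    def load_mapping(path):
        """Read an x2id file ('MID, id' per row) into a {id: MID} dictionary."""
        mapping = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                name, idx = (part.strip() for part in line.rsplit(",", 1))
                mapping[int(idx)] = name
        return mapping

    entities = load_mapping("fb-CVT-REV/entity2id.txt")
    relations = load_mapping("fb-CVT-REV/relation2id.txt")

    with open("fb-CVT-REV/train.txt", encoding="utf-8") as f:
        for line in f:
            h, r, t = (int(part.strip()) for part in line.split(","))
            print(entities[h], relations[r], entities[t])
            break  # print just the first triple
    ```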

  10. TestWUG EN: Test Word Usage Graphs for English

    • data.niaid.nih.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    Schlechtweg, Dominik (2024). TestWUG EN: Test Word Usage Graphs for English [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7900959
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset authored and provided by
    Schlechtweg, Dominik
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    This data collection contains test Word Usage Graphs (WUGs) for English. Find a description of the data format, code to process the data and further datasets on the WUGsite.

    The data is provided for testing purposes and thus contains specific data cases, which are sometimes artificially created, sometimes picked from existing data sets. The data contains the following cases:

    afternoon_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators with 427 judgments. Has clear cluster structure with only one cluster, no graded change, no binary change, and medium agreement of 0.62 Krippendorff's alpha.

    arm: standard textbook example for semantic proximity (see reference below). Fully connected graph with six word uses, annotated by the author.

    plane_nn: sampled from DWUG EN 2.0.1. 200 uses partly annotated by multiple annotators with 1152 judgments. Has clear cluster structure, high graded change, binary change, and high agreement of 0.82 Krippendorff's alpha.

    target: similar to arm, but with only three repeated sentences. Fully connected graph with 8 word uses, annotated by the author. Pairs with the same sentence (exactly the same string) are annotated with 4; pairs with different strings are annotated with 1.

    Please find more information in the paper referenced below.

    Version 1.2.0, 30.06.2023: removed the instances files, as these should be inferred from judgments when aggregating.

    Reference

    Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.

  11. Transaction Graph Dataset for the Ethereum Blockchain - Dataset - CryptoData...

    • cryptodata.center
    Updated Dec 4, 2024
    Cite
    cryptodata.center (2024). Transaction Graph Dataset for the Ethereum Blockchain - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/transaction-graph-dataset-for-the-ethereum-blockchain
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    CryptoDATA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Unique identifier: https://doi.org/10.5281/zenodo.4718440. Dataset updated: Dec 19, 2022. Dataset provided by: Zenodo. Authors: Can Özturan; Alper Şen; Baran Kılıç. License: Attribution 4.0 (CC BY 4.0); license information was derived automatically.

    Description: This dataset contains ether as well as popular ERC20 token transfer transactions extracted from the Ethereum Mainnet blockchain. Only send-ether, contract function call, and contract deployment transactions are present in the dataset. Miner reward (static block reward) and uncle block inclusion reward are added as transactions to the dataset. Transaction fee reward and uncles reward are not currently included in the dataset. Details of the dataset are given below.

    FILENAME FORMAT: The filenames have the format eth-tx-<first block>-<last block>.txt.bz2. For example, the file eth-tx-1000000-1099999.txt.bz2 contains transactions from block 1000000 to block 1099999 inclusive. The files are compressed with bzip2 and can be uncompressed using the command bunzip2.

    TRANSACTION FORMAT: Each line in a file corresponds to a transaction. The transaction has the following format: units. ERC20 token transfers (transfer and transferFrom function calls in an ERC20 contract) are indicated by token symbol; for example, GUSD is the Gemini USD stable coin. The JSON file erc20tokens.json given below contains the details of the ERC20 tokens. Failed transactions are prefixed with "F-".

    BLOCK TIME FORMAT: The block time file has the following format:

    erc20tokens.json FILE: This file contains the list of popular ERC20 token contracts whose transfer/transferFrom transactions appear in the data files. ERC20 token list: USDT TRYb XAUt BNB LEO LINK HT HEDG MKR CRO VEN INO PAX INB SNX REP MOF ZRX SXP OKB XIN OMG SAI HOT DAI EURS HPT BUSD USDC SUSD HDG QCAD PLUS BTCB WBTC cWBTC renBTC sBTC imBTC pBTC

    IMPORTANT NOTE: Public Ethereum Mainnet blockchain data is open and can be obtained by connecting as a node on the blockchain or by using block explorer web sites such as http://etherscan.io. The downloaders and users of this dataset accept full responsibility for using the data in a GDPR-compliant manner or under any other regulations. We provide the data as is and cannot be held responsible for anything.

    NOTE: If you use this dataset, please do not forget to add the DOI number to the citation. If you use our dataset in your research, please also cite our paper: https://link.springer.com/article/10.1007/s10586-021-03511-0

    @article{kilic2022parallel, title={Parallel Analysis of Ethereum Blockchain Transaction Data using Cluster Computing}, journal={Cluster Computing}, author={K{\i}l{\ı}{\c{c}}, Baran and {\"O}zturan, Can and Sen, Alper}, year={2022}, month={Jan}}

  12. Sample graphs and sequences for testing sequence-to-graph alignment

    • zenodo.org
    application/gzip, bin
    Updated Jun 6, 2022
    + more versions
    Cite
    Heng Li (2022). Sample graphs and sequences for testing sequence-to-graph alignment [Dataset]. http://doi.org/10.5281/zenodo.6056061
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Heng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    File descriptions:

    • MHC-61.agc: 61 complete MHC sequences, including GRCh38, CHM13 and 59 haplotypes extracted from HPRC year-1 assemblies. Use AGC to extract individual haplotype sequences.
    • MHC-57.gfa.gz: sequence graph constructed by minigraph-r434, excluding samples HG002 and HG005.
    • C4-96.agc: 96 complete C4 sequences obtained from HPRC.
    • C4-90.gfa.gz: sequence graph extracted from the full HPRC year-1 minigraph graph around the C4A/C4B genes. The graph takes GRCh38 as the reference and excludes samples HG002, HG005 and NA19240.
  13. Reddit Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2017
    Cite
    Reddit Dataset [Dataset]. https://paperswithcode.com/dataset/reddit
    Explore at:
    Dataset updated
    Jun 9, 2017
    Authors
    William L. Hamilton; Rex Ying; Jure Leskovec
    Description

    The Reddit dataset is a graph dataset of Reddit posts made in the month of September 2014. The node label in this case is the community, or "subreddit", that a post belongs to. 50 large communities were sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total, this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
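
    The post-to-post construction described above (link two posts whenever a common user commented on both) can be sketched as follows (the comment pairs are toy data standing in for the actual dump):

    ```python
    from collections import defaultdict
    from itertools import combinations
    import networkx as nx

    # Toy (user, post) comment pairs.
    comments = [("alice", "p1"), ("alice", "p2"), ("bob", "p2"), ("bob", "p3")]

    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)

    G = nx.Graph()
    for posts in posts_by_user.values():
        G.add_edges_from(combinations(sorted(posts), 2))  # all pairs sharing a commenter

    print(list(G.edges()))  # [('p1', 'p2'), ('p2', 'p3')]
    ```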

  14. Transaction Graph Dataset for the Bitcoin Blockchain - Part 2 of 4 - Dataset...

    • cryptodata.center
    Updated Dec 4, 2024
    + more versions
    Cite
    cryptodata.center (2024). Transaction Graph Dataset for the Bitcoin Blockchain - Part 2 of 4 - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/transaction-graph-dataset-for-the-bitcoin-blockchain-part-2-of-4
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    CryptoDATA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains bitcoin transfer transactions extracted from the Bitcoin Mainnet blockchain. Details of the dataset are given below.

    FILENAME FORMAT: The filenames have the format btc-tx-<first block>-<last block>-<part>.bz2. For example, the file btc-tx-100000-149999-aa.bz2, together with the rest of the parts if any, contains transactions from block 100000 to block 149999 inclusive. The files are compressed with bzip2 and can be uncompressed using the command bunzip2.

    TRANSACTION FORMAT: Each line in a file corresponds to a transaction. The transaction has the following format:

    BLOCK TIME FORMAT: The block time file has the following format:

    IMPORTANT NOTE: Public Bitcoin Mainnet blockchain data is open and can be obtained by connecting as a node on the blockchain or by using block explorer web sites such as https://btcscan.org. The downloaders and users of this dataset accept full responsibility for using the data in a GDPR-compliant manner or under any other regulations. We provide the data as is and cannot be held responsible for anything.

    NOTE: If you use this dataset, please do not forget to add the DOI number to the citation. If you use our dataset in your research, please also cite our paper: https://link.springer.com/chapter/10.1007/978-3-030-94590-9_14

    @incollection{kilicc2022analyzing, title={Analyzing Large-Scale Blockchain Transaction Graphs for Fraudulent Activities}, author={K{\ı}l{\ı}{\c{c}}, Baran and {\"O}zturan, Can and {\c{S}}en, Alper}, booktitle={Big Data and Artificial Intelligence in Digital Finance}, pages={253--267}, year={2022}, publisher={Springer, Cham}}

  15. Replication Data for: Sequential Monte Carlo for Sampling Balanced and...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    McCartan, Cory; Imai, Kosuke (2023). Replication Data for: Sequential Monte Carlo for Sampling Balanced and Compact Redistricting Plans [Dataset]. https://search.dataone.org/view/sha256%3A4530346effc876e9c9a2eae42dc54d709e48f5c9f444ee7e9fcc4078f0a9e195
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    McCartan, Cory; Imai, Kosuke
    Description

    Random sampling of graph partitions under constraints has become a popular tool for evaluating legislative redistricting plans. Analysts detect partisan gerrymandering by comparing a proposed redistricting plan with an ensemble of sampled alternative plans. For successful application, sampling methods must scale to maps with a moderate or large number of districts, incorporate realistic legal constraints, and accurately and efficiently sample from a selected target distribution. Unfortunately, most existing methods struggle in at least one of these areas. We present a new Sequential Monte Carlo (SMC) algorithm that generates a sample of redistricting plans converging to a realistic target distribution. Because it draws many plans in parallel, the SMC algorithm can efficiently explore the relevant space of redistricting plans better than the existing Markov chain Monte Carlo (MCMC) algorithms that generate plans sequentially. Our algorithm can simultaneously incorporate several constraints commonly imposed in real-world redistricting problems, including equal population, compactness, and preservation of administrative boundaries. We validate the accuracy of the proposed algorithm by using a small map where all redistricting plans can be enumerated. We then apply the SMC algorithm to evaluate the partisan implications of several maps submitted by relevant parties in a recent high-profile redistricting case in the state of Pennsylvania. We find that the proposed algorithm converges faster and with fewer samples than a comparable MCMC algorithm. Open-source software is available for implementing the proposed methodology.

  16. The Software Heritage Graph Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +2
    Updated Jan 24, 2020
    + more versions
    Cite
    Antoine Pietri; Diomidis Spinellis; Stefano Zacchiroli (2020). The Software Heritage Graph Dataset [Dataset]. http://doi.org/10.5281/zenodo.2583978
    Explore at:
    Available download formats: application/gzip, zip, tar, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antoine Pietri; Diomidis Spinellis; Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Heritage is the largest existing public archive of software source
    code and accompanying development history: it currently spans more than five
    billion unique source code files and one billion unique commits, coming from
    more than 80 million software projects.

    This is the Software Heritage graph dataset: a fully-deduplicated
    Merkle DAG representation of the Software Heritage archive. The dataset links
    together file content identifiers, source code directories, Version Control
    System (VCS) commits tracking evolution over time, up to the full states of VCS
    repositories as observed by Software Heritage during periodic crawls. The
    dataset’s contents come from major development forges (including GitHub and
    GitLab), FOSS distributions (e.g., Debian), and language-specific package
    managers (e.g., PyPI). Crawling information is also included, providing
    timestamps about when and where all archived source code artifacts have been
    observed in the wild.

    The Software Heritage graph dataset is available in multiple formats, including
    downloadable CSV dumps and Apache Parquet files for local use, as well as a
    public instance on Amazon Athena interactive query service for ready-to-use
    powerful analytical processing.

    By accessing the dataset, you agree with the Software Heritage Ethical Charter
    (https://www.softwareheritage.org/legal/users-ethical-charter/) for using the
    archive data, and the terms of use for bulk access.

    If you use this dataset for research purposes, please cite the following paper:

    • Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
      The Software Heritage Graph Dataset: Public software development under one roof.
      In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.

    You can also refer to the above paper for more information about the dataset and sample queries.

  17. Dataset for Identify structures underlying out-of-equilibrium reaction...

    • data.4tu.nl
    • 4tu.edu.hpc.n-helix.com
    zip
    Updated Jan 8, 2025
    Cite
    Éverton F. Da Cunha; Yanna J. Kraakman; Dmitrii Kriukov; Thomas van Poppel; Clara Stegehuis; Albert S. Y. Wong (2025). Dataset for Identify structures underlying out-of-equilibrium reaction networks with random graph analysis [Dataset]. http://doi.org/10.4121/ac3c7c42-f367-41d7-bd3b-fa54714b3a1b.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Éverton F. Da Cunha; Yanna J. Kraakman; Dmitrii Kriukov; Thomas van Poppel; Clara Stegehuis; Albert S. Y. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Python scripts were developed to analyze and visualize the data encoded in the provided .txt files. MATLAB scripts generate the raw time-series data. Two examples of simulated results are provided as Graphics Interchange Format (GIF) files. For more details, two README files are included:

    • 20241129_EF_YK_ChemSci_Networks_DatasetREADME.txt provides a description of this dataset.
    • README.txt provides instructions on the use of the included scripts.

  18. A study on real graphs of fake news spreading on Twitter

    • zenodo.org
    bin
    Updated Aug 20, 2021
    + more versions
    Cite
    Amirhosein Bodaghi (2021). A study on real graphs of fake news spreading on Twitter [Dataset]. http://doi.org/10.5281/zenodo.5225338
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amirhosein Bodaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    *** Fake News on Twitter ***

    These 5 datasets are the results of an empirical study on the spreading process of newly emerged fake news on Twitter. In particular, we have focused on those fake news stories which have given rise to a truth spreading simultaneously against them. The story of each fake news item is as follows:

    1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.

    2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."

    3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.

    4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.

    5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.

    The data collection was done in two stages, each of which provided a new dataset: 1- attaining the Dataset of Diffusion (DD), which includes information on fake news/truth tweets and retweets; 2- querying the neighbors of tweet spreaders, which provides us with the Dataset of Graph (DG).

    DD

    DD for each fake news story is an Excel file, named FNx_DD, where x is the number of the fake news story, and has the following structure:

    The structure of the Excel files for each dataset is as follows:

    • Each row belongs to one captured tweet/retweet related to the rumor, and each column of the dataset presents a specific information about the tweet/retweet. These columns from left to right present the following information about the tweet/retweet:
    • User ID (user who has posted the current tweet/retweet)
    • The number of tweets/retweets published by the user at the time of posting the current tweet/retweet
    • Language of the tweet/retweet
    • Number of followers
    • Number of followings (friends)
    • Date and time of posting the current tweet/retweet
    • Number of likes (favorites) the current tweet had acquired before crawling it
    • Number of times the current tweet had been retweeted before crawling it
    • Whether there is any other tweet inside the current tweet/retweet (for example, this happens when the current tweet is a quote, reply, or retweet)
    • The source (OS) of the device by which the current tweet/retweet was posted
    • Tweet/Retweet ID
    • Retweet ID (if the post is a retweet, then this feature gives the ID of the tweet that is retweeted by the current post)
    • Quote ID (if the post is a quote, then this feature gives the ID of the tweet that is quoted by the current post)
    • Reply ID (if the post is a reply, then this feature gives the ID of the tweet that is replied to by the current post)
    • Frequency of tweet occurrence, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet exists in the dataset in the form of retweets posted by others)
    • State of the tweet which can be one of the following forms (achieved by an agreement between the annotators):
    • r : The tweet/retweet is a fake news post
    • a : The tweet/retweet is a truth post
    • q : The tweet/retweet is a question about the fake news; it neither confirms nor denies it
    • n : The tweet/retweet is not related to the fake news (even though it contains queries related to the rumor, it does not refer to the given fake news)

    DG

    DG for each fake news contains two files:

    • A file in graph format (.graph) which includes the information of the graph, such as who is linked to whom. (This file is named FNx_DG.graph, where x is the number of the fake news story.)
    • A file in Jsonl format (.jsonl) which includes the real user IDs of the nodes in the graph file. (This file is named FNx_Labels.jsonl, where x is the number of the fake news story.)

    In the graph file, the label of each node is the order of its entry into the graph. For example, if the node with user ID 12345637 is the first node entered into the graph file, then its label in the graph is 0 and its real ID (12345637) would be at row number 1 in the jsonl file (because row number 0 belongs to the column labels); the other node IDs would likewise be at the following rows of the file (each row corresponds to 1 user ID). Therefore, if we want to know, for example, what the user ID of node 200 (labeled 200 in the graph) is, then in the jsonl file we should look at row number 202.
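
    A sketch of that lookup, reading the jsonl file line by line (the filename is one of the FNx_Labels.jsonl files described above; note that the two worked examples in the paragraph imply slightly different offsets, node 0 at row 1 but node 200 at row 202, so the off-by-one should be verified against the actual files):

    ```python
    import json

    def node_label_to_user_id(jsonl_path, node_label):
        """Map a graph node label to its real user ID via the labels file."""
        with open(jsonl_path, encoding="utf-8") as f:
            rows = [json.loads(line) for line in f]
        return rows[node_label + 1]  # row 0 holds the column labels

    print(node_label_to_user_id("FN1_Labels.jsonl", 0))  # user ID of the first node
    ```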

    The user IDs of spreaders in DG (those who have had a post in DD) would be available in DD to get extra information about them and their tweet/retweet. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.

  19. Parameters of the sample used in the random graphs test.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    José Antonio Martín H. (2023). Parameters of the sample used in the random graphs test. [Dataset]. http://doi.org/10.1371/journal.pone.0053437.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    José Antonio Martín H.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Graphs are sampled from the well-known Erdős-Rényi random graph distribution. The sample consists of graph instances for 100 vertices, generating 10000 graphs in total. For each number of vertices, the average degree is varied from 3 to 6.
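
    For reproduction purposes, sampling an Erdős-Rényi graph G(n, p) with a target average degree d amounts to setting p = d/(n-1), since the expected degree of each vertex is p(n-1). A sketch with networkx (the seed is arbitrary):

    ```python
    import networkx as nx

    n = 100
    for avg_degree in (3, 4, 5, 6):
        p = avg_degree / (n - 1)                # expected average degree = p * (n - 1)
        G = nx.gnp_random_graph(n, p, seed=42)
        print(avg_degree, sum(d for _, d in G.degree) / n)  # empirical average degree
    ```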

  20. The Yelp Collaborative Knowledge Graph

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Olesen, Magnus (2023). The Yelp Collaborative Knowledge Graph [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7878446
    Explore at:
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Nielsen Holdings (http://nielsen.com/)
    Olesen, Magnus
    Heede, Thomas
    Corfixen, Mads
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the Yelp Collaborative Knowledge Graph (YCKG), a transformation of the Yelp Open Dataset into RDF format using Y2KG.

    Paper Abstract

    The Yelp Open Dataset (YOD) contains data about businesses, reviews, and users from the Yelp website and is available for research purposes. This dataset has been widely used to develop and test Recommender Systems (RS), especially those using Knowledge Graphs (KGs), e.g., integrating taxonomies, product categories, business locations, and social network information. Unfortunately, researchers applied naive or wrong mappings while converting YOD in KGs, consequently obtaining unrealistic results. Among the various issues, the conversion processes usually do not follow state-of-the-art methodologies, fail to properly link to other KGs and reuse existing vocabularies. In this work, we overcome these issues by introducing Y2KG, a utility to convert the Yelp dataset into a KG. Y2KG consists of two components. The first is a dataset including (1) a vocabulary that extends Schema.org with properties to describe the concepts in YOD and (2) mappings between the Yelp entities and Wikidata. The second component is a set of scripts to transform YOD in RDF and obtain the Yelp Collaborative Knowledge Graph (YCKG). The design of Y2KG was driven by 16 core competency questions. YCKG includes 150k businesses and 16.9M reviews from 1.9M distinct real users, resulting in over 244 million triples (with 144 distinct predicates) for about 72 million resources, with an average in-degree and out-degree of 3.3 and 12.2, respectively.

    Links

    Latest GitHub release: https://github.com/MadsCorfixen/The-Yelp-Collaborative-Knowledge-Graph/releases/latest

    PURL domain: https://purl.archive.org/domain/yckg

    Files

    Graph Data Triple Files

    One sample file for each of the Yelp domains (Businesses, Users, Reviews, Tips and Checkins), each containing 20 entities.

    yelp_schema_mappings.nt.gz containing the mappings from Yelp categories to Schema things.

    schema_hierarchy.nt.gz containing the full hierarchy of the mapped Schema things.

    yelp_wiki_mappings.nt.gz containing the mappings from Yelp categories to Wikidata entities.

    wikidata_location_mappings.nt.gz containing the mappings from Yelp locations to Wikidata entities.

    Graph Metadata Triple Files

    yelp_categories.ttl contains metadata for all Yelp categories.

    yelp_entities.ttl contains metadata regarding the dataset.

    yelp_vocabulary.ttl contains metadata on the created Yelp vocabulary and properties.

    Utility Files

    yelp_category_schema_mappings.csv. This file contains the 310 mappings from Yelp categories to Schema types. These mappings have been manually verified to be correct.

    yelp_predicate_schema_mappings.csv. This file contains the 14 mappings from Yelp attributes to Schema properties. These mappings are manually found.

    ground_truth_yelp_category_schema_mappings.csv. This file contains the ground truth, based on 200 manually verified mappings from Yelp categories to Schema things. The ground truth mappings were used to calculate precision and recall for the semantic mappings.

    manually_split_categories.csv. This file contains all Yelp categories containing either a & or /, and their manually split versions. The split versions have been used in the semantic mappings to Schema things.
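
    To get a feel for the RDF files, a small sketch that loads one of the gzipped N-Triples files with rdflib and tallies its predicates (rdflib is an illustrative choice; the filename is taken from the list above):

    ```python
    import gzip
    from collections import Counter
    from rdflib import Graph

    # Parse one of the gzipped N-Triples files listed above.
    with gzip.open("yelp_schema_mappings.nt.gz", "rt", encoding="utf-8") as f:
        g = Graph().parse(data=f.read(), format="nt")

    predicate_counts = Counter(p for _, p, _ in g)
    for predicate, count in predicate_counts.most_common(5):
        print(count, predicate)
    ```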
