58 datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the data sets published here contain actual data, they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv:contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 refers to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    UniProt IDstringprotein identification
    labelstringprotein label (type of node)
    propertiesstringa dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    Relationship IDstringrelationship identification
    Source IDstringidentification of the source protein in the relationship
    Target IDstringidentification of the target protein in the relationship
    labelstringrelationship label (type of relationship)
    propertiesstringa dictionary containing properties related to the relationship.

    Metadata

    GraphNumber of NodesNumber of EdgesSparse graph

    dataset_30*

    30

    47

    Y

    dataset_60*

    60

    181

    Y

    dataset_120*

    120

    689

    Y

    dataset_240*

    240

    2819

    Y

    dataset_300*

    300

    4658

    Y

    dataset_600*

    600

    18004

    Y

    dataset_1200*

    1200

    71785

    Y

    dataset_2400*

    2400

    288600

    Y

    dataset_3000*

    3000

    449727

    Y

    dataset_6000*

    6000

    1799413

    Y

    dataset_12000*

    12000

    7199863

    Y

    dataset_24000*

    24000

    28792361

    Y

    dataset_30000*

    30000

    44991744

    Y

    This repository include two (2) additional tiny graph datasets to experiment before dealing with larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    IDstringnode identification
    labelstringnode label (type of node)
    propertiesstringa dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    IDstringrelationship identification
    sourcestringidentification of the source node in the relationship
    targetstringidentification of the target node in the relationship
    labelstringrelationship label (type of relationship)
    propertiesstringa dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    GraphNumber of NodesNumber of EdgesSparse graph
    dataset_dummy*36N
    dataset_dummy2*36N
  2. NASS Data Visualization

    • agdatacommons.nal.usda.gov
    bin
    Updated Feb 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    USDA National Agricultural Statistics Service (2024). NASS Data Visualization [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/NASS_Data_Visualization/24660801
    Explore at:
    binAvailable download formats
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    United States Department of Agriculturehttp://usda.gov/
    National Agricultural Statistics Servicehttp://www.nass.usda.gov/
    Authors
    USDA National Agricultural Statistics Service
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    NASS Data Visualization provides a dynamic web query interface supporting searches by Commodity (e.g. Cotton, Corn, Farms & Land, Grapefruit, Hogs, Oranges, Soybeans, Wheat), Statistic type (automatically refreshed based upon choice of Commodity - e.g. Inventory, Head, Acres Planted, Acres Harvested, Production, Yield) to generate chart, table, and map visualizations by year (2001-2016), as well as a link to download the resulting data in CSV format compatible for updating databases and spreadsheets. Resources in this dataset:Resource Title: NASS Data Visualization web site. File Name: Web Page, url: https://nass.usda.gov/Data_Visualization/index.php Query interface with visualization of results as charts, tables, and maps.

  3. T

    Sweden - Distribution of population by household types: Three or more adults...

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Aug 26, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TRADING ECONOMICS (2020). Sweden - Distribution of population by household types: Three or more adults [Dataset]. https://tradingeconomics.com/sweden/distribution-of-population-by-household-types-three-or-more-adults-eurostat-data.html
    Explore at:
    excel, xml, json, csvAvailable download formats
    Dataset updated
    Aug 26, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1976 - Dec 31, 2025
    Area covered
    Sweden
    Description

    Sweden - Distribution of population by household types: Three or more adults was 3.00% in December of 2024, according to the EUROSTAT. Trading Economics provides the current actual value, an historical data chart and related indicators for Sweden - Distribution of population by household types: Three or more adults - last updated from the EUROSTAT on March of 2025. Historically, Sweden - Distribution of population by household types: Three or more adults reached a record high of 3.70% in December of 2017 and a record low of 2.50% in December of 2009.

  4. d

    Grammar transformations of topographic feature type annotations of the U.S....

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Jul 20, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
    Explore at:
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    United States
    Description

    These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. Objectives of our study were to analyze the semantic structure of input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to a knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries were proven to effectively convey graph triples which displayed semantic significance. These data represent and characterize the lexicon of our input text which are used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before data was processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (ex. bowl-shaped) which remained as such. The tokens’ lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.

  5. d

    Device Graph Data | 10+ Identity Types | 1500M+ Global Devices| CCPA...

    • datarade.ai
    Updated Aug 21, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    DRAKO (2024). Device Graph Data | 10+ Identity Types | 1500M+ Global Devices| CCPA Compliant [Dataset]. https://datarade.ai/data-products/drako-device-graph-data-usa-canada-comprehensive-insi-drako
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Aug 21, 2024
    Dataset authored and provided by
    DRAKO
    Area covered
    Philippines, Mozambique, Bahamas, South Sudan, Brazil, Eritrea, Cyprus, Lao People's Democratic Republic, Tonga, Aruba
    Description

    DRAKO is a leader in providing Device Graph Data, focusing on understanding the relationships between consumer devices and identities. Our data allows businesses to create holistic profiles of users, track engagement across platforms, and measure the effectiveness of advertising efforts.

    Device Graph Data is essential for accurate audience targeting, cross-device attribution, and understanding consumer journeys. By integrating data from multiple sources, we provide a unified view of user interactions, helping businesses make informed decisions.

    Key Features: - Comprehensive device mapping to understand user behaviour across multiple platforms - Detailed Identity Graph Data for cross-device identification and engagement tracking - Integration with Connected TV Data for enhanced insights into video consumption habits - Mobile Attribution Data to measure the effectiveness of mobile campaigns - Customizable analytics to segment audiences based on device usage and demographics - Some ID types offered: AAID, idfa, Unified ID 2.0, AFAI, MSAI, RIDA, AAID_CTV, IDFA_CTV

    Use Cases: - Cross-device marketing strategies - Attribution modelling and campaign performance measurement - Audience segmentation and targeting - Enhanced insights for Connected TV advertising - Comprehensive consumer journey mapping

    Data Compliance: All of our Device Graph Data is sourced responsibly and adheres to industry standards for data privacy and protection. We ensure that user identities are handled with care, providing insights without compromising individual privacy.

    Data Quality: DRAKO employs robust validation techniques to ensure the accuracy and reliability of our Device Graph Data. Our quality assurance processes include continuous monitoring and updates to maintain data integrity and relevance.

  6. Z

    CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 4, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mark Granroth-Wilding (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957401
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Mark Granroth-Wilding
    Dhiana Deva Cavacanti Rocha
    Armin Catovic
    Drew McCornack
    Lele Cao
    Vilhelm von Ehrenheim
    Richard Anselmo Stahl
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

    Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. It contains 76 distinct target companies, each of which has 5.3 competitors annotated in average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.

    Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

    Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customers' review, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

    However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial:https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  7. Z

    Data from: Entity Typing Datasets

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Mar 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Russa Biswas (2023). Entity Typing Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7688589
    Explore at:
    Dataset updated
    Mar 2, 2023
    Dataset authored and provided by
    Russa Biswas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the datasets used in the Entity Type Prediction task for Knowledge Graph Completion.

    DB630k_Fine-grained_Hierarchical.zip dataset has been used in the papers [1] and [2]. It is an extended version of DBpedia630k dataset originally created for Text classification and is available here.

    FIGER.zip dataset has also been used in the papers [1] and [2].

    MultilingualETdata.zip dataset has been used in the paper [3]

    NamesETdata.zip dataset has been used in the paper [4]. The CaLiGraph test dataset can also be downloaded here.

    [1] Biswas R, Sofronova R, Sack H, Alam M. Cat2type: Wikipedia category embeddings for entity typing in knowledge graphs. InProceedings of the 11th on Knowledge Capture Conference 2021 Dec 2 (pp. 81-88).

    [2] Biswas R, Portisch J, Paulheim H, Sack H, Alam M. Entity type prediction leveraging graph walks and entity descriptions. In The Semantic Web–ISWC 2022: 21st International Semantic Web Conference, Virtual Event, October 23–27, 2022, Proceedings 2022 Oct 16 (pp. 392-410). Cham: Springer International Publishing.

    [3] Biswas R, Chen Y, Paulheim H, Sack H, Alam M. It’s All in the Name: Entity Typing Using Multilingual Language Models. In The Semantic Web: ESWC 2022 Satellite Events: Hersonissos, Crete, Greece, May 29–June 2, 2022, Proceedings 2022 Jul 20 (pp. 36-41). Cham: Springer International Publishing.

    [4] Biswas R, Sofronova R, Alam M, Heist N, Paulheim H, Sack H. Do judge an entity by its name! entity typing using language models. In The Semantic Web: ESWC 2021 Satellite Events: Virtual Event, June 6–10, 2021, Revised Selected Papers 18 2021 (pp. 65-70). Springer International Publishing.

  8. Amount of data created, consumed, and stored 2010-2023, with forecasts to...

    • statista.com
    Updated Nov 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Explore at:
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    May 2024
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.

  9. Data from: CKGG: A Chinese Knowledge Graph for High-School Geography...

    • data.subak.org
    • zenodo.org
    csv
    Updated Feb 16, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    State Key Laboratory for Novel Software Technology, Nanjing University, China (2023). CKGG: A Chinese Knowledge Graph for High-School Geography Education and Beyond [Dataset]. https://data.subak.org/dataset/ckgg-a-chinese-knowledge-graph-for-high-school-geography-education-and-beyond
    Explore at:
    csvAvailable download formats
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Nanjing Universityhttp://www.nju.edu.cn/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of files

    • admin.nt.gz: part of administrative divisions (country, admin1, admin2, admin3, admin4)
    • admin_code.nt.gz: administrative division code
    • altitude.nt.gz: altitude in meters
    • belong_by_admin.nt.gz: "part of lowest administrative divisions" triples converted to triples with "part of" predicate
    • climate.nt.gz: climate type of location, also metadata about climate types (name, description, polygon)
    • cn_stat_data.nt.gz: statistical data of province from National Bureau of Statistics of China
    • coord.nt.gz: coordinate (latitude and longitude)
    • entity_rank_score.nt.gz: ranking score of entity
    • feature_belong.nt.gz: part of calculated using polygons of places
    • geonames_code.nt.gz: geonames feature code
    • gis.nt.gz: polygons of place
    • globalsolaraltas.nt.gz: annual solar radiation data in kWh/m2
    • label.nt.gz: labels
    • node_align.nt.gz: alignment to geonames and wikidata
    • ocean_current.nt.gz: influenced by ocean current, also metadata about ocean currents (name, linestring)
    • pop.nt.gz: population
    • precip_xxx.nt.gz: monthly and annual precipitation data of entities in meters
    • temperature_xxx.nt.gz: monthly and annual temperature data of entities in celsius degree
    • to_ocean_distance.nt.gz: location distance to ocean in meters
    • types.nt.gz: rdf:type of locations
    • tz.nt.gz: metadata of timezones (label, offset to gmt)
    • tz_loc.nt.gz: timezone of locations
    • wikilinks.nt.gz: wikipedia links of locations

    Ontology

    • ontology.owl: CKGG ontology
    • mapping.owl: CKGG ontology mapping to other ontologies (Clinga, DBpedia, GeoSPARQL, SKOS)

    For more details, please refer to http://w3id.org/ckgg

  10. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    pdf, tsv
    Updated Jul 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    tsv, pdfAvailable download formats
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one that acts as the core of the dataset is the page file, after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables" making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files).

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  11. Data from: OpenAIRE Graph Beginner's Kit Dataset

    • zenodo.org
    • pub.uni-bielefeld.de
    tar
    Updated Aug 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Miriam Baglioni; Miriam Baglioni; Claudio Atzori; Claudio Atzori; Alessia Bardi; Alessia Bardi; Gianbattista Bloisi; Sandro La Bruzzo; Sandro La Bruzzo; Paolo Manghi; Paolo Manghi; Harry Dimitropoulos; Andrea Mannocci; Andrea Mannocci; Ioannis Foufoulas; Marek Horst; Michele De Bonis; Michele De Bonis; Michele Artini; Thanasis Vergoulis; Thanasis Vergoulis; Serafeim Chatzopoulos; Serafeim Chatzopoulos; Dimitris Pierrakos; Antonis Lempesis; Antonis Lempesis; Andreas Czerniak; Andreas Czerniak; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou; Argiro Kokogiannaki; Argiro Kokogiannaki; Gianbattista Bloisi; Harry Dimitropoulos; Ioannis Foufoulas; Marek Horst; Michele Artini; Dimitris Pierrakos; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou (2023). OpenAIRE Graph Beginner's Kit Dataset [Dataset]. http://doi.org/10.5281/zenodo.8223812
    Explore at:
    tarAvailable download formats
    Dataset updated
    Aug 20, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Miriam Baglioni; Miriam Baglioni; Claudio Atzori; Claudio Atzori; Alessia Bardi; Alessia Bardi; Gianbattista Bloisi; Sandro La Bruzzo; Sandro La Bruzzo; Paolo Manghi; Paolo Manghi; Harry Dimitropoulos; Andrea Mannocci; Andrea Mannocci; Ioannis Foufoulas; Marek Horst; Michele De Bonis; Michele De Bonis; Michele Artini; Thanasis Vergoulis; Thanasis Vergoulis; Serafeim Chatzopoulos; Serafeim Chatzopoulos; Dimitris Pierrakos; Antonis Lempesis; Antonis Lempesis; Andreas Czerniak; Andreas Czerniak; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou; Argiro Kokogiannaki; Argiro Kokogiannaki; Gianbattista Bloisi; Harry Dimitropoulos; Ioannis Foufoulas; Marek Horst; Michele Artini; Dimitris Pierrakos; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The OpenAIRE Graph is an Open Access dataset containing metadata about research products (literature, datasets, software, etc.) linked to other entities of the research ecosystem like organisations, project grants, and data sources.

    The large size of the OpenAIRE Graph is a major impediment for beginners to familiarise with the underlying data model and explore its contents. Working with the Graph in its full size typically requires access to a huge distributed computing infrastructure which cannot be easily accessible to everyone.

    The OpenAIRE Beginner’s Kit aims to address this issue. It consists of two components:

    • A subset of the OpenAIRE Graph composed of the research products published between 2022-12-28 and 2023-07-31, all the entities connected to them and the respective relationships. The subset is composed of the following parts:

      • publication.tar: metadata records about research literature (includes types of publications listed here)

      • dataset.tar: metadata records about research data (includes the subtypes listed here)

      • software.tar: metadata records about research software (includes the subtypes listed here)

      • otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)

      • organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.

      • datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.

      • project.tar: metadata records about project grants.

      • relation.tar: metadata records about relations between entities in the graph.

      • communities_infrastructures.tar: metadata records about research communities and research infrastructures

        Each file is a tar archive containing gz files, each with one json per line. Each json is compliant to the schema available at http://doi.org/10.5281/zenodo.8238874.

    • The code to analyse the data. It is available on GitHub. Just download the archive, unzip/untar it and follow the instruction on the README file (no need to clone the GitHub repository)


  12. Semantic Knowledge Graphing Market is Growing at a CAGR of 14.80% from 2024...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Mar 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cognitive Market Research (2024). Semantic Knowledge Graphing Market is Growing at a CAGR of 14.80% from 2024 to 2031. [Dataset]. https://www.cognitivemarketresearch.com/semantic-knowledge-graphing-market-report
    Explore at:
    pdf,excel,csv,pptAvailable download formats
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Decipher Market Research
    Authors
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global semantic knowledge graphing market size is USD 1512.2 million in 2024 and will expand at a compound annual growth rate (CAGR) of 14.80% from 2024 to 2031.

    North America held the major market of around 40% of the global revenue with a market size of USD 604.88 million in 2024 and will grow at a compound annual growth rate (CAGR) of 13.0% from 2024 to 2031.
    Europe accounted for a share of over 30% of the global market size of USD 453.66 million.
    Asia Pacific held the market of around 23% of the global revenue with a market size of USD 347.81 million in 2024 and will grow at a compound annual growth rate (CAGR) of 16.8% from 2024 to 2031.
    Latin America market of around 5% of the global revenue with a market size of USD 75.61 million in 2024 and will grow at a compound annual growth rate (CAGR) of 14.2% from 2024 to 2031.
    Middle East and Africa held the major market of around 2% of the global revenue with a market size of USD 30.24 million in 2024 and will grow at a compound annual growth rate (CAGR) of 14.5% from 2024 to 2031.
    The natural language processing knowledge graphing held the highest growth rate in semantic knowledge graphing market in 2024.
    

    Market Dynamics of Semantic Knowledge Graphing Market

    Key Drivers of Semantic Knowledge Graphing Market

    Growing Volumes of Structured, Semi-structured, and Unstructured Data to Increase the Global Demand
    

    The global demand for semantic knowledge graphing is escalating in response to the exponential growth of structured, semi-structured, and unstructured data. Enterprises are inundated with vast amounts of data from diverse sources such as social media, IoT devices, and enterprise applications. Structured data from databases, semi-structured data like XML and JSON, and unstructured data from documents, emails, and multimedia files present significant challenges in terms of organization, analysis, and deriving actionable insights. Semantic knowledge graphing addresses these challenges by providing a unified framework for representing, integrating, and analyzing disparate data types. By leveraging semantic technologies, businesses can unlock the value hidden within their data, enabling advanced analytics, natural language processing, and knowledge discovery. As organizations increasingly recognize the importance of harnessing data for strategic decision-making, the demand for semantic knowledge graphing solutions continues to surge globally.

    Demand for Contextual Insights to Propel the Growth
    

    The burgeoning demand for contextual insights is propelling the growth of semantic knowledge graphing solutions. In today's data-driven landscape, businesses are striving to extract deeper contextual meaning from their vast datasets to gain a competitive edge. Semantic knowledge graphing enables organizations to connect disparate data points, understand relationships, and derive valuable insights within the appropriate context. This contextual understanding is crucial for various applications such as personalized recommendations, predictive analytics, and targeted marketing campaigns. By leveraging semantic technologies, companies can not only enhance decision-making processes but also improve customer experiences and operational efficiency. As industries across sectors increasingly recognize the importance of contextual insights in driving innovation and business success, the adoption of semantic knowledge graphing solutions is poised to witness significant growth. This trend underscores the pivotal role of semantic technologies in unlocking the true potential of data for strategic advantage in today's dynamic marketplace.

    Restraint Factors Of Semantic Knowledge Graphing Market

    Stringent Data Privacy Regulations to Hinder the Market Growth
    

    Stringent data privacy regulations present a significant hurdle to the growth of the Semantic Knowledge Graphing market. Regulations such as GDPR (General Data Protection Regulation) in Europe and CCPA (California Consumer Privacy Act) in the United States impose strict requirements on how organizations collect, store, process, and share personal data. Compliance with these regulations necessitates robust data protection measures, including anonymization, encryption, and access controls, which can complicate the implementation of semantic knowledge graphing systems. Moreover, concerns about data breach...

  13. T

    Spain - Distribution of population by household types: Single person

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Sep 16, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TRADING ECONOMICS (2020). Spain - Distribution of population by household types: Single person [Dataset]. https://tradingeconomics.com/spain/distribution-of-population-by-household-types-single-person-eurostat-data.html
    Explore at:
    csv, json, xml, excelAvailable download formats
    Dataset updated
    Sep 16, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1976 - Dec 31, 2025
    Area covered
    Spain
    Description

    Spain - Distribution of population by household types: Single person was 11.30% in December of 2024, according to the EUROSTAT. Trading Economics provides the current actual value, an historical data chart and related indicators for Spain - Distribution of population by household types: Single person - last updated from the EUROSTAT on March of 2025. Historically, Spain - Distribution of population by household types: Single person reached a record high of 11.30% in December of 2024 and a record low of 8.40% in December of 2009.

  14. d

    Area Age Gender Statistics Chart - Epidemic Typhus - Statistics by Onset...

    • data.gov.tw
    csv, json
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Centers for Disease Control, Area Age Gender Statistics Chart - Epidemic Typhus - Statistics by Onset Date (in months) [Dataset]. https://data.gov.tw/en/datasets/8671
    Explore at:
    json, csvAvailable download formats
    Dataset authored and provided by
    Centers for Disease Control
    License

    https://data.gov.tw/licensehttps://data.gov.tw/license

    Description

    Statistical table of the number of cases by region, age group, and gender since 2003 (Disease name: Scrub typhus, Date type: Onset date, Case type: Confirmed case, Source of infection: Domestic, Imported).

  15. T

    Hungary - Distribution of population by household types: Single person

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Sep 15, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TRADING ECONOMICS (2020). Hungary - Distribution of population by household types: Single person [Dataset]. https://tradingeconomics.com/hungary/distribution-of-population-by-household-types-single-person-eurostat-data.html
    Explore at:
    xml, json, excel, csvAvailable download formats
    Dataset updated
    Sep 15, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1976 - Dec 31, 2025
    Area covered
    Hungary
    Description

    Hungary - Distribution of population by household types: Single person was 13.80% in December of 2023, according to the EUROSTAT. Trading Economics provides the current actual value, an historical data chart and related indicators for Hungary - Distribution of population by household types: Single person - last updated from the EUROSTAT on March of 2025. Historically, Hungary - Distribution of population by household types: Single person reached a record high of 14.50% in December of 2017 and a record low of 9.20% in December of 2010.

  16. w

    Part I and Part II crimes bar chart

    • data.wu.ac.at
    csv, json, xml
    Updated Aug 27, 2016
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    County of San Mateo Sheriff's Office (2016). Part I and Part II crimes bar chart [Dataset]. https://data.wu.ac.at/schema/performance_smcgov_org/bnZoNi01OW1m
    Explore at:
    csv, json, xmlAvailable download formats
    Dataset updated
    Aug 27, 2016
    Dataset provided by
    County of San Mateo Sheriff's Office
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    Counts of Part I committed in San Mateo County from 1985 on. This dataset also includes Part II crimes from 2013 on.

    Part I crimes include: homicide, rape, robbery, aggravated assault, burglary, motor vehicle theft, larceny-theft, and arson. These counts include crimes committed at San Francisco International Airport (SFO), Unincorporated San Mateo County, Woodside, Portola Valley, San Carlos from 10/31/10 forward; Half Moon Bay from 6/12/11 forward; and Millbrae from 3/4/12 forward.

    Part II crimes do not include San Francisco International Airport (SFO) cases and is an estimate only. An estimate is required because there are no specific data types used when keying in Type II crime types. Therefore, Records Manager judgment is used.

  17. Κ

    The Enhanced Microsoft Academic Knowledge Graph

    • datacatalogue.sodanet.gr
    • datacatalogue.cessda.eu
    Updated Apr 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Κατάλογος Δεδομένων SoDaNet (2024). The Enhanced Microsoft Academic Knowledge Graph [Dataset]. http://doi.org/10.17903/FK2/TZWQPD
    Explore at:
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Κατάλογος Δεδομένων SoDaNet
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1800 - Dec 31, 2021
    Area covered
    Worldwide
    Dataset funded by
    European Commission
    Description

    The Enhanced Microsoft Academic Knowledge Graph (EMAKG) is a large dataset of scientific publications and related entities, including authors, institutions, journals, conferences, and fields of study. The proposed dataset originates from the Microsoft Academic Knowledge Graph (MAKG), one of the most extensive freely available knowledge graphs of scholarly data. To build the dataset, we first assessed the limitations of the current MAKG. Then, based on these, several methods were designed to enhance data and facilitate the number of use case scenarios, particularly in mobility and network analysis. EMAKG provides two main advantages: It has improved usability, facilitating access to non-expert users It includes an increased number of types of information obtained by integrating various datasets and sources, which help expand the application domains. For instance, geographical information could help mobility and migration research. The knowledge graph completeness is improved by retrieving and merging information on publications and other entities no longer available in the latest version of MAKG. Furthermore, geographical and collaboration networks details are employed to provide data on authors as well as their annual locations and career nationalities, together with worldwide yearly stocks and flows. Among others, the dataset also includes: fields of study (and publications) labelled by their discipline(s); abstracts and linguistic features, i.e., standard language codes, tokens , and types entities’ general information, e.g., date of foundation and type of institutions; and academia related metrics, i.e., h-index. The resulting dataset maintains all the characteristics of the parent datasets and includes a set of additional subsets and data that can be used for new case studies relating to network analysis, knowledge exchange, linguistics, computational linguistics, and mobility and human migration, among others.

  18. Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 29, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li; Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li; Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

    Dataset Details

    The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:

    • Subject matter triples file
      • fb+/-CVT+/-REV One folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt Subject matter triples are the triples belong to subject matters domains—domains describing real-world facts.
        • Example of a row in train.txt, valid.txt, and test.txt:
          • 2, 192, 0
        • Example of a row in entity2id.txt:
          • /g/112yfy2xr, 2
        • Example of a row in relation2id.txt:
          • /music/album/release_type, 192
        • Explaination
          • "/g/112yfy2xr" and "/m/02lx2r" are the MID of the subject entity and object entity, respectively. "/music/album/release_type" is the realtionship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to the objects.
    • Type system file
      • freebase_endtypes: Each row maps an edge type to its required subject type and object type.
        • Example
          • 92, 47178872, 90
        • Explanation
          • "92" and "90" are the type id of the subject and object which has the relationship id "47178872".
    • Metadata files
      • object_types: Each row maps the MID of a Freebase object to a type it belongs to.
        • Example
          • /g/11b41c22g, /type/object/type, /people/person
        • Explanation
          • The entity with MID "/g/11b41c22g" has a type "/people/person"
      • object_names: Each row maps the MID of a Freebase object to its textual label.
        • Example
          • /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
        • Explanation
          • The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
      • object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
        • Example
          • /m/05v3y9r, /type/object/id, "/music/live_album/concert"
        • Explanation
          • The entity with MID "/m/05v3y9r" can be interpreted by human as a music concert live album.
      • domains_id_label: Each row maps the MID of a Freebase domain to its label.
        • Example
          • /m/05v4pmy, geology, 77
        • Explanation
          • The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
      • types_id_label: Each row maps the MID of a Freebase type to its label.
        • Example
          • /m/01xljxh, /government/political_party, 147
        • Explanation
          • The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
      • entities_id_label: Each row maps the MID of a Freebase entity to its label.
        • Example
          • /g/11b78qtr5m, Viroliano Tries Jazz, 2234
        • Explanation
          • The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
        • properties_id_label: Each row maps the MID of a Freebase property to its label.
          • Example
            • /m/010h8tp2, /comedy/comedy_group/members, 47178867
          • Explanation
            • The object with MID "/m/010h8tp2" in Freebase is a property(relation/edge), it has label "/comedy/comedy_group/members" and has id "47178867" in our dataset.
        • uri_original2simplified and uri_simplified2original: The mapping between original URI and simplified URI and the mapping between simplified URI and original URI repectively.

  19. Z

    OpenAIRE Graph Dataset

    • data.niaid.nih.gov
    Updated Sep 24, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAIRE Graph Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3516917
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Vergoulis, Thanasis
    Ioannidis, Alexandros
    Principe, Pedro
    Foufoulas, Ioannis
    Chatzopoulos, Serafeim
    Lempesis, Antonis
    La Bruzzo, Sandro
    Iatropoulou, Katerina
    Artini, Michele
    Dimitropoulos, Harry
    Baglioni, Miriam
    Atzori, Claudio
    Manola, Natalia
    Bardi, Alessia
    Manghi, Paolo
    Kokogiannaki, Argiro
    De Bonis, Michele
    Mannocci, Andrea
    Horst, Marek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The OpenAIRE Graph is exported as several dataset, so you can download the parts you are interested into.

    publication_[part].tar: metadata records about research literature (includes types of publications listed here)dataset_[part].tar: metadata records about research data (includes the subtypes listed here) software.tar: metadata records about research software (includes the subtypes listed here)otherresearchproduct_[part].tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.project.tar: metadata records about project grants.relation_[part].tar: metadata records about relations between entities in the graph.communities_infrastructures.tar: metadata records about research communities and research infrastructures

    Each file is a tar archive containing gz files, each with one json per line. Each json is compliant to the schema available at http://doi.org/10.5281/zenodo.13120886. The documentation for the model is available at https://graph.openaire.eu/docs/data-model/

    Learn more about the OpenAIRE Graph at https://graph.openaire.eu.

    Discover the graph's content on OpenAIRE EXPLORE and our API for developers.

  20. Number of data compromises and impacted individuals in U.S. 2005-2023

    • statista.com
    Updated Dec 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Number of data compromises and impacted individuals in U.S. 2005-2023 [Dataset]. https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
    Explore at:
    Dataset updated
    Dec 10, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    United States
    Description

    In 2023, the number of data compromises in the United States stood at 3,205 cases. Meanwhile, over 353 million individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common. As a result of all three incidents, the sensitive data is accessed by an unauthorized threat actor. Industries most vulnerable to data breaches Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information organizations of these sectors store. In 2022, healthcare, financial services, and manufacturing were the three industry sectors that recorded most data breaches. The number of healthcare data breaches in the United States has gradually increased within the past few years. In the financial sector, data compromises increased almost twice between 2020 and 2022, while manufacturing saw an increase of more than three times in data compromise incidents. Largest data exposures worldwide In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This, by far, is the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, came up with an updated number of leaked records, which was three billion. In March 2018, the third biggest data breach happened, involving India’s national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
Organization logo

Sample Graph Datasets in CSV Format

Explore at:
csvAvailable download formats
Dataset updated
Dec 9, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Edwin Carreño; Edwin Carreño
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data, they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

  • dataset_30_nodes_interactions.csv:contains 30 rows (nodes).
  • dataset_30_edges_interactions.csv: contains 47 rows (edges).
  • the common identifier dataset_30 refers to the same graph.

CSV nodes

Each dataset contains the following columns:

Name of the ColumnTypeDescription
UniProt IDstringprotein identification
labelstringprotein label (type of node)
propertiesstringa dictionary containing properties related to the protein.

CSV edges

Each dataset contains the following columns:

Name of the ColumnTypeDescription
Relationship IDstringrelationship identification
Source IDstringidentification of the source protein in the relationship
Target IDstringidentification of the target protein in the relationship
labelstringrelationship label (type of relationship)
propertiesstringa dictionary containing properties related to the relationship.

Metadata

GraphNumber of NodesNumber of EdgesSparse graph

dataset_30*

30

47

Y

dataset_60*

60

181

Y

dataset_120*

120

689

Y

dataset_240*

240

2819

Y

dataset_300*

300

4658

Y

dataset_600*

600

18004

Y

dataset_1200*

1200

71785

Y

dataset_2400*

2400

288600

Y

dataset_3000*

3000

449727

Y

dataset_6000*

6000

1799413

Y

dataset_12000*

12000

7199863

Y

dataset_24000*

24000

28792361

Y

dataset_30000*

30000

44991744

Y

This repository include two (2) additional tiny graph datasets to experiment before dealing with larger datasets.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the ColumnTypeDescription
IDstringnode identification
labelstringnode label (type of node)
propertiesstringa dictionary containing properties related to the node.

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the ColumnTypeDescription
IDstringrelationship identification
sourcestringidentification of the source node in the relationship
targetstringidentification of the target node in the relationship
labelstringrelationship label (type of relationship)
propertiesstringa dictionary containing properties related to the relationship.

Metadata (tiny graphs)

GraphNumber of NodesNumber of EdgesSparse graph
dataset_dummy*36N
dataset_dummy2*36N
Search
Clear search
Close search
Google apps
Main menu