100+ datasets found
  1. Wikipedia Link Graph Dataset - 100K Pages

    • kaggle.com
    zip
    Updated Dec 4, 2025
    Cite
    Kutay Şahin (2025). Wikipedia Link Graph Dataset - 100K Pages [Dataset]. https://www.kaggle.com/datasets/kutayahin/wikipedia-link-graph-100k
    Available download formats: zip (908552367 bytes)
    Dataset updated
    Dec 4, 2025
    Authors
    Kutay Şahin
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    A comprehensive Wikipedia dataset containing 100,000 pages with 28.9 million links, collected using a breadth-first search (BFS) crawl. The dataset includes complete page metadata, link relationships, and a network graph representation suitable for network analysis, graph algorithms, NLP research, and machine learning applications.

    Dataset Overview

    • Total Pages: 100,000
    • Total Links: 28,855,738 (directed edges)
    • Average Words per Page: 3,531
    • Language: English (en.wikipedia.org)
    • Collection Method: BFS (Breadth-First Search) crawling, depth 5
    • Data Quality Score: 99.76/100

    Files Description

    1. pages_export.csv

    Complete page metadata, including:

    • id: Unique page ID
    • title: Page title
    • language: Language code (en)
    • content_length: Content length in characters
    • word_count: Word count
    • categories: JSON array of categories
    • infobox: JSON object of infobox data
    • created_at: Timestamp
    • url: Full Wikipedia URL

    Size: ~70 MB | Rows: 100,000

    2. links_export.csv

    Complete link graph with URLs:

    • id: Unique link ID
    • source_title: Source page title
    • target_title: Target page title
    • language: Language code
    • position: Link position on page
    • depth: Crawl depth where link was discovered
    • created_at: Timestamp
    • source_url: Full source page URL
    • target_url: Full target page URL

    Size: ~4.5 GB | Rows: 28,855,738
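A minimal sketch of turning the link table into an adjacency map, assuming the CSV header uses the column names listed above (source_title, target_title). The sample rows are made up for illustration; the real file is about 4.5 GB, so streaming it row by row as shown here is likely preferable to loading it whole.

```python
import csv
import io
from collections import defaultdict


def build_adjacency(csv_file):
    """Map each source_title to the set of target_title values it links to."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(csv_file):
        adjacency[row["source_title"]].add(row["target_title"])
    return adjacency


# Tiny in-memory stand-in for links_export.csv (hypothetical rows, subset of columns).
sample = io.StringIO(
    "id,source_title,target_title,depth\n"
    "1,Graph theory,Leonhard Euler,0\n"
    "2,Graph theory,Vertex (graph theory),0\n"
    "3,Leonhard Euler,Graph theory,1\n"
)

adjacency = build_adjacency(sample)
out_degree = {title: len(targets) for title, targets in adjacency.items()}
```

For the real file, the same function could be called on `open("links_export.csv", newline="", encoding="utf-8")`; the encoding is an assumption.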

    3. graph.json

    Network graph in JSON format:

    • nodes: Array of node objects with id field
    • edges: Array of edge objects with source and target fields

    Size: ~2.1 GB | Edges: 28,855,738
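A small sketch of how the nodes/edges layout described above could be consumed, here to count in-degree and out-degree per node. It assumes that each edge's source and target hold the same values as the node id field, which the description does not state explicitly. The sample graph is invented; at ~2.1 GB the real file may need a streaming JSON parser rather than `json.load`.

```python
import json


def degree_counts(graph):
    """Return (in_degree, out_degree) dicts keyed by node id."""
    out_deg = {node["id"]: 0 for node in graph["nodes"]}
    in_deg = dict(out_deg)
    for edge in graph["edges"]:
        out_deg[edge["source"]] += 1
        in_deg[edge["target"]] += 1
    return in_deg, out_deg


# Hypothetical miniature version of graph.json.
sample_graph = json.loads(
    '{"nodes": [{"id": "A"}, {"id": "B"}, {"id": "C"}],'
    ' "edges": [{"source": "A", "target": "B"},'
    ' {"source": "A", "target": "C"},'
    ' {"source": "B", "target": "C"}]}'
)

in_deg, out_deg = degree_counts(sample_graph)
```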

    Data Quality

    • Content Coverage: 99.99% (99,992 pages have quality content)
    • Link Quality: 99.22%
    • Uniqueness: 100% (all links are unique)
    • Content Quality: 100% (average 3,531 words per page)
    • Duplicate Pages: Minimal (cleaned)
    • Self-Links: 4,326 (removed)
    • Data Validation: ✅ All entries validated and cleaned

    Use Cases

    1. Network Analysis: Study Wikipedia link structure and page connectivity
    2. Graph Algorithms: Test shortest path, centrality, community detection algorithms
    3. NLP Research: Analyze Wikipedia content, categories, and relationships
    4. Machine Learning: Train models on Wikipedia link prediction
    5. Knowledge Graph: Build knowledge graphs from Wikipedia structure
    6. PageRank: Implement and test PageRank algorithms
    7. Recommendation Systems: Build content recommendation systems
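As an illustration of the PageRank use case listed above, here is a plain power-iteration sketch on a toy link graph. The damping factor of 0.85, the fixed iteration count, and the toy graph are illustrative choices, not values taken from the dataset. Pages with no outgoing links spread their rank evenly over all pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: most links point at "C".
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
top_page = max(ranks, key=ranks.get)
```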

    Collection Methodology

    1. Seed Selection: Started with 5 Wikipedia pages
    2. Crawling: BFS algorithm, depth 5
    3. Rate Limiting: Balanced (0.82 pages/second)
    4. Parallel Processing: Optimized concurrent workers
    5. Caching: HTML content cached for efficiency
    6. Validation: All data validated and deduplicated
    7. Quality Control: Automated quality checks and cleaning
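The crawl steps above can be sketched as a depth-limited breadth-first traversal. This is not the author's crawler: `get_links` is a hypothetical callback standing in for fetching and parsing a page, and rate limiting, caching, and checkpointing are left out. Self-links are skipped, in line with the data quality notes.

```python
from collections import deque


def bfs_crawl(seeds, get_links, max_depth, max_pages):
    """Visit pages breadth-first; return page depths and discovered links."""
    depth_of = {}
    queue = deque()
    for seed in seeds:
        if seed not in depth_of:
            depth_of[seed] = 0
            queue.append(seed)
    edges = []
    while queue:
        page = queue.popleft()
        depth = depth_of[page]
        for target in get_links(page):
            if target == page:
                continue  # drop self-links
            edges.append((page, target, depth))
            # Only schedule unseen pages within the depth and page limits.
            if (target not in depth_of and depth < max_depth
                    and len(depth_of) < max_pages):
                depth_of[target] = depth + 1
                queue.append(target)
    return depth_of, edges


# Hypothetical toy "web" used in place of real page fetches.
toy_web = {"A": ["B", "C", "A"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
depth_of, edges = bfs_crawl(["A"], lambda p: toy_web.get(p, []),
                            max_depth=2, max_pages=10)
```

In this sketch a link is recorded even when its target is never visited, so the edge list can reference pages outside the crawled set; whether the dataset does the same is not stated in the description.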

    Technical Details

    • Database: SQLite with WAL mode
    • Crawl Duration: ~29 hours
    • Crawl Rate: 0.82 pages/second
    • Checkpoint System: Resume-capable crawling
    • Data Cleaning: Automated duplicate removal and quality checks
  2. 540 Images Of Popular Graph Theory Graphs

    • kaggle.com
    zip
    Updated Dec 31, 2020
    Cite
    Thomas Konstantin (2020). 540 Images Of Popular Graph Theory Graphs [Dataset]. https://www.kaggle.com/thomaskonstantin/390-images-of-popular-graph-theory-graphs
    Available download formats: zip (394677 bytes)
    Dataset updated
    Dec 31, 2020
    Authors
    Thomas Konstantin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    This dataset contains 540 images of popular graphs from the world of graph theory. It covers many types of graphs from various graph families and of varying complexity, along with even more hidden stories and questions.

    Inspiration

    Many tasks in researching less trivial graphs are computationally expensive, especially when dealing with higher-order graphs. Can we use state-of-the-art computer vision pipelines and algorithms to extract insight from graph images? Such insight ranges from simple properties, like the number of vertices and edges, to more difficult questions, like clique sizes, graph radius, and paths.

  3. MIVIA ARG Dataset

    • mivia.unisa.it
    • zenodo.org
    text/vf-format
    Updated Jan 1, 2013
    Cite
    MIVIA Lab (2013). MIVIA ARG Dataset [Dataset]. http://doi.org/10.1016/S0167-8655(02)00253-2
    Available download formats: text/vf-format
    Dataset updated
    Jan 1, 2013
    Dataset authored and provided by
    MIVIA Lab
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The ARG Database is a huge collection of labeled and unlabeled graphs created by the MIVIA Group. The aim of this collection is to provide the graph research community with a standard test ground for benchmarking graph matching algorithms.

  4. MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets

    • ieee-dataport.org
    Updated Jan 26, 2025
    Cite
    Mohsen Koohi (2025). MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets [Dataset]. https://ieee-dataport.org/open-access/ms-biographs-trillion-scale-sequence-similarity-graph-datasets
    Dataset updated
    Jan 26, 2025
    Authors
    Mohsen Koohi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    MS-BioGraphs are a family of sequence similarity graph datasets with up to 2.5 trillion edges. The graphs have weighted edges and are presented in compressed WebGraph format. The dataset includes symmetric and asymmetric graphs. The largest graph was created by matching sequences in the Metaclust dataset with 1.7 billion sequences. These real-world graph datasets are useful for measuring contributions in High-Performance Computing and High-Performance Graph Processing.

  5. History of work (all graph datasets)

    • druid.datalegend.net
    • iisg.amsterdam
    • +1 more
    application/n-quads +5
    Updated Nov 4, 2025
    Cite
    History of Work (2025). History of work (all graph datasets) [Dataset]. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest
    Available download formats: application/n-quads, application/n-triples, application/trig, ttl, jsonld, application/sparql-results+json
    Dataset updated
    Nov 4, 2025
    Dataset authored and provided by
    History of Work
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    History of Work

    Here you find the History of Work resources as Linked Open Data. It enables you to look up HISCO and HISCAM scores for a very large number of occupational titles in numerous languages.

    Data can be queried (obtained) via the SPARQL endpoint or via the example queries. If the Linked Open Data format is new to you, you might enjoy these data stories on History of Work as Linked Open Data and this user question: "Is there a list of female occupations?"

    NEW version - CHANGE notes

    This version is dated Apr 2025 and is not backwards compatible with the previous version (Feb 2021). The major changes are:

    • substantial simplification of the graph representation (from 81 to 12);
    • use of sdo (https://schema.org/) rather than schema (http://schema.org);
    • replacement of prov:wasDerivedFrom with sdo:isPartOf to link occupational titles to originating datasets;
    • etl files (used for conversion to Linked Data) now publicly available via https://github.com/rlzijdeman/rdf-hisco;
    • update of issues with language tags;
    • specification of language tags for English (e.g. @en-gb instead of @en);
    • new preferred API: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/sparql (old API will be deprecated at some point: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/services/historyOfWork-all-latest/sparql).

    There are bound to be some issues. Please report them here.

    Figure 1. Part of model illustrating the basic relation between occupations, schema.org and HISCO. Image: https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca5521 (alt text: hisco-basic)

    Figure 2. Part of model illustrating the relation between occupation, provenance and HISCO auxiliary variables. Image: https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca551e (alt text: hisco-aux)

  6. graph

    • huggingface.co
    Updated Jul 16, 2025
    + more versions
    Cite
    Mighty Morphin Power Rangers (2025). graph [Dataset]. https://huggingface.co/datasets/mmpr/graph
    Dataset updated
    Jul 16, 2025
    Dataset authored and provided by
    Mighty Morphin Power Rangers
    Description

    mmpr/graph dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Graph Database Market Forecast Report | Graph Database (GDB) Industry Share...

    • emergenresearch.com
    pdf,excel,csv,ppt
    Updated Jan 21, 2022
    Cite
    Emergen Research (2022). Graph Database Market Forecast Report | Graph Database (GDB) Industry Share by 2030 [Dataset]. https://www.emergenresearch.com/industry-report/graph-database-market
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Jan 21, 2022
    Dataset authored and provided by
    Emergen Research
    License

    https://www.emergenresearch.com/privacy-policy

    Area covered
    Global
    Variables measured
    Base Year, No. of Pages, Growth Drivers, Forecast Period, Segments covered, Historical Data for, Pitfalls Challenges, 2030 Value Projection, Tables, Charts, and Figures, Forecast Period 2021 - 2030 CAGR, and 1 more
    Description

    The global Graph Database market size reached USD 1.59 Billion in 2020, and revenue is forecast to reach USD 11.25 Billion in 2030, registering a CAGR of 21.9%. The Graph Database (GDB) industry report classifies the global market by share, trend, growth and on the basis of component, deployment, graph type...

  8. Twitter Graph Example v2 43

    • kaggle.com
    zip
    Updated Jun 29, 2022
    Cite
    Mathias Weiß (2022). Twitter Graph Example v2 43 [Dataset]. https://www.kaggle.com/datasets/weissmedia/twitter-graph-example-v2-43
    Available download formats: zip (17943518 bytes)
    Dataset updated
    Jun 29, 2022
    Authors
    Mathias Weiß
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This project is inspired by https://github.com/neo4j-graph-examples/twitter-v2.

    Twitter Graph

    Show data from your personal Twitter account

    The Graph Your Network application inserts your Twitter activity into Neo4j.

    Image: https://neo4jsandbox.com/guides/twitter/img/twitter-data-model.svg

    Content

    ~10 MB of graph data (CSV)

    43,325 nodes, with labels:

    • Hashtag
    • Link
    • Me
    • Source
    • Tweet
    • User

    57,896 relationships, with types:

    • AMPLIFIES
    • CONTAINS
    • FOLLOWS
    • INTERACTS_WITH
    • MENTIONS
    • POSTS
    • REPLY_TO
    • RETWEETS
    • RT_MENTIONS
    • SIMILAR_TO
    • TAGS
    • USING

  9. Graph-level Classification Datasets

    • ieee-dataport.org
    Updated Dec 15, 2025
    Cite
    Zhiqiang Wang (2025). Graph-level Classification Datasets [Dataset]. https://ieee-dataport.org/documents/graph-level-classification-datasets
    Dataset updated
    Dec 15, 2025
    Authors
    Zhiqiang Wang
    Description

    BA3-motif

  10. Waxman Random Graph Dataset

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Guo; Du; Zhou (2024). Waxman Random Graph Dataset [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvd2F4bWFuLXJhbmRvbS1ncmFwaC1kYXRhc2V0
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Guo; Du; Zhou
    Description

    The dataset used in the paper is a Waxman random graph dataset, which includes graphs with features and edge features.

  11. Graph Technology Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Dec 22, 2025
    Cite
    Data Insights Market (2025). Graph Technology Report [Dataset]. https://www.datainsightsmarket.com/reports/graph-technology-1956854
    Available download formats: ppt, pdf, doc
    Dataset updated
    Dec 22, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the explosive growth of the graph technology market! Our in-depth analysis reveals key drivers, trends, and challenges impacting this dynamic sector, including leading players like Neo4j, Amazon AWS, and more. Explore market size projections, CAGR forecasts, and regional breakdowns to understand investment opportunities in graph databases and AI/ML.

  12. Oregon Autonomous Systems Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    + more versions
    Cite
    Subhajit Sahu (2021). Oregon Autonomous Systems Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-oregon
    Available download formats: zip (1682183 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Oregon
    Description

    Autonomous systems - Oregon-1

    Dataset information

    9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
    route-views between March 31 2001 and May 26 2001.

    Dataset statistics are calculated for the graphs with the lowest (March 31 2001) and highest (May 26 2001) number of nodes.

    Dataset statistics for graph with lowest number of nodes - 3 31 2001

    Nodes 10670
    Edges 22002
    Nodes in largest WCC 10670 (1.000)
    Edges in largest WCC 22002 (1.000)
    Nodes in largest SCC 10670 (1.000)
    Edges in largest SCC 22002 (1.000)
    Average clustering coefficient 0.4559
    Number of triangles 17144
    Fraction of closed triangles 0.009306
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.5

    Dataset statistics for graph with highest number of nodes - 5 26 2001

    Nodes 11174
    Edges 23409
    Nodes in largest WCC 11174 (1.000)
    Edges in largest WCC 23409 (1.000)
    Nodes in largest SCC 11174 (1.000)
    Edges in largest SCC 23409 (1.000)
    Average clustering coefficient 0.4532
    Number of triangles 19894
    Fraction of closed triangles 0.009636
    Diameter (longest shortest path) 10
    90-percentile effective diameter 4.4

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File Description
    * AS peering information inferred from Oregon route-views ...
    oregon1_010331.txt.gz from March 31 2001
    oregon1_010407.txt.gz from April 7 2001
    oregon1_010414.txt.gz from April 14 2001
    oregon1_010421.txt.gz from April 21 2001
    oregon1_010428.txt.gz from April 28 2001
    oregon1_010505.txt.gz from May 05 2001
    oregon1_010512.txt.gz from May 12 2001
    oregon1_010519.txt.gz from May 19 2001
    oregon1_010526.txt.gz from May 26 2001
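A minimal sketch of computing node and edge counts like those reported above from one of these files. It assumes the usual SNAP text layout (one whitespace-separated node pair per line, comment lines starting with "#") and treats the graph as undirected by collapsing (u, v) and (v, u) into one edge; neither assumption is confirmed by the listing. The sample lines are invented.

```python
import io


def read_edge_list(lines):
    """Return (nodes, undirected_edges) parsed from SNAP-style text lines."""
    nodes, edges = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        u, v = line.split()[:2]
        nodes.update((u, v))
        # Normalise the pair so both directions count as one edge.
        edges.add((min(u, v), max(u, v)))
    return nodes, edges


# Hypothetical stand-in for the contents of one oregon1_*.txt file.
sample = io.StringIO("# FromNodeId\tToNodeId\n1\t2\n2\t1\n2\t3\n")
nodes, edges = read_edge_list(sample)
```

A real file could be passed in with `gzip.open("oregon1_010331.txt.gz", "rt")`, which yields text lines in the same way.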

    NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
    set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26 2001.

    The nodes are uniform across all graphs in the sequence in the UF collection.
    That is, nodes do...

  13. Knowledge Graph Technology Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Knowledge Graph Technology Report [Dataset]. https://www.marketreportanalytics.com/reports/knowledge-graph-technology-53389
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming Knowledge Graph Technology market! This comprehensive analysis reveals key trends, growth drivers, and regional market shares from 2025-2033. Learn about market size, CAGR, and top players shaping this transformative technology.

  14. GOOD-Graph

    • huggingface.co
    Updated Jan 27, 2024
    + more versions
    Cite
    Zhikai chen (2024). GOOD-Graph [Dataset]. https://huggingface.co/datasets/zkchen/GOOD-Graph
    Dataset updated
    Jan 27, 2024
    Authors
    Zhikai chen
    Description

    zkchen/GOOD-Graph dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. Graph Database Market Growth and Forecast to 2033

    • univdatos.com
    Updated Nov 6, 2025
    Cite
    UnivDatos (2025). Graph Database Market Growth and Forecast to 2033 [Dataset]. https://univdatos.com/reports/graph-database-market
    Dataset updated
    Nov 6, 2025
    Dataset authored and provided by
    UnivDatos
    License

    https://univdatos.com/privacy-policy

    Description

    The Global Graph Database Market was valued at USD 2,257.78 million in 2024 and is expected to grow at a strong CAGR of around 17.5% during 2025-2033.

  16. Graph Database for Telecom Networks Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Cite
    Growth Market Reports (2025). Graph Database for Telecom Networks Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/graph-database-for-telecom-networks-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Graph Database for Telecom Networks Market Outlook



    According to our latest research, the global graph database for telecom networks market size is valued at USD 1.34 billion in 2024, reflecting a robust adoption rate across the telecom sector. The market is experiencing a strong upward trajectory with a CAGR of 22.7% from 2025 to 2033. By 2033, the market is projected to reach a substantial USD 10.15 billion, driven by the increasing complexity of telecom networks and the urgent need for advanced data management and analytics solutions. The primary growth factor is the surging demand for real-time network analytics and fraud detection capabilities, which are critical for telecom operators seeking operational efficiency and competitive advantage.




    The rapid proliferation of connected devices, 5G rollouts, and the exponential growth of data traffic are fundamentally transforming the telecom industry landscape. Telecom networks are evolving into highly complex, dynamic ecosystems that generate vast amounts of interconnected data. Traditional relational databases are often inadequate for handling such intricate relationships and real-time analytics requirements. Graph database solutions are uniquely positioned to address these challenges by enabling telecom operators to model, analyze, and visualize complex network topologies, customer interactions, and transactional data with unparalleled speed and flexibility. This technological shift is a key growth driver, as telecom providers increasingly seek scalable, agile, and intelligent data management platforms to enhance customer experience, optimize network performance, and accelerate digital transformation initiatives.




    Another significant growth factor for the graph database for telecom networks market is the escalating threat landscape, particularly in the domain of fraud detection and cybersecurity. Telecom operators are frequent targets of sophisticated fraud schemes, including SIM card cloning, subscription fraud, and network intrusion attempts. Graph databases excel at identifying hidden patterns, relationships, and anomalies within massive datasets, enabling telecom companies to detect and mitigate fraud in real time. The ability to perform advanced analytics on interconnected data sets is empowering telecom operators to proactively safeguard their networks, reduce financial losses, and comply with stringent regulatory requirements. As the complexity of cyber threats intensifies, the adoption of graph database solutions for security and fraud prevention is expected to surge, further fueling market growth.




    The growing emphasis on customer-centricity and personalized service delivery is also propelling market expansion. Telecom operators are leveraging graph databases to gain a 360-degree view of customer journeys, preferences, and interactions across multiple touchpoints. This holistic understanding facilitates targeted marketing, churn prediction, and tailored service offerings, which are essential for customer retention and revenue growth in a highly competitive market. The convergence of telecom networks with emerging technologies such as artificial intelligence, machine learning, and the Internet of Things (IoT) is amplifying the need for graph-based analytics, as these technologies rely on real-time, context-aware insights derived from complex data relationships. As a result, the integration of graph databases into telecom network architectures is becoming a strategic imperative for industry leaders.




    From a regional perspective, North America currently leads the global graph database for telecom networks market, accounting for the largest revenue share in 2024. The region’s dominance is attributed to the early adoption of advanced analytics technologies, robust digital infrastructure, and the presence of major telecom and technology companies. Asia Pacific is emerging as the fastest-growing region, driven by massive investments in 5G networks, expanding mobile subscriber base, and increasing focus on digital transformation across telecom operators. Europe is also witnessing significant adoption of graph database solutions, particularly in the context of regulatory compliance and network optimization. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by ongoing telecom sector modernization and rising demand for advanced data analytics. The global market outlook remains highly promising, with all regions poised to contribute to sustained growth over the forecast period.

  17. OpenAIRE Graph Dataset

    • data.niaid.nih.gov
    • pub.uni-bielefeld.de
    • +4 more
    Updated Feb 12, 2025
    Cite
    Manghi, Paolo; Atzori, Claudio; Bardi, Alessia; Baglioni, Miriam; Dimitropoulos, Harry; La Bruzzo, Sandro; Foufoulas, Ioannis; Mannocci, Andrea; Horst, Marek; Iatropoulou, Katerina; Kokogiannaki, Argiro; De Bonis, Michele; Artini, Michele; Lempesis, Antonis; Ioannidis, Alexandros; Manola, Natalia; Principe, Pedro; Vergoulis, Thanasis; Chatzopoulos, Serafeim (2025). OpenAIRE Graph Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3516917
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    CNR - ISTI
    University of Minho
    CERN
    Athena Research and Innovation Centre
    University of Warsaw
    Authors
    Manghi, Paolo; Atzori, Claudio; Bardi, Alessia; Baglioni, Miriam; Dimitropoulos, Harry; La Bruzzo, Sandro; Foufoulas, Ioannis; Mannocci, Andrea; Horst, Marek; Iatropoulou, Katerina; Kokogiannaki, Argiro; De Bonis, Michele; Artini, Michele; Lempesis, Antonis; Ioannidis, Alexandros; Manola, Natalia; Principe, Pedro; Vergoulis, Thanasis; Chatzopoulos, Serafeim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The OpenAIRE Graph is exported as several files, so you can download the parts you are interested in.

    • publication_[part].tar: metadata records about research literature (includes types of publications listed here)
    • dataset_[part].tar: metadata records about research data (includes the subtypes listed here)
    • software.tar: metadata records about research software (includes the subtypes listed here)
    • otherresearchproduct_[part].tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
    • organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
    • datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.
    • project.tar: metadata records about project grants.
    • relation_[part].tar: metadata records about relations between entities in the graph.
    • communities_infrastructures.tar: metadata records about research communities and research infrastructures

    Each file is a tar archive containing gz files, each with one JSON record per line. Each JSON record complies with the schema available at http://doi.org/10.5281/zenodo.14608526. The documentation for the model is available at https://graph.openaire.eu/docs/data-model/
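A minimal sketch of reading that layout (a tar archive of gz members, one JSON record per line) without unpacking it to disk. The member name and the record fields in the sample are hypothetical; only the tar/gz/JSON-lines nesting comes from the description above.

```python
import gzip
import io
import json
import tarfile


def iter_records(tar_fileobj):
    """Yield one parsed JSON object per line from every .gz member of a tar."""
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            raw = archive.extractfile(member)
            with gzip.open(raw, mode="rt", encoding="utf-8") as text:
                for line in text:
                    if line.strip():
                        yield json.loads(line)


def make_sample_tar():
    """Build a tiny in-memory archive with the same nesting (made-up data)."""
    payload = gzip.compress(b'{"id": "r1"}\n{"id": "r2"}\n')
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="part-00000.json.gz")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


records = list(iter_records(make_sample_tar()))
```

For a downloaded file, `iter_records(open("software.tar", "rb"))` would stream records one at a time, which matters given the sizes involved.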

    Learn more about the OpenAIRE Graph at https://graph.openaire.eu.

    Discover the graph's content on OpenAIRE EXPLORE and our API for developers.

    This deposition contains:

    • 192,934,523 publications
    • 73,443,566 datasets
    • 596,316 software
    • 24,797,142 other research products
    • 141,568 datasources
    • 3,482,537 projects
    • 454,601 organizations
    • 34 communities
    • 7,241,517,003 relations

  18. Graph datasets

    • kaggle.com
    zip
    Updated Sep 16, 2024
    Cite
    Jam0222 (2024). Graph datasets [Dataset]. https://www.kaggle.com/datasets/jam0222/graph-datasets
    Available download formats: zip (7148830977 bytes)
    Dataset updated
    Sep 16, 2024
    Authors
    Jam0222
    Description

    Dataset

    This dataset was created by Jam0222


  19. A Citation Graph from OpenAlex (Works)

    • databank.illinois.edu
    Updated Jul 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lorran Caetano Machado Lopes; George Chacko (2024). A Citation Graph from OpenAlex (Works) [Dataset]. http://doi.org/10.13012/B2IDB-7362697_V1
    Dataset updated
    Jul 29, 2024
    Authors
    Lorran Caetano Machado Lopes; George Chacko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Illinois: Insper Collaboration
    Description

    This dataset consists of a citation graph. It was constructed by downloading and parsing the Works section of the OpenAlex catalog of the global research system. OpenAlex (see citation below) contains detailed information about scholarly research, including articles, authors, journals, institutions, and their relationships. The data were downloaded on 2024-07-15. The dataset comprises two compressed (.xz) files.

    1) filename: openalexID_integer_id_hasDOI.parquet.xz. The tabular data within contains three columns: openalex_id, integer_id, and hasDOI. Each row represents a record with the following data types:

    • openalex_id: A unique identifier from the OpenAlex catalog.
    • integer_id: An integer representing the new identifier (assigned by the authors)
    • hasDOI: An integer (0 or 1) indicating whether the record has a DOI (0 for no, 1 for yes).

    2) filename: citation_table.tsv.xz. This edgelist of citations has two columns (no header) of integer values that represent citing and cited integer_id, respectively.

    Summary Features

    • Total Nodes (Documents): 256,997,006
    • Total Edges (citations): 2,148,871,058
    • Documents with DOIs: 163,495,446
    • Edges between documents with DOIs: 1,936,722,541 [corrected to 2,148,788,148 edges Nov 13, 2025]
    • Count of unique nodes in edgelist 111,453,719 [updated Nov 13, 2025]

    Note: Nov 13, 2025. An improved curation process will be applied to a future version of this dataset.
    Note: Nov 13, 2025. The code used to generate these files can be found here: https://github.com/illinois-or-research-analytics/lorran_openalex/
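A small sketch of summarising an edgelist with the layout described for citation_table.tsv.xz: two tab-separated integer columns, no header, citing id first and cited id second. The sample rows are invented, and holding every node in memory as done here would likely not scale to the full two-billion-edge file.

```python
import csv
import io
from collections import Counter


def summarize_edgelist(tsv_file):
    """Return (edge_count, unique_node_count, citations_received) for a TSV."""
    nodes = set()
    citations_received = Counter()
    edge_count = 0
    for citing, cited in csv.reader(tsv_file, delimiter="\t"):
        citing_id, cited_id = int(citing), int(cited)
        nodes.update((citing_id, cited_id))
        citations_received[cited_id] += 1
        edge_count += 1
    return edge_count, len(nodes), citations_received


# Hypothetical three-edge stand-in for the decompressed citation table.
sample = io.StringIO("1\t2\n3\t2\n2\t4\n")
edge_count, node_count, citations_received = summarize_edgelist(sample)
```

The compressed file could be opened as text with `lzma.open("citation_table.tsv.xz", "rt", newline="")` and passed to the same function.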

  20. Simple connected graph invariants up to order ten

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Hoppe, Travis; Petrone, Anna (2020). Simple connected graph invariants up to order ten [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_11238
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    NIH, NIDDK
    University of Maryland, Department of Civil Engineering
    Authors
    Hoppe, Travis; Petrone, Anna
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the database file for the Encyclopedia of Finite Graphs and the upcoming paper Integer sequence discovery from small graphs. It contains a collection of invariants for all simple connected graphs up to order 10 and the integer sequences one can make.

Kutay Şahin (2025). Wikipedia Link Graph Dataset - 100K Pages [Dataset]. https://www.kaggle.com/datasets/kutayahin/wikipedia-link-graph-100k

Wikipedia Link Graph Dataset - 100K Pages

100K Wikipedia pages with 28.9M links - Network graph dataset

Available download formats: zip (908552367 bytes)
Dataset updated
Dec 4, 2025
Authors
Kutay Şahin
License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

A comprehensive Wikipedia dataset containing 100,000 pages with 28.9 million links, collected using a breadth-first search (BFS) crawling algorithm. This dataset includes complete page metadata, link relationships, and a network graph representation suitable for network analysis, graph algorithms, NLP research, and machine learning applications.

Dataset Overview

  • Total Pages: 100,000
  • Total Links: 28,855,738 (directed edges)
  • Average Words per Page: 3,531
  • Language: English (en.wikipedia.org)
  • Collection Method: BFS (Breadth-First Search) crawling, depth 5
  • Data Quality Score: 99.76/100

Files Description

1. pages_export.csv

Complete page metadata including:

  • id: Unique page ID
  • title: Page title
  • language: Language code (en)
  • content_length: Content length in characters
  • word_count: Word count
  • categories: JSON array of categories
  • infobox: JSON object of infobox data
  • created_at: Timestamp
  • url: Full Wikipedia URL

Size: ~70 MB | Rows: 100,000

2. links_export.csv

Complete link graph with URLs:

  • id: Unique link ID
  • source_title: Source page title
  • target_title: Target page title
  • language: Language code
  • position: Link position on page
  • depth: Crawl depth where link was discovered
  • created_at: Timestamp
  • source_url: Full source page URL
  • target_url: Full target page URL

Size: ~4.5 GB | Rows: 28,855,738
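
Because links_export.csv is several gigabytes, reading it row by row is likely more practical than loading it whole. The sketch below is illustrative only: the function name out_degree_by_title is hypothetical, and it assumes the file has a header row whose column names match the list above (source_title in particular).

```python
import csv
from collections import Counter
from typing import Iterable


def out_degree_by_title(lines: Iterable[str]) -> Counter:
    """Count outgoing links per source page from CSV lines with a header row.

    Assumes a column named source_title, as listed in the file description.
    """
    reader = csv.DictReader(lines)
    counts: Counter = Counter()
    for row in reader:
        counts[row["source_title"]] += 1
    return counts
```

To run it against the real file, one could pass an open file handle, for example open("links_export.csv", newline="", encoding="utf-8"); the encoding is an assumption.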

3. graph.json

Network graph in JSON format:

  • nodes: Array of node objects with id field
  • edges: Array of edge objects with source and target fields

Size: ~2.1 GB | Edges: 28,855,738
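
The nodes/edges layout of graph.json maps naturally onto an adjacency list. The following sketch shows one way to build it; build_adjacency is a hypothetical name, and the code assumes that edge source and target values refer to node id values, which the description does not state explicitly. Parsing a ~2.1 GB JSON file with json.load in one pass may require considerable memory.

```python
import json
from typing import Any, Dict, List


def build_adjacency(graph: Dict[str, Any]) -> Dict[Any, List[Any]]:
    """Turn {"nodes": [{"id": ...}], "edges": [{"source": ..., "target": ...}]}
    into a mapping from each node id to the list of ids it links to."""
    adjacency: Dict[Any, List[Any]] = {node["id"]: [] for node in graph["nodes"]}
    for edge in graph["edges"]:
        # setdefault covers edges whose source is missing from the nodes array
        adjacency.setdefault(edge["source"], []).append(edge["target"])
    return adjacency


def load_graph(path: str) -> Dict[str, Any]:
    """Read the whole JSON file into memory (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```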

Data Quality

  • Content Coverage: 99.99% (99,992 pages have quality content)
  • Link Quality: 99.22%
  • Uniqueness: 100% (all links are unique)
  • Content Quality: 100% (average 3,531 words per page)
  • Duplicate Pages: Minimal (cleaned)
  • Self-Links: 4,326 (removed)
  • Data Validation: ✅ All entries validated and cleaned

Use Cases

  1. Network Analysis: Study Wikipedia link structure and page connectivity
  2. Graph Algorithms: Test shortest path, centrality, community detection algorithms
  3. NLP Research: Analyze Wikipedia content, categories, and relationships
  4. Machine Learning: Train models on Wikipedia link prediction
  5. Knowledge Graph: Build knowledge graphs from Wikipedia structure
  6. PageRank: Implement and test PageRank algorithms
  7. Recommendation Systems: Build content recommendation systems
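
As an illustration of the PageRank use case, here is a minimal power-iteration sketch over an adjacency mapping (page to list of linked pages). It is a textbook formulation, not something shipped with the dataset; the function name, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for the example.

```python
from typing import Dict, Hashable, List


def pagerank(
    adjacency: Dict[Hashable, List[Hashable]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[Hashable, float]:
    """Approximate PageRank by power iteration on a directed graph."""
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)  # include pages that only appear as link targets
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not adjacency.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in adjacency.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

A pure-Python loop like this is meant for small samples; for the full 28.9M-edge graph, a sparse-matrix or graph-library implementation would probably be a better fit.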

Collection Methodology

  1. Seed Selection: Started with 5 Wikipedia pages
  2. Crawling: BFS algorithm, depth 5
  3. Rate Limiting: Balanced (0.82 pages/second)
  4. Parallel Processing: Optimized concurrent workers
  5. Caching: HTML content cached for efficiency
  6. Validation: All data validated and deduplicated
  7. Quality Control: Automated quality checks and cleaning
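
The seed-and-BFS procedure above can be sketched as follows. This is not the author's crawler: bfs_crawl and its get_links callback are hypothetical, fetching, rate limiting, and caching are left out, and the sketch assumes that a link's depth is the depth of the page on which it was found.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def bfs_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 5,
    max_pages: int = 100_000,
) -> Tuple[List[str], List[Tuple[str, str, int]]]:
    """Breadth-first crawl from seed pages, bounded by depth and page count.

    Returns the crawled pages in visit order and (source, target, depth) edges.
    """
    depth_of: Dict[str, int] = {seed: 0 for seed in seeds}
    queue = deque(depth_of)  # dict keys are already deduplicated
    crawled: List[str] = []
    edges: List[Tuple[str, str, int]] = []
    seen_edges = set()
    while queue and len(crawled) < max_pages:
        page = queue.popleft()
        crawled.append(page)
        depth = depth_of[page]
        for target in get_links(page):
            if target == page or (page, target) in seen_edges:
                continue  # drop self-links and duplicate links
            seen_edges.add((page, target))
            edges.append((page, target, depth))
            if target not in depth_of and depth < max_depth:
                depth_of[target] = depth + 1
                queue.append(target)
    return crawled, edges
```

In a real crawl, get_links would fetch and parse a page; here it can be any function that maps a title to its outgoing link titles.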

Technical Details

  • Database: SQLite with WAL mode
  • Crawl Duration: ~29 hours
  • Crawl Rate: 0.82 pages/second
  • Checkpoint System: Resume-capable crawling
  • Data Cleaning: Automated duplicate removal and quality checks
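
For readers unfamiliar with the SQLite WAL setup mentioned above, the sketch below shows how write-ahead logging can be enabled from Python's standard sqlite3 module. The table layout and the helper names (open_crawl_db, save_link) are assumptions made for illustration; the dataset's actual schema is not published in this description. The UNIQUE constraint with INSERT OR IGNORE is one possible way to keep links unique and is not necessarily what the crawler does.

```python
import sqlite3
from typing import Tuple


def open_crawl_db(path: str) -> Tuple[sqlite3.Connection, str]:
    """Open a file-backed database, switch it to WAL, and return the active mode."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS links (
            id INTEGER PRIMARY KEY,
            source_title TEXT NOT NULL,
            target_title TEXT NOT NULL,
            depth INTEGER,
            UNIQUE (source_title, target_title)
        )
        """
    )
    return conn, mode


def save_link(conn: sqlite3.Connection, source: str, target: str, depth: int) -> None:
    """Insert one link; a repeated (source, target) pair is silently skipped."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO links (source_title, target_title, depth) "
            "VALUES (?, ?, ?)",
            (source, target, depth),
        )
```

Committing after each insert keeps the example simple; batching many rows per transaction would usually be faster for a crawl of this size.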