2 datasets found

Link-prediction on Biomedical Knowledge Graphs
data.niaid.nih.gov
zenodo.org
Updated Jun 25, 2024
Cite
Cattaneo, Alberto; Justus, Daniel; Bonner, Stephen; Martynec, Thomas (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12097376
Dataset provided by
Graphcore (United Kingdom)
Authors
Cattaneo, Alberto; Justus, Daniel; Bonner, Stephen; Martynec, Thomas
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Release of the experimental data from the paper "Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion" (accepted at the Machine Learning for Life and Material Sciences workshop @ ICML 2024).

Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.

Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test sets (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
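
A random 80% / 10% / 10% split of a triple array could look roughly like the sketch below. This is only an illustration, not the authors' preprocessing code: the function name `random_split`, the seed and the toy data are assumptions made for this example.

```python
import numpy as np

def random_split(triples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle (h, r, t) triples and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(triples))
    n_train = int(fractions[0] * len(triples))
    n_valid = int(fractions[1] * len(triples))
    train = triples[perm[:n_train]]
    valid = triples[perm[n_train:n_train + n_valid]]
    test = triples[perm[n_train + n_valid:]]  # everything that is left over
    return train, valid, test

# Toy stand-in for a real graph: 1000 random (h_ID, r_ID, t_ID) rows
toy_triples = np.random.default_rng(1).integers(0, 50, size=(1000, 3))
train, valid, test = random_split(toy_triples)
print(train.shape, valid.shape, test.shape)
```

For the PharMeBINet-style split, the same hypothetical function would be called with `fractions=(0.993, 0.0035, 0.0035)`.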

On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out scores of other (h,r,t') triples contained in the graph.

Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
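
The ranking protocol described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the function name `filtered_tail_rank` is hypothetical, and it assumes that a higher score means a more plausible triple.

```python
import numpy as np

def filtered_tail_rank(scores, true_tail, known_tails=()):
    """Rank of the true tail for a query (h, r, ?), averaging tie handling.

    scores: one score per entity (assumption: higher = more plausible).
    known_tails: IDs of other tails t' such that (h, r, t') is in the graph.
    """
    scores = np.asarray(scores, dtype=float).copy()
    true_score = scores[true_tail]
    # Mask out other known completions so they cannot outrank the true tail
    others = np.array([t for t in known_tails if t != true_tail], dtype=int)
    scores[others] = -np.inf
    # Optimistic rank: ties are resolved in favour of the true tail
    optimistic = 1 + int(np.sum(scores > true_score))
    # Pessimistic rank: ties are resolved against it (the count includes the true tail)
    pessimistic = int(np.sum(scores >= true_score))
    return 0.5 * (optimistic + pessimistic)

toy_scores = [0.1, 0.9, 0.5, 0.5, 0.3]
rank_filtered = filtered_tail_rank(toy_scores, true_tail=2, known_tails=[1])
rank_raw = filtered_tail_rank(toy_scores, true_tail=2)
print(rank_filtered, rank_raw)
```

In the toy query, entity 3 ties with the true tail, which is why the filtered rank is fractional: averaging the two ranks is what allows non-integer values to appear in the released tables.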

Inside experimental_data.zip, the following files are provided for each dataset:

{dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.

test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;

entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);

relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
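
As a hedged example of how the rank table might be used, the sketch below computes the mean reciprocal rank (MRR) and Hits@k per model. The column names are the ones listed above; the function `summarize_ranks` and the toy table are assumptions for illustration, and the real table would be loaded with `pd.read_csv` from wherever the archive was unpacked.

```python
import pandas as pd

MODELS = ["DistMult", "TransE", "RotatE", "TripleRE"]

def summarize_ranks(ranks, k=10):
    """Per-model MRR and Hits@k from a table shaped like test_ranks.csv."""
    rows = {}
    for model in MODELS:
        r = ranks[model].astype(float)
        rows[model] = {"MRR": (1.0 / r).mean(), f"Hits@{k}": (r <= k).mean()}
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy stand-in with the documented columns (values are made up)
toy_ranks = pd.DataFrame({
    "h": [0, 1], "r": [0, 0], "t": [2, 3],
    "DistMult": [1, 4], "TransE": [2, 2],
    "RotatE": [1, 20], "TripleRE": [1.5, 10],
})
summary = summarize_ranks(toy_ranks)
print(summary)
```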

The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
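
One possible way to work with such an archive is sketched below: it checks, for every array in a `.npz` file, how often the true tail appears among the listed predictions. The array names inside the real archive are not stated here, so the sketch iterates over whatever names `np.load` reports; the helper `hits_in_top_k` and the toy archive are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

def hits_in_top_k(top_tails, true_tails):
    """Fraction of queries whose true tail is among the listed predictions."""
    return float(np.mean((top_tails == true_tails[:, None]).any(axis=1)))

# Build a toy archive (2 queries, top-3 predictions) in place of the real file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_predictions.npz")
    np.savez(
        path,
        ModelA=np.array([[3, 1, 2], [0, 4, 5]]),
        ModelB=np.array([[1, 2, 0], [5, 3, 4]]),
    )
    true_tails = np.array([2, 3])  # would come from the "t" column of test_ranks.csv
    with np.load(path) as archive:
        results = {name: hits_in_top_k(archive[name], true_tails)
                   for name in archive.files}
print(results)
```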

All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

Articles metadata from CrossRef

kaggle.com

zip

Updated Aug 1, 2025

Cite

Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata

Available download formats: zip (72982417 bytes)

Authors

Kea Kohv

Description

This data originates from the Crossref API. It contains metadata on the articles in the Data Citation Corpus for which the dataset in the citation pair is identified by a DOI.

How to recreate this dataset in Jupyter Notebook:

1) Prepare the list of articles to query

```python
import pandas as pd

# See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

# Load the citation pairs from the Parquet file
citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

# Remove all rows where "https" is in the 'dataset' column but no "doi.org" is present
citation_pairs = citation_pairs[
    ~(
        citation_pairs['dataset'].str.contains("https", na=False)
        & ~citation_pairs['dataset'].str.contains("doi.org", na=False)
    )
]

# Remove all rows where figshare is in the dataset name
citation_pairs = citation_pairs[
    ~citation_pairs['dataset'].str.contains("figshare", na=False)
]

# Keep only the pairs whose dataset is identified by a DOI
citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
citation_pairs_doi = citation_pairs[citation_pairs['is_doi']].copy()

# Deduplicate the article DOIs and replace "_" with "/" in each of them
articles = list(set(citation_pairs_doi['publication'].to_list()))
articles = [doi.replace("_", "/") for doi in articles]

# Save the list of articles to a file, one DOI per line
with open("articles.txt", "w") as f:
    for article in articles:
        f.write(f"{article}\n")
```

2) Query articles from CrossRef API


%%writefile enrich.py
#!pip install -q aiolimiter
import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
from aiolimiter import AsyncLimiter

# ---------- config ----------
HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
MAX_RPS  = 45           # polite pool limit (50), leave head-room
BATCH_SIZE = 10_000         # rows per INSERT
DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
ARTICLES  = pathlib.Path("articles.txt")
# -----------------------------

# ---- platform tweak: prefer selector loop on Windows ----
if sys.platform == "win32":
  asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# ---- read the DOI list ----
with ARTICLES.open(encoding="utf-8") as f:
  DOIS = [line.strip() for line in f if line.strip()]

# ---- make sure DB & table exist BEFORE the async part ----
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(DB_PATH) as db:
  db.execute("""
    CREATE TABLE IF NOT EXISTS works (
      doi  TEXT PRIMARY KEY,
      json TEXT
    )
  """)
  db.execute("PRAGMA journal_mode=WAL;")   # better concurrency

# ---------- async section ----------
limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
sem   = asyncio.Semaphore(100)        # cap overall concurrency

async def fetch_one(session, doi: str):
  url = f"https://api.crossref.org/works/{doi}"
  async with limiter, sem:
    try:
      async with session.get(url, headers=HEADERS, timeout=10) as r:
        if r.status == 404:         # common “not found”
          return doi, None
        r.raise_for_status()        # propagate other 4xx/5xx
        return doi, await r.json()
    except Exception as e:
      return doi, None            # log later, don’t crash

async def main():
  start = time.perf_counter()
  db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
  db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak

  async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
    for chunk_start in range(0, len(DOIS), BATCH_SIZE):
      slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
      tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
      results = await asyncio.gather(*tasks)    # all tuples, no exc

      good_rows, bad_dois = [], []
      for doi, payload in results:
        if payload is None:
          bad_dois.append(doi)
        else:
          good_rows.append((doi, orjson.dumps(payload).decode()))

      if good_rows:
        db.executemany(
          "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
          good_rows,
        )
        db.commit()

      if bad_dois:                # append for later retry
        with open("failures.log", "a", encoding="utf-8") as fh:
          fh.writelines(f"{d}\n" for d in bad_dois)

      done = chunk_start + len(slice_)
      rate = done / (time.perf_counter() - start)
      print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")

  db.close()

if __name__ == "__main__":
  asyncio.run(main())

Then run in a notebook cell: !python enrich.py
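
After a run, it may be useful to see which DOIs still need another attempt. The sketch below is an assumption-laden helper, not part of the original recipe: `pending_retries` is a hypothetical name, and it only relies on the `works` table and the one-DOI-per-line failure log that the script above writes. The demo uses toy files in a temporary directory rather than the real `crossref.sqlite` and `failures.log`.

```python
import os
import sqlite3
import tempfile

def pending_retries(db_path, failures_path):
    """DOIs from the failure log that are still missing from the works table."""
    with open(failures_path, encoding="utf-8") as fh:
        failed = {line.strip() for line in fh if line.strip()}
    db = sqlite3.connect(db_path)
    try:
        stored = {row[0] for row in db.execute("SELECT doi FROM works")}
    finally:
        db.close()
    return sorted(failed - stored)

# Toy demo: one DOI was stored on a later run, one is still missing
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "toy.sqlite")
    log_path = os.path.join(tmp, "toy_failures.log")
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE works (doi TEXT PRIMARY KEY, json TEXT)")
    db.execute("INSERT INTO works VALUES (?, ?)", ("10.1000/a", "{}"))
    db.commit()
    db.close()
    with open(log_path, "w", encoding="utf-8") as fh:
        fh.write("10.1000/a\n10.1000/b\n10.1000/b\n")
    todo = pending_retries(db_path, log_path)
print(todo)
```

Note that the script logs 404 responses and other errors alike, so some of the listed DOIs may never resolve no matter how often they are retried.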

3) Finally extract the necessary fields

import sqlite3
import orjson
i...
