4 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Köhler, Juliane
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The Python code for splitting a dataset into train, test, and validation sets (see the sketch after this file list).

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The Python code for translating the cleaned dataset.
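
    The following is a minimal sketch of how such a split could be produced with pandas and scikit-learn; the 80/10/10 ratio, random seed, and file handling here are assumptions for illustration, not the actual contents of splitting.py.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the cleaned dataset and split it into train/validation/test (assumed 80/10/10).
    df = pd.read_csv("Cleaned_Dataset.csv")
    train_val, test = train_test_split(df, test_size=0.10, random_state=42)
    train, validation = train_test_split(train_val, test_size=1 / 9, random_state=42)

    print(len(train), len(validation), len(test))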

  2. Data for the manuscript "Spatially resolved uncertainties for machine...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated May 3, 2024
    + more versions
    Cite
    Esther Heid; Johannes Schörghuber; Ralf Wanzenböck; Georg K. H. Madsen (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. http://doi.org/10.5281/zenodo.11093925
    Available download formats: application/gzip, bin, text/x-python
    Dataset updated
    May 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Esther Heid; Johannes Schörghuber; Ralf Wanzenböck; Georg K. H. Madsen
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

    • mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

    • aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

    • data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

    • data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

    • data_h2o.tar.gz contains:

      • full_db.extxyz: The full dataset of 1.5k structures.

      • iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

      • The subfolders within the random and uncertain directories contain the training and validation sets for the random and uncertainty-based active learning loops, respectively.
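
      As a quick way to inspect these structures, here is a minimal sketch assuming the ASE package is installed; the extraction layout and paths inside the archive are assumptions.

      import tarfile
      from ase.io import read

      # Unpack the water dataset archive; the extraction directory below is assumed.
      with tarfile.open("data_h2o.tar.gz") as tar:
          tar.extractall("data_h2o")

      # Read all structures from the full database as a list of ASE Atoms objects.
      structures = read("data_h2o/full_db.extxyz", index=":")
      print(len(structures), "structures loaded")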

  3. Blog-1K

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 21, 2022
    Cite
    Haining Wang (2022). Blog-1K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7455622
    Dataset updated
    Dec 21, 2022
    Dataset authored and provided by
    Haining Wang
    License

    ISC License: https://www.isc.org/downloads/software-support-policy/isc-license/

    Description

    The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
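
    As a quick integrity check, a minimal sketch of MD5 verification follows; it assumes the checksum refers to the blog1000.csv.gz file used in the usage example below.

    import hashlib

    # Compute the MD5 of the downloaded file in chunks and compare it to the published checksum.
    md5 = hashlib.md5()
    with open("blog1000.csv.gz", "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    assert md5.hexdigest() == "0a9e38740af9f921b6316b7f400acf06"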

    1. Preprocessing

    We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:

    • accumulatively at least 10,000 characters,
    • accumulatively at most 49,410 characters,
    • accumulatively at least 16 posts,
    • accumulatively at most 40 posts, and
    • each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).

    Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.

    2. Statistics

    Its creation and statistics can be found in the Jupyter Notebook.

        Split        # Authors   # Posts   # Characters   Avg. Characters Per Author (Std.)   Avg. Characters Per Post (Std.)
        Train        1,000       16,132    30,092,057     30,092 (5,884)                       1,865 (1,007)
        Validation   935         2,017     3,755,362      4,016 (2,269)                        1,862 (999)
        Test         924         2,017     3,732,448      4,039 (2,188)                        1,850 (936)

    3. Usage

    import pandas as pd

    df = pd.read_csv('blog1000.csv.gz', compression='infer')

    # read in training data
    train_text, train_label = zip(*df.loc[df.split == 'train'][['text', 'id']].itertuples(index=False))

    4. License

    All materials are licensed under the ISC License.

    5. Contact

    Please contact the maintainer with questions.

  4. Link-prediction on Biomedical Knowledge Graphs

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 25, 2024
    Cite
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Thomas Martynec (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
    Available download formats: zip
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Thomas Martynec
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jun 25, 2021
    Description

    Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

    Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, such as drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though the theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions, we invite the community to build upon our work and continue improving the understanding of these crucial applications.
    Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation, and test sets (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
    On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, and TripleRE. Hyperparameters were tuned on the validation split, and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out scores of other (h,r,t') triples contained in the graph.
    Note: the ranks provided are computed as the average of the optimistic and pessimistic ranks of triple scores.
    Inside experimental_data.zip, the following files are provided for each dataset:
    • {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
    • test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;
    • entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
    • relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).

    The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
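
    As an illustration of how the released ranks might be consumed, here is a minimal sketch that computes mean reciprocal rank (MRR) and Hits@10 per model from a test_ranks.csv table; the file path after unzipping is an assumption.

    import pandas as pd

    # Load the per-triple tail-prediction ranks for one dataset (path is illustrative).
    ranks = pd.read_csv("experimental_data/hetionet/test_ranks.csv")

    # Standard link-prediction metrics computed from the ranks of the ground-truth tails.
    for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
        mrr = (1.0 / ranks[model]).mean()
        hits_at_10 = (ranks[model] <= 10).mean()
        print(f"{model}: MRR={mrr:.3f}, Hits@10={hits_at_10:.3f}")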

    All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

