License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The various performance criteria applied in this analysis include the probability of reaching the ultimate target, the costs, elapsed times and system vulnerability resulting from any intrusion. This Excel file contains all the logical, probabilistic and statistical data entered by a user, and required for the evaluation of the criteria. It also reports the results of all the computations.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains an open and curated scholarly graph that we built as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. The graph represents the European Marine Science (MES) community included in the OpenAIRE Graph. The nodes of the graph we release represent publications, datasets, software, and authors; edges interconnecting research products always have the publication as source and the dataset/software as target. In addition, edges are labeled with semantics that indicate whether the publication is referencing, citing, documenting, or supplementing the related outcome. To curate and enrich node metadata and edge semantics, we relied on information extracted from the PDFs of the publications and from the dataset/software webpages, respectively. We curated the authors so as to remove duplicate nodes representing the same person. The released resource counts 4,047 publications, 5,488 datasets, 22 software, 21,561 authors, and 9,692 edges connecting publications to datasets/software. This graph is in the curated_MES folder.
We provide this resource as:
- a property graph: a dump that can be imported into neo4j;
- jsonl files (5 in total) containing publications, datasets, software, authors, and relationships respectively. Each line of a jsonl file contains a JSON object with the metadata of one node (or one relationship); a minimal loading sketch is given below.
We provide two additional scholarly graphs:
- The curated MES graph with the removed edges. During curation we removed some edges because they were labeled with inconsistent or imprecise semantics. This graph includes the same nodes and edges as the previous one and, in addition, contains the edges removed during the curation pipeline; these edges are marked as Removed. This graph is in the curated_MES_with_removed_semantics folder.
- The original MES community of OpenAIRE, i.e., the MES community extracted from the OpenAIRE Research Graph. This graph has not been curated, and the metadata and semantics are those of the OpenAIRE Research Graph. This graph is in the original_MES_community folder.
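A minimal sketch, assuming one JSON object per line as described above, of how the jsonl files could be read in Python (the file name publications.jsonl is illustrative; check the actual names in the curated_MES folder):

import json
from pathlib import Path

def load_jsonl(path):
    # One JSON object per line: a node (publication, dataset, software, author) or a relationship.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

publications = load_jsonl(Path("curated_MES") / "publications.jsonl")   # illustrative file name
print(len(publications), "publication records")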
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates; in the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
Check the GitHub link to see how I generated the dataset: https://github.com/walidgeuttala/Synthetic-Benchmark-for-Graph-Classification
The Synthetic Network Datasets comprise two distinct sets containing synthetic networks generated via abstract generative models from Network Science. These datasets serve a dual purpose: the first set is utilized for both training and testing the performance of Graph Neural Network (GNN) models on previously unseen samples, while the second set is solely employed to evaluate the generalization ability of the trained models.
Within these datasets, networks are crafted using Erdős-Rényi (ER), Watts-Strogatz (WS), and Barabási-Albert (BA) models. Parameters are deliberately selected to emphasize the unique features of each network family while maintaining consistency in fundamental network statistics across the dataset. Key features considered include average path length (ℓ), transitivity (T), and the structure of the degree distribution, distinguishing between small-world properties, high transitivity, and scale-free distributions.
The datasets encompass eight distinct combinations based on the high and low instances of these three properties. To balance these features, regular lattices were introduced to represent high average path lengths, ensuring each node possesses an equal number of neighbors. This addition involved two interpretations of neighborhood, leading to varying transitivity values.
These datasets are divided into Small-Sized Graphs and Medium-Sized Graphs. The Small Dataset contains 250 samples from each of the eight network types, totaling 2000 synthetic networks, with network sizes randomly selected between 250 and 1024 nodes. Meanwhile, the Medium Dataset includes 250 samples from each network type, summing up to 2000 synthetic networks, but with sizes randomly selected between 1024 and 2048 nodes.
During training, the Small Dataset was split uniformly into training, validation, and testing graphs. The Medium Dataset acts as additional test data to evaluate the models' generalization capability. Parameters governing the average degree were meticulously chosen for each network type within both datasets.
The detailed structure and diverse characteristics of these Synthetic Network Datasets provide a comprehensive platform for training and evaluating GNN models across various network types, aiding in the exploration and understanding of their performance and generalization abilities.
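For orientation, the three base generators are available in networkx; the sketch below draws one graph per family with purely illustrative parameters (the benchmark's actual, carefully balanced parameters are documented in the GitHub repository linked above):

import random
import networkx as nx

def sample_graph(family: str) -> nx.Graph:
    # Size range taken from the Small Dataset description; generator parameters are illustrative only.
    n = random.randint(250, 1024)
    if family == "ER":    # Erdos-Renyi
        return nx.erdos_renyi_graph(n, p=0.02)
    if family == "WS":    # Watts-Strogatz
        return nx.watts_strogatz_graph(n, k=6, p=0.1)
    if family == "BA":    # Barabasi-Albert
        return nx.barabasi_albert_graph(n, m=3)
    raise ValueError(f"unknown family: {family}")

for fam in ("ER", "WS", "BA"):
    g = sample_graph(fam)
    print(fam, g.number_of_nodes(), "nodes, transitivity:", round(nx.transitivity(g), 3))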
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
"id": "ont_k_music_test_n",
"sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
"triples": [
{
"sub": "The Loco-Motion",
"rel": "publication date",
"obj": "01 January 1962"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Gerry Goffin"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Carole King"
}
]
}
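A small sketch, assuming the jsonl layout shown above, of how one might read a ground-truth record and inspect its triples (field names are taken from the example; the file path is illustrative):

import json

# Illustrative path; the actual file names differ per ontology and split.
with open("ont_music_expected_output.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(record["id"], record["sent"])
for t in record["triples"]:
    print((t["sub"], t["rel"], t["obj"]))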
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows:
- benchmark: the code used to generate the benchmark
- evaluation: evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under a CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under a CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 (http://www.caida.org/data/active/as-relationships/). Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots.
Dataset statistics are calculated for the graph with the highest number of nodes, i.e., the snapshot from November 5, 2007.
Nodes 26475
Edges 106762
Nodes in largest WCC 26475 (1.000)
Edges in largest WCC 106762 (1.000)
Nodes in largest SCC 26475 (1.000)
Edges in largest SCC 106762 (1.000)
Average clustering coefficient 0.2082
Number of triangles 36365
Fraction of closed triangles 0.007319
Diameter (longest shortest path) 17
90-percentile effective diameter 4.6
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
as-caida20071105.txt.gz CAIDA AS graph from November 5 2007
as-caida.tar.gz 122 CAIDA AS graphs from January 2004 to November 2007
NOTE for UF Sparse Matrix Collection: these graphs are weighted. In the
original SNAP data set, the edge weights are in the set {-1, 0, 1, 2}. Note
that "0" is an edge weight. This can be handled in the UF collection for the
primary sparse matrix in a Problem, but not when the matrices are in a sequence
in the Problem.aux MATLAB struct. The entries with zero edge weight would
become lost. To correct for this, the weights are modified by adding 2 to each
weight. This preserves the structure of the original graphs, so that edges
with weight zero are not lost. (A non-edge is not the same as an edge with
weight zero in this problem).
Old weight -> new weight:
-1 -> 1
 0 -> 2
 1 -> 3
 2 -> 4
So to obtain the original weights, subtract 2 from each entry.
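For instance, if one of these adjacency matrices has been loaded as a SciPy sparse matrix (a sketch under that assumption; the UF/SuiteSparse loading step itself is omitted, and a toy matrix stands in for Problem.aux.G{k}), the original SNAP weights can be recovered by subtracting 2 from every stored entry:

import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for one of the Problem.aux.G{k} matrices (stored weights are in {1, 2, 3, 4}).
A_uf = csr_matrix(np.array([[0, 3], [2, 0]]))

A_orig = A_uf.copy()
A_orig.data = A_orig.data - 2   # stored 1,2,3,4 -> original -1,0,1,2
# Entries that become 0 remain explicitly stored in the sparse structure,
# so zero-weight edges are not confused with non-edges.
print(A_orig.toarray())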
The primary sparse matrix for this problem is the as-caida20071105 matrix, or
Problem.aux.G{121}, the second-to-the-last graph in the sequence.
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do not come and go. A node that is "gone" simply has no edges.
This is to allow comparisons across each node in the graphs.
Problem.aux.nodenames gives the node numbers of the original problem. So
row/column i in the matrix is always node number Problem.aux.nodenames(i) in
all the graphs.
Problem.aux.G{k} is the kth graph in the sequence.
Problem.aux.Gname(k,:) is the name of the kth graph.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
I have spent some time scraping and shaping PubChem data into a Neo4j graph database. The process took a lot of time, mainly downloading the data and loading it into Neo4j; the whole process took weeks. If you want to build your own, I will show you how to download mine and set it up in less than an hour (most of the time you'll just have to wait). The process of how this dataset is created is described in the following blogs:
- https://medium.com/@nijhof.dns/exploring-neodash-for-197m-chemical-full-text-graph-e3baed9615b8
- https://medium.com/neo4j/combining-3-biochemical-datasets-in-a-graph-database-8e9aafbb5788
- https://medium.com/p/d9ee9779dfbe
The full database is a merge of 3 datasets: PubChem (compounds + synonyms), NCI60 (GI50), and ChEMBL (cell lines). It contains 6 nodes of interest:
● Compound: related to a compound of PubChem. It has 1 property.
○ pubChemCompId: the id within PubChem. So “compound:cid162366967” links to https://pubchem.ncbi.nlm.nih.gov/compound/162366967. This number can be used with both PubChem RDF and PUG.
● Synonym: a name found in the literature. This name can refer to zero, one, or more compounds. This helps find relations between natural language names and the absolute compounds they are related to.
○ Name: natural language name. Can contain letters, spaces, numbers, and any other Unicode character.
○ pubChemSynId: PubChem synonym id as used within the RDF.
● CellLine: these are the ChEMBL cell lines. They hold a lot of information.
○ Name: the name of the cell line.
○ Uri: a unique URI for every element within the ChEMBL RDF.
○ cellosaurusId: the id to connect it to the Cellosaurus dataset, one of the most extensive cell line datasets out there.
● Measurement: a measurement you can do within a biomedical experiment. Currently, only GI50 (the concentration needed for Growth Inhibition of 50%) is added.
○ Name: name of the measurement.
● Condition: a single condition of an experiment. A condition is part of an experiment. Examples are: an individual of the control group, a sample with drug A, or a sample with more CO2.
● Experiment: a collection of multiple conditions all done at the same time with the same bias, meaning we assume all uncontrolled variables are the same.
○ Name: name of the experiment.
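As a quick sanity check after the import, one might query the database from Python with the official neo4j driver. This is only a sketch: the connection URI, credentials, and Cypher pattern are assumptions; the node labels and properties come from the description above, and the relationship type between Synonym and Compound is left unspecified because it is not spelled out here.

from neo4j import GraphDatabase

# Assumed local connection details; adjust to your own Neo4j setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (c:Compound {pubChemCompId: $cid})-[]-(s:Synonym)
RETURN s.Name AS synonym
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query, cid="compound:cid162366967"):
        print(record["synonym"])

driver.close()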
How to download it. Warning: you need 120 GB of free disk space. The compressed file you download is already 30 GB, the uncompressed file is another 30 GB, and the database afterward is 60 GB; roughly 60 GB of the total is only needed temporarily, while the other 60 GB is for the database itself. If you do this on an HDD it will be slow.
If you load this into Neo4j desktop as a local database (like I do) it will scream and yell at you, just ignore this. We are pushing it far further than it is designed for, but it will still work.
Go to this Kaggle dataset and download the dump file. Unzip the file, then delete the zipped file. This part needs 60 GB but only takes 30 by the end of it.
Create a database
Open the Neo4j desktop app, and click “Reveal files in File Explorer”. Move the .dump you downloaded into this folder.
Click on the ... behind the .dump file and click Create new DBMS from dump. This database is a dump from Neo4j V4, so your database also needs to be V4.x.x!
It will now create the database. This will take a long time, it might even say it has timed out. Do not believe this lie! In the background, it is still running. Every time you start it, it will time out. Just let it run and press start later again. The second time it will be started up directly.
Every time I start it up I get the timed-out error. After waiting 10 minutes and clicking start again, the database, and with it more than 200 million nodes, is ready. And you are done! Good luck, and let me know what you build with it.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description: this corpus was designed as an experimental benchmark for a task of signed graph classification. It is composed of three datasets derived from external sources and adapted to our needs:
SpaceOrigin Conversations [1]: a set of conversational graphs, each one associated with either a situation of verbal abuse or a normal situation. These conversations model interactions happening in chatrooms hosted by an MMORPG. The graphs were originally unsigned: we attributed signs to the edges based on the polarity of the exchanged messages.
Correlation Clustering Instances [2]: a set of graphs generated randomly as instances of the Correlation Clustering problem, which consists in partitioning signed graphs. These graphs are not associated with any class in the original paper. We proposed a class based on certain features of the space of optimal solutions explored in [2].
European Parliament Roll-Calls [3]: vote networks extracted from the activity of French Members of the European Parliament. The original data does not have any class associated with the networks: we proposed one based on the number of political factions identified in each network in [3].
These data were used in [4] in order to train and assess various representation learning methods. The authors proposed Signed Graph2vec, a signed variant of Graph2vec, and WSGCN, a whole-graph variant of Signed Graph Convolutional Networks (SGCN), and used an aggregated version of Signed Network Embeddings (SiNE) as a baseline. The article provides more information regarding the properties of the datasets and how they were constituted.
Software: the software used to train the representation learning methods and classifiers is publicly available online: SWGE.
References:
[1] Papegnies, É.; Labatut, V.; Dufour, R. & Linarès, G. Conversational Networks for Automatic Online Moderation. IEEE Transactions on Computational Social Systems, 2019, 6:38-55. DOI: 10.1109/TCSS.2018.2887240 ⟨hal-01999546⟩
[2] Arınık, N.; Figueiredo, R. & Labatut, V. Multiplicity and Diversity: Analyzing the Optimal Solution Space of the Correlation Clustering Problem on Complete Signed Graphs. Journal of Complex Networks, 2020, 8(6):cnaa025. DOI: 10.1093/comnet/cnaa025 ⟨hal-02994011⟩
[3] Arınık, N.; Figueiredo, R. & Labatut, V. Multiple partitioning of multiplex signed networks: Application to European parliament votes. Social Networks, 2020, 60:83-102. DOI: 10.1016/j.socnet.2019.02.001 ⟨hal-02082574⟩
[4] Cécillon, N.; Labatut, V.; Dufour, R. & Arınık, N. Whole-Graph Representation Learning For the Classification of Signed Networks. IEEE Access, 2024, 12:151303-151316. DOI: 10.1109/ACCESS.2024.3472474 ⟨hal-04712854⟩
Funding: part of this work was funded by a grant from the Provence-Alpes-Côte-d'Azur region (PACA, France) and the Nectar de Code company.
Citation: If you use this data or the associated source code, please cite article [4]:
@Article{Cecillon2024,
  author  = {Cécillon, Noé and Labatut, Vincent and Dufour, Richard and Arınık, Nejat},
  title   = {Whole-Graph Representation Learning For the Classification of Signed Networks},
  journal = {IEEE Access},
  year    = {2024},
  volume  = {12},
  pages   = {151303-151316},
  doi     = {10.1109/ACCESS.2024.3472474},
}
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This data-set contains graph topologies of several networks that were analysed in two scientific works and used in several more. The 'topologies' folder contains two sets of data.
The '2014' folder contains about 5000 snapshots of three community networks, namely Freifunk Wien, Freifunk Graz and ninux Rome. This data-set was collected between 2014 and 2015 and is at the base of the work "A week in the life of three large Wireless Community Networks" (link to the paper below); it describes three large-scale wireless mesh networks running in three cities. The data-set is fully described in the paper; here I report the information needed to use it.
- For FFWien and ninux, each snapshot is taken once every 5 minutes; for Graz, one every 10.
- One snapshot corresponds to the real state of the network in a specific moment, correlating the database of active nodes with the topology exported by the routing protocol. Some elaboration has been made to merge into one logical node some nodes that were running multiple instances of the routing protocol in the same physical location (see the paper for details).
- The format is the well-known graphml XML format; you can open the files with networkx, gephi and many more tools (see the sketch below).
- The link weight represents the ETX metric (high = bad, see the paper).
The 'network-evolution' folder contains the network graphs collected for the two networks of Wien and Graz only, but in a different period of time, and with a much larger time span between the snapshots. This data-set was used for the paper "On the Technical and Social Structure of Community Networks", and again represents the physical structure of the network, annotated with link quality from the routing protocol. Format is graphml, metric is ETX.
Finally, the 'mailing_list' folder contains ninux-ml.xml, which contains the interactions in the mailing list of the ninux network, as described in the same paper.
The second part of the data-set was collected and elaborated during the netCommons (see http://netcommons.eu) research project, while the first was collected before, but contributed to the results of the project too. If you use the data, please cite the relevant papers below. If you need more information, feel free to contact me: Leonardo Maccari, Assistant Professor @DISI, University of Trento, Tel: +39 0461 285323, www.disi.unitn.it/~maccari, gpg ID: AABE2BD7, leonardo.maccari(at)unitn.it.
Related Papers:
"A week in the life of three large Wireless Community Networks" https://ans.disi.unitn.it/users/maccari/assets/files/bibliography/Maccari2014Week.pdf
"On the Technical and Social Structure of Community Networks" https://ans.disi.unitn.it/users/maccari/assets/files/bibliography/Maccari2016Technical.pdf
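A minimal sketch of opening one snapshot with networkx, assuming a graphml file from the '2014' folder (the file name is illustrative, and the exact attribute key carrying the ETX metric should be checked on the real files):

import networkx as nx

# Illustrative path; actual snapshot file names in the '2014' folder may differ.
g = nx.read_graphml("topologies/2014/ffwien_snapshot_example.graphml")

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
# Inspect edge attributes to locate the ETX metric (the attribute key may vary).
for u, v, attrs in list(g.edges(data=True))[:5]:
    print(u, v, attrs)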
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first data set is the multi-fidelity band gap data for crystals, and the second data set is the molecular energy data set for molecules.

1. Multi-fidelity band gap data for crystals
The full band gap data used in the paper is located in band_gap_no_structs.gz. Users can use the following code to extract it.
import gzip
import json
with gzip.open("band_gap_no_structs.gz", "rb") as f:
    data = json.loads(f.read())
data is a dictionary with the following format:
{"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
"hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
"gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
"scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
"ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
"disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}
where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database is evolving with time, so it is possible that a certain ID has been removed in the latest release, and there may also be some band gap value changes for the same material.
To get the structure that corresponds to a specific material id in Materials Project, users can use the pymatgen REST API.
1.1. Register at Materials Project (https://www.materialsproject.org) and get an API key.
1.2. In Python, do the following to get the corresponding computational structure.
from pymatgen import MPRester
mpr = MPRester("<Your API Key>")
structure = mpr.get_structure_by_material_id("<mp_id>")
A dump of all the material ids and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from this file for all materials. The structure in this case is a CIF file; users can again use pymatgen to read the CIF string and get the structure.
from pymatgen.core import Structure
structure = Structure.from_str("<cif_string>", fmt='cif')
For the ICSD structures, users are required to have commercial ICSD access; hence those structures are not provided here.

2. Multi-fidelity molecular energy data
The molecule_data.zip contains two datasets in JSON format.
2.1. G4MP2.json contains calculation results at two fidelities, G4MP2 (6095) and B3LYP (130831), on QM9 molecules:
{"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...}, "molecules": {ID: Pymatgen molecule dict, ...}},
"B3LYP": {"U0": {ID: B3LYP energy (eV), ...}, "molecules": {ID: Pymatgen molecule dict, ...}}}
2.2. qm7b.json contains the molecular energy calculation results for 7211 molecules using HF, MP2 and CCSD(T) methods with 6-31g, sto-3g and cc-pvdz bases:
{"molecules": {ID: Pymatgen molecule dict, ...},
"targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
"MP2": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
"CCSD(T)": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)}}, ...}}
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This database collates 3552 development indicators from different studies with data by country and year, including single-year and multiple-year time series. The data is presented as charts; the data can be downloaded from the linked project pages/references for each set, and the data for each presented graph is available as a CSV file as well as a visual download of the graph (both available via the download link under each chart).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: that is, network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, particularly, two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Warning: the ground truth is missing in some of these datasets. This was fixed in version 1.0.1, which you should use instead.
Description: this corpus was designed as an experimental benchmark for a task of signed graph classification. It is composed of three datasets derived from external sources and adapted to our needs:
These data were used in [4] in order to train and assess various representation learning methods. The authors proposed Signed Graph2vec, a signed variant of Graph2vec, and WSGCN, a whole-graph variant of Signed Graph Convolutional Networks (SGCN), and used an aggregated version of Signed Network Embeddings (SiNE) as a baseline. The article provides more information regarding the properties of the datasets and how they were constituted.
Software: the software used to train the representation learning methods and classifiers is publicly available online: SWGE.
References:
Funding: part of this work was funded by a grant from the Provence-Alpes-Côte-d'Azur region (PACA, France) and the Nectar de Code company.
Citation: If you use this data or the associated source code, please cite article [4]:
@Article{Cecillon2024,
  author  = {Cécillon, Noé and Labatut, Vincent and Dufour, Richard and Arınık, Nejat},
  title   = {Whole-Graph Representation Learning For the Classification of Signed Networks},
  journal = {IEEE Access},
  year    = {2024},
  volume  = {12},
  pages   = {151303-151316},
  doi     = {10.1109/ACCESS.2024.3472474},
}
These datasets contain reviews from the Goodreads book review website and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a shelf, to rating, to reading.
Metadata includes
reviews
add-to-shelf, read, review actions
book attributes: title, isbn
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
License: custom license, https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576
This entry contains the data used to implement the bachelor thesis. It was investigated how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.
The Jupyter Notebook 1_data_retrival.ipynb covers the download of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. They form graphs that represent the relationships of supersecondary structures in the proteins and are the data basis for further analyses.
These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; they are added to the database subsequently. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries.
In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.
For the installation of all dependencies, it is recommended to create a Conda environment and then install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:
pip install h5py==3.12.1
pip install seaborn==0.13.2
pip install umap-learn==0.5.7
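A minimal sketch of inspecting one of the embedding files with h5py (the file name and internal layout are assumptions; check the actual .h5 files for their keys):

import h5py

# Illustrative file name; the entry ships three .h5 embedding files.
with h5py.File("embeddings_supersecondary.h5", "r") as f:
    keys = list(f.keys())
    print(len(keys), "entries, first key:", keys[0])
    print("embedding shape:", f[keys[0]][()].shape)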
Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-products
import os.path as osp

import pandas as pd
import datatable as dt
import torch
import torch_geometric as pyg
from ogb.nodeproppred import PygNodePropPredDataset
# Helper for heterogeneous splits; assumed to live in ogb.io.read_graph_raw as in OGB's own dataset code.
from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero

class PygOgbnProducts(PygNodePropPredDataset):
    def __init__(self, meta_csv=None):
        # Point the OGB loader at the local Kaggle copy instead of downloading.
        root, name, transform = '/kaggle/input', 'ogbn-products', None
        if meta_csv is None:
            meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col=0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name=name, root=root, transform=transform, meta_dict=meta_dict)

    def get_idx_split(self, split_type=None):
        if split_type is None:
            split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Reuse a cached split dictionary if one has already been saved.
        if osp.isfile(osp.join(path, 'split_dict.pt')):
            return torch.load(osp.join(path, 'split_dict.pt'))
        if self.is_hetero:
            train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
            for nodetype in train_idx_dict.keys():
                train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
                valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
                test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
            return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
            # datatable's fread is much faster than pandas for these large index files.
            train_idx = dt.fread(osp.join(path, 'train.csv'), header=None).to_numpy().T[0]
            train_idx = torch.from_numpy(train_idx).to(torch.long)
            valid_idx = dt.fread(osp.join(path, 'valid.csv'), header=None).to_numpy().T[0]
            valid_idx = torch.from_numpy(valid_idx).to(torch.long)
            test_idx = dt.fread(osp.join(path, 'test.csv'), header=None).to_numpy().T[0]
            test_idx = torch.from_numpy(test_idx).to(torch.long)
            return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}
dataset = PygOgbnProducts()
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0] # PyG Graph object
Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold in Amazon, and edges between two products indicate that the products are purchased together. The authors follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions followed by a Principal Component Analysis to reduce the dimension to 100.
Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.
Dataset splitting: The authors consider a more challenging and realistic dataset splitting that differs from the one used in [2]. Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without use of a validation set), they use the sales ranking (popularity) to split nodes into training/validation/test sets. Specifically, the authors sort the products according to their sales ranking and use the top 8% for training, the next top 2% for validation, and the rest for testing. This is a more challenging splitting procedure that closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.
Note 1: A very small number of self-connecting edges are repeated; you may remove them if necessary.
Note 2: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.
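If you want to drop the duplicated self-connecting edges mentioned in Note 1, a small sketch using a standard torch_geometric utility (this is not an official preprocessing step of the dataset):

from torch_geometric.utils import remove_self_loops

edge_index = graph.edge_index                  # from `graph = dataset[0]` above
edge_index, _ = remove_self_loops(edge_index)  # drops the self-connecting (and thus duplicated) edges
graph.edge_index = edge_index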
| Package | #Nodes | #Edges | Split Type | Task Type | Metric |
|---|---|---|---|---|---|
| ogb>=1.1.1 | 2,449,029 | 61,859,140 | Sales rank | Multi-class classification | Accuracy |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] http://manikvarma.org/downloads/XC/XMLRepository.html
[2] Wei-Lin Chiang, ...
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
This dataset contains complementary data to the paper "The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations" [1]. Here, we make available two sets of instances of the combinatorial optimization problem studied in that paper, which deals with the spread of information on social networks. We also provide the best known solutions and bounds obtained through computational experiments for each instance.
The first input set includes 300 synthetic instances composed of graphs that resemble real-world social networks. These graphs were produced with a generator proposed in [2]. The second set consists of 14 instances built from graphs obtained by crawling Twitter [3].
The directories "synthetic_instances" and "twitter_instances" contain files that describe both sets of instances, all of which follow the format: the first two lines correspond to:
where
where
where and
The directories "solutions_for_synthetic_instances" and "solutions_for_twitter_instances" contain files that describe the best known solutions for both sets of instances, all of which follow the format: the first line corresponds to:
where is the number of vertices in the solution. Each of the next lines contains:
where
where
Lastly, two files, namely, "bounds_for_synthetic_instances.csv" and "bounds_for_twitter_instances.csv", enumerate the values of the best known lower and upper bounds for both sets of instances.
This work was supported by grants from Santander Bank, Brazil, Brazilian National Council for Scientific and Technological Development (CNPq), Brazil, São Paulo Research Foundation (FAPESP), Brazil.
Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, or FAPESP.
References
[1] F. C. Pereira, P. J. de Rezende. The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations. Submitted. 2023.
[2] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed scale-free graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 132–139, 2003.
[3] C. Schweimer, C. Gfrerer, F. Lugstein, D. Pape, J. A. Velimsky, R. Elsässer, and B. C. Geiger. Generating simple directed social network graphs for information spreading. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 1475–1485, 2022.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of the four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
Subject matter triples file
fb+/-CVT+/-REV: one folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, i.e., domains describing real-world facts.
Example of a row in train.txt, valid.txt, and test.txt:
2, 192, 0
Example of a row in entity2id.txt:
/g/112yfy2xr, 2
Example of a row in relation2id.txt:
/music/album/release_type, 192
Explanation
"/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to these objects.
Type system file
freebase_endtypes: Each row maps an edge type to its required subject type and object type.
Example
92, 47178872, 90
Explanation
"92" and "90" are the type ids of the subject and object of the relationship with id "47178872".
Metadata files
object_types: Each row maps the MID of a Freebase object to a type it belongs to.
Example
/g/11b41c22g, /type/object/type, /people/person
Explanation
The entity with MID "/g/11b41c22g" has the type "/people/person".
object_names: Each row maps the MID of a Freebase object to its textual label.
Example
/g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
Explanation
The entity with MID "/g/11b78qtr5m" has the name "Viroliano Tries Jazz" in English.
object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
Example
/m/05v3y9r, /type/object/id, "/music/live_album/concert"
Explanation
The entity with MID "/m/05v3y9r" can be interpreted by a human as a music concert live album.
domains_id_label: Each row maps the MID of a Freebase domain to its label.
Example
/m/05v4pmy, geology, 77
Explanation
The object with MID "/m/05v4pmy" in Freebase is the domain "geology" and has id "77" in our dataset.
types_id_label: Each row maps the MID of a Freebase type to its label.
Example
/m/01xljxh, /government/political_party, 147
Explanation
The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party" and has id "147" in our dataset.
entities_id_label: Each row maps the MID of a Freebase entity to its label.
Example
/g/11b78qtr5m, Viroliano Tries Jazz, 2234
Explanation
The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz" and has id "2234" in our dataset.
properties_id_label: Each row maps the MID of a Freebase property to its label.
Example
/m/010h8tp2, /comedy/comedy_group/members, 47178867
Explanation
The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has the label "/comedy/comedy_group/members" and has id "47178867" in our dataset.
uri_original2simplified and uri_simplified2original: the mapping between original URIs and simplified URIs, and between simplified URIs and original URIs, respectively.
Example
uri_original2simplified
"http://rdf.freebase.com/ns/type.property.unique": "/type/property/unique"
uri_simplified2original
"/type/property/unique": "http://rdf.freebase.com/ns/type.property.unique"
Explanation
The URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset is simplified into "/type/property/unique" in our dataset, and the identifier "/type/property/unique" in our dataset has the URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the major cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70.
A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose here is to collect characteristics of heart attacks, or the factors that contribute to them, so a form was created to accomplish this using Microsoft Excel. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. Age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and Test-Troponin are the input fields, while the output field pertains to the presence of a heart attack, divided into two categories (negative and positive): negative refers to the absence of a heart attack, while positive refers to its presence. Table 1 shows the detailed information and the minimum and maximum attribute values for the 1319 cases in the whole database. To confirm the validity of this data, we looked at the patient files in the hospital archive and compared them with the data stored in the laboratory system; we also interviewed the patients and specialized doctors. Table 2 shows a sample of 44 cases from the whole database and the factors that lead to a heart attack.
After collecting this data, we checked whether it contained null (invalid) values or errors introduced during data collection. A value is null if it is unknown; null values necessitate special treatment, since this value is used to indicate that the target isn't a valid data element. When trying to retrieve data that isn't present, you can come across the keyword null during processing, and if you do arithmetic operations on a numeric column with one or more null values, the outcome will be null. An example of null value processing is shown in Figure 2.
The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs received equal attention and to eliminate their dimensionality. Prior to the use of AI models, data normalization has two major advantages: the first is to prevent attributes in larger numeric ranges from overshadowing attributes in smaller numeric ranges, and the second is to avoid numerical problems throughout the process.
After completion of the normalization process, we split the data set into two parts, a training set and a test set: 1060 cases were used for training and 259 for testing. Using the input and output variables, modeling was implemented.
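A minimal sketch of the two preprocessing steps described above (min-max scaling to [0, 1] and a 1060/259 train/test split) using scikit-learn; the file and column names are illustrative, not the exact headers of the form:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical CSV export of the Excel form; the actual form has eight inputs and one output.
df = pd.read_csv("heart_attack.csv")
X = df.drop(columns=["Result"])   # eight input fields
y = df["Result"]                  # negative / positive

# Scale every input to the [0, 1] range, as described above.
X_scaled = MinMaxScaler().fit_transform(X)

# 1060 training cases and 259 test cases out of the 1319 total.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, train_size=1060, test_size=259, random_state=0
)
print(len(X_train), len(X_test))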
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data sets used to prepare Figures 2, 4 – 13 and 15 in the article entitled “Structure of semiconducting versus fast-ion conducting glasses in the Ag-Ge-Se system” that will appear in Royal Society Open Science. The files are labelled according to the figure numbers. The data sets were created using the methodology described in the manuscript. Each of the plots was created using Origin software (http://www.originlab.com/). The data set corresponding to a plotted curve within an Origin file can be identified by clicking on that curve. The units for each axis are given on the plots.
The data sets correspond to measurements made on glassy samples along the Agx(Ge0.25Se0.75)(1-x) tie line for x in the range from 0 to 25. Figure 2 gives the mass density, figure 4 gives the glass transition temperature, figures 5 – 6 give measured neutron and x-ray diffraction data sets, figures 7 – 13 show additional neutron diffraction data sets and an analysis of those data sets, and figure 15 shows the results for a model on the composition dependence of the coordination number for Se-Se homopolar bonds.