License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Different graph types may differ in their suitability to support group comparisons, due to the underlying graph schemas. This study examined whether graph schemas are based on perceptual features (i.e., each graph type, e.g., bar or line graph, has its own graph schema) or common invariant structures (i.e., graph types share common schemas). Furthermore, it was of interest which graph type (bar, line, or pie) is optimal for comparing discrete groups. A switching paradigm was used in three experiments. Two graph types were examined at a time (Experiment 1: bar vs. line, Experiment 2: bar vs. pie, Experiment 3: line vs. pie). On each trial, participants received a data graph presenting the data from three groups and were to determine the numerical difference between group A and group B as displayed in the graph. We scrutinized whether switching the type of graph from one trial to the next prolonged RTs. The slowing of RTs in switch trials in comparison to trials with only one graph type can indicate to what extent the graph schemas differ. As switch costs were observed in all pairings of graph types, none of the different pairs of graph types tested seems to fully share a common schema. Interestingly, there was tentative evidence for differences in switch costs among different pairings of graph types. Smaller switch costs in Experiment 1 suggested that the graph schemas of bar and line graphs overlap more strongly than those of bar graphs and pie graphs or line graphs and pie graphs. This implies that results were not in line with completely distinct schemas for different graph types either. Taken together, the pattern of results is consistent with a hierarchical view according to which a graph schema consists of parts shared for different graphs and parts that are specific for each graph type. Apart from investigating graph schemas, the study provided evidence for performance differences among graph types. We found that bar graphs yielded the fastest group comparisons compared to line graphs and pie graphs, suggesting that they are the most suitable when used to compare discrete groups.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This dataset contains an open and curated scholarly graph we built as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. The graph represents the European Marine Science (MES) community included in the OpenAIRE Graph. The nodes of the graph represent publications, datasets, software, and authors; edges interconnecting research products always have a publication as source and a dataset/software as target. In addition, edges are labeled with semantics that outline whether the publication is referencing, citing, documenting, or supplementing the related outcome. To curate and enrich node metadata and edge semantics, we relied on information extracted from the PDFs of the publications and from the datasets/software webpages, respectively. We curated the authors so as to remove duplicated nodes representing the same person. The released resource counts 4,047 publications, 5,488 datasets, 22 software, 21,561 authors, and 9,692 edges connecting publications to datasets/software. This graph is in the curated_MES folder. We provide this resource as: (i) a property graph, i.e., a dump that can be imported into neo4j; and (ii) 5 jsonl files containing publications, datasets, software, authors, and relationships, respectively, where each line of a jsonl file is a JSON object holding the metadata of one node (or one relationship). We provide two additional scholarly graphs: (a) the curated MES graph with the removed edges: during curation we removed some edges because they were labeled with inconsistent or imprecise semantics; this graph includes the same nodes and edges as the previous one and, in addition, contains the edges removed during the curation pipeline, marked as Removed (folder curated_MES_with_removed_semantics); and (b) the original MES community of OpenAIRE, i.e., the MES community extracted from the OpenAIRE Research Graph; this graph has not been curated, and its metadata and semantics are those of the OpenAIRE Research Graph (folder original_MES_community).
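A minimal Python sketch for working with the jsonl release (file and field names are assumptions based on the description above, not a documented schema):

import json

def load_jsonl(path):
    # One JSON object per line, as in the released jsonl files.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# File names are assumptions based on the folder description above.
publications = load_jsonl("curated_MES/publications.jsonl")
relationships = load_jsonl("curated_MES/relationships.jsonl")

# Edges always point from a publication to a dataset/software;
# the field names ("source", "target", "semantics") are hypothetical.
by_semantics = {}
for rel in relationships:
    key = rel.get("semantics")
    by_semantics.setdefault(key, []).append((rel.get("source"), rel.get("target")))
print({k: len(v) for k, v in by_semantics.items()})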
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Companion data for the creation of a banksia plot.
Background: In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.
Methods: The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the differences in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) datasets with widely varying characteristics, while the second assesses data extraction accuracy by comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs in the accompanying manuscripts.
Results: In the banksia plot of the statistical method comparison, it was clear that there was no difference, on average, in point estimates, and it was straightforward to ascertain which methods resulted in smaller, similar, or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data, it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.
Conclusions: The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.
This collection of files allows the user to create the images used in the companion paper and to amend this code to create their own banksia plots using either Stata version 17 or R version 4.3.1.
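The centring-and-scaling step can be illustrated compactly. A minimal Python sketch (not the companion Stata/R code; function and variable names are ours) of transforming one reference/comparator pair:

def banksia_transform(ref_est, ref_lo, ref_hi, comp_est, comp_lo, comp_hi):
    # Centre the reference point estimate to zero and scale its CI to width one,
    # then apply the identical shift and scale to the comparator analysis.
    shift = ref_est
    scale = ref_hi - ref_lo  # the reference CI spans exactly 1 after scaling
    t = lambda x: (x - shift) / scale
    return (t(ref_est), t(ref_lo), t(ref_hi)), (t(comp_est), t(comp_lo), t(comp_hi))

# Example: reference estimate 2.0 with CI (1.0, 3.0); comparator 2.5 with CI (1.5, 4.0).
ref, comp = banksia_transform(2.0, 1.0, 3.0, 2.5, 1.5, 4.0)
print(ref)   # (0.0, -0.5, 0.5)
print(comp)  # (0.25, -0.25, 1.0)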
Comparing AUROC, exploited biases, use cases and issues of Graph Attention Networks on similarity and patient-centric graphs (directed, reversed directed and undirected) for the classification of sepsis on complete blood count data (higher values represent better performance). We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
After participating in an afterschool program where they used the Common Online Data Analysis Platform (CODAP) to study time-series data about infectious diseases, four middle school students were interviewed to determine how they understood features of and trends within these graphs. Our focus was on how students compared graphs. Students were readily able to compare cumulative/total infection rates between two countries with differently sized populations. It was more challenging for them to link a graph of yearly cases to the corresponding graph of cumulative cases. Students offered reasonable interpretations for spikes or steady periods in the graphs. Time-series graphs are accessible to 11- to 14-year-old students, who were able to make comparisons within and between graphs. Students used proportional reasoning for one comparison task; on the other task, while it was challenging, they were beginning to understand how yearly and cumulative graphs were related. Time-series graphs are ubiquitous and socially relevant: students should study time-series data more regularly in school, and more research is needed on the progression of sense-making with these graphs.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets: (i) Wikidata-TekGen, with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG, with 19 ontologies and 4,860 sentences.
An example test sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example ontology: Music Ontology
Expected Output:
{
"id": "ont_k_music_test_n",
"sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
"triples": [
{
"sub": "The Loco-Motion",
"rel": "publication date",
"obj": "01 January 1962"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Gerry Goffin"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Carole King"
},]
}
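A small Python sketch (the path is hypothetical; the jsonl layout follows the example above) for loading ground-truth files and collecting the relation vocabulary they use:

import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# The path is an assumption; the repository's actual layout may differ.
examples = load_jsonl("data/wikidata_tekgen/ground_truth/ont_music_test.jsonl")

relations = set()
for ex in examples:
    for triple in ex.get("triples", []):
        relations.add(triple["rel"])
print(sorted(relations))  # e.g. ['lyrics by', 'publication date', ...]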
The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The structure of the repo is as follows:
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
License: custom license, https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.18419/DARUS-4231
This dataset contains the supplementary materials to our publication "Collaborative Problem Solving in Mixed Reality: A Study on Visual Graph Analysis", where we report on a study we conducted. Please refer to the publication for more details; the abstract is reproduced at the end of this description. The dataset contains:
The collection of graphs with layout used in the study
The final, randomized experiment files used in the study
The source code of the study prototype
The collected, anonymized data in tabular form
The code for the statistical analysis
The Supplemental Materials PDF
The documents used in the study procedure (English, Italian, German)
Paper abstract: Problem solving is a composite cognitive process, invoking a number of cognitive mechanisms, such as perception and memory. Individuals may form collectives to solve a given problem together, in collaboration, especially when complexity is thought to be high. To determine if and when collaborative problem solving is desired, we must quantify collaboration first. For this, we investigate the practical virtue of collaborative problem solving. Using visual graph analysis, we perform a study with 72 participants in two countries and three languages. We compare ad hoc pairs to individuals and nominal pairs, solving two different tasks on graphs in visuospatial mixed reality. The average collaborating pair does not outdo its nominal counterpart, but it does have a significant trade-off against the individual: an ad hoc pair uses 1.46 times the time to achieve 4.6% higher accuracy. We also use the concept of task instance complexity to quantify differences in complexity. As task instance complexity increases, these differences largely scale, though with two notable exceptions. With this study we show the importance of using nominal groups as a benchmark in collaborative virtual environments research. We conclude that a mixed reality environment does not automatically imply superior collaboration.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This resource consists of four folders:
The gold_standard folder provides the files consisting of manually evaluated triples. The files were exported from ANNit with the following columns:
LEFT*: the source URI of the triple.
RIGHT*: the target URI of the triple.
UserChoice: the user's choice during manual evaluation.
Decision*: the actual decision made by the annotator; its value can only be unknown, remove, or remain.
Comment: a comment, if any.
The only three columns that matter for the evaluation of removed triples in this project are those labelled with *.
The graph_file folder includes the unweighted graphs as well as the two sets of weighted graphs: the graphs with counted weights and the graphs with inferred weights (in the counted_weights and inferred_weights subdirectories, respectively).
The files are gzip-compressed (*.gz). Each file consists of two columns of integers, the source and the target. The integers correspond to URIs; the corresponding mapping files are in the mapping directory.
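A minimal Python sketch for reading one of these files, assuming whitespace-separated integer pairs as described (the file name is invented):

import gzip

def read_edges(path):
    # Gzip-compressed edge list: one "source target" integer pair per line.
    with gzip.open(path, "rt") as f:
        return [tuple(map(int, line.split()[:2])) for line in f if line.strip()]

edges = read_edges("graph_file/unweighted/example_graph.gz")  # invented name
print(len(edges), "edges; first:", edges[0])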
Finally, the corresponding files (of the unweighted graphs) in WebGraph format are provided. These files were used when evaluating our algorithm against existing web-scale feedback-arc-set algorithms.
Should there be any problem with these datasets, please feel free to contact us at the following email address: shuai.wang@vu.nl.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Wikipedia temporal graph.
The dataset is based on two Wikipedia SQL dumps: (1) English language articles and (2) user visit counts per page per hour (aka pagecounts). The original datasets are publicly available on the Wikimedia website.
The static graph structure is extracted from English-language Wikipedia articles. Redirects are removed. Before building the Wikipedia graph, we introduce thresholds on the minimum number of visits per hour and on the maximum in-degree: we remove pages that never reach 500 visits within a single hour during the specified period, and we remove nodes (pages) with in-degree higher than 8,000, to build a more meaningful initial graph. After cleaning, the graph contains 116,016 nodes (out of 4,856,639 pages in total) and 6,573,475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv, or (2) using the enwiki-20150403-graph.gt file, which can be opened with the open-source Python library Graph-Tool.
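A minimal sketch of both import routes (the CSV column layout and the absence of a header row are assumptions):

import csv
import networkx as nx

# Route (1): build the graph from edges.csv.
G = nx.DiGraph()
with open("edges.csv") as f:
    for row in csv.reader(f):
        G.add_edge(int(row[0]), int(row[1]))
print(G.number_of_nodes(), G.number_of_edges())

# Route (2): the .gt file opens directly with Graph-Tool instead:
#   from graph_tool.all import load_graph
#   g = load_graph("enwiki-20150403-graph.gt")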
Time-series data contains users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015; the total number of hours is 5,278. The data is stored in two formats: CSV and H5. The CSV file contains data in the format [page_id :: count_views :: layer], where layer represents an hour. In the H5 file, each layer corresponds to an hour as well.
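A minimal pandas sketch for loading the CSV into a page-by-hour matrix (the delimiter and the absence of a header are assumptions):

import pandas as pd

# Column order follows the stated format [page_id :: count_views :: layer].
ts = pd.read_csv("pagecounts.csv", names=["page_id", "count_views", "layer"])

# One row per page, one column per hourly layer.
matrix = ts.pivot_table(index="page_id", columns="layer",
                        values="count_views", fill_value=0)
print(matrix.shape)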
Bipartite networks, also known as two-mode networks or affiliation networks, are a class of networks in which actors or objects are partitioned into two sets, with interactions taking place across but not within sets. These networks are omnipresent in society, encompassing phenomena such as student-teacher interactions, coalition structures, and international treaty participation. With growing data availability and proliferation in statistical estimators and software, scholars have increasingly sought to understand the methods available to model the data generating processes in these networks. This article compares three methods for doing so: (1) Logit; (2) the bipartite Exponential Random Graph Model (ERGM); and (3) the Relational Event Model (REM). This comparison demonstrates the relevance of choices with respect to dependence structures, temporality, parameter specification, and data structure. Considering the example of Ram Navami, a Hindu festival celebrating the birth of Lord Ram, the ego network of tweets using #RamNavami on April 21, 2021 is examined. The results of the analysis illustrate that critical modeling choices make a difference in the estimated parameters and the conclusions to be drawn from them.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0 (license information was derived automatically)
Check the GitHub link to see how I generated the dataset: https://github.com/walidgeuttala/Synthetic-Benchmark-for-Graph-Classification
The Synthetic Network Datasets comprise two distinct sets containing synthetic networks generated via abstract generative models from Network Science. These datasets serve a dual purpose: the first set is utilized for both training and testing the performance of Graph Neural Network (GNN) models on previously unseen samples, while the second set is solely employed to evaluate the generalization ability of the trained models.
Within these datasets, networks are crafted using Erdős-Rényi (ER), Watts-Strogatz (WS), and Barabási-Albert (BA) models. Parameters are deliberately selected to emphasize the unique features of each network family while maintaining consistency in fundamental network statistics across the dataset. Key features considered include average path length (ℓ), transitivity (T), and the structure of the degree distribution, distinguishing between small-world properties, high transitivity, and scale-free distributions.
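As a rough illustration of how such families can be generated and profiled, a minimal NetworkX sketch (the parameters and size are illustrative, not the datasets' actual generation settings):

import networkx as nx

n = 512  # illustrative; the datasets draw sizes at random

# The three generative families named above.
graphs = {
    "ER": nx.erdos_renyi_graph(n, p=0.02),
    "WS": nx.watts_strogatz_graph(n, k=10, p=0.1),
    "BA": nx.barabasi_albert_graph(n, m=5),
}

for name, G in graphs.items():
    # The key features used to characterise each family.
    ell = nx.average_shortest_path_length(G)  # assumes the graph is connected
    T = nx.transitivity(G)
    print(f"{name}: average path length = {ell:.2f}, transitivity = {T:.3f}")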
The datasets encompass eight distinct combinations based on high and low values of these three properties. To balance these features, regular lattices were introduced to represent high average path lengths, ensuring each node possesses an equal number of neighbors. This addition involved two interpretations of neighborhood, leading to varying transitivity values.
These datasets are divided into Small-Sized Graphs and Medium-Sized Graphs. The Small Dataset contains 250 samples from each of the eight network types, totaling 2000 synthetic networks, with network sizes randomly selected between 250 and 1024 nodes. Meanwhile, the Medium Dataset includes 250 samples from each network type, summing up to 2000 synthetic networks, but with sizes randomly selected between 1024 and 2048 nodes.
During training, the Small Dataset was split uniformly into training, validation, and testing graphs. The Medium Dataset acts as additional test data to evaluate the models' generalization capability. Parameters governing the average degree were meticulously chosen for each network type within both datasets.
The detailed structure and diverse characteristics of these Synthetic Network Datasets provide a comprehensive platform for training and evaluating GNN models across various network types, aiding in the exploration and understanding of their performance and generalization abilities.
License: MIT License, https://opensource.org/licenses/MIT (license information was derived automatically)
Official Website: https://vga.csail.mit.edu
The Visual Graph Arena Benchmark is a collection of six datasets designed to evaluate and enhance the visual reasoning capabilities of AI. The datasets are structured around three primary concepts, each divided into two tasks.
1. Graph Isomorphism Tasks:
- Easy Isomorphism: determining whether two given graphs are isomorphic, with non-isomorphic graphs chosen randomly.
- Hard Isomorphism: determining whether two given graphs are isomorphic, with non-isomorphic graphs being degree-equivalent.
2. Graph Path Finding Tasks
3. Graph Cycle Finding Tasks
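To make the isomorphism tasks concrete, a small NetworkX sketch of the underlying decision problem; the degree-equivalent pair below mirrors the hard regime described above:

import networkx as nx

# Two graphs with identical degree sequences, constructed by hand.
G1 = nx.cycle_graph(6)                                        # one 6-cycle
G2 = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))  # two triangles

# Both are 2-regular on six vertices, yet not isomorphic.
print(sorted(d for _, d in G1.degree()) == sorted(d for _, d in G2.degree()))  # True
print(nx.is_isomorphic(G1, G2))  # False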
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
Machine learning (ML) has become a crucial tool to accelerate research in advanced oxidation processes via predicting reaction parameters to evaluate the treatability of micropollutants (MPs). However, insufficient data sets and an incomplete prediction mechanism remain obstacles to the precise prediction of MP treatability by the hydroxyl radical (HO•), especially when k values approach the diffusion-controlled limit. Herein, we propose a novel graph neural network (GNN) model integrating self-supervised pretraining on a large unlabeled data set (∼10 million) to predict the kHO values of MPs. Our model outperforms commonly used and literature-established ML models on both the whole data sets and the diffusion-controlled-limit data sets. Benefiting from the pretraining process, we demonstrate that the k-value-related chemistry wisdom contained in the pretraining data set is fully exploited, and the learned knowledge can be transferred among data sets. In comparison with molecular fingerprints, we identify that molecular graphs (MGs) cover more structural information beyond substituents, facilitating k-value prediction near the diffusion-controlled limit. In particular, we observe that mechanistic pathways of HO•-initiated reactions can be automatically classified and mapped out on the penultimate layer of our model. This phenomenon shows that the GNN model can be trained to excavate mechanistic knowledge by analyzing the kinetic parameters. These findings not only help interpret the robust model performance but also extrapolate the k-value prediction model to mechanistic elucidation, leading to better decision making in water treatment.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.
Given a set of email messages, each node corresponds to an email address. We create a directed edge from node i to node j if i sent at least one message to j.
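A minimal NetworkX sketch of this construction rule (the email records are invented for illustration):

import networkx as nx

# Each record is (timestamp, sender, recipient).
emails = [
    ("2004-01-05T10:00", "a@inst.eu", "b@inst.eu"),
    ("2004-01-05T10:02", "a@inst.eu", "c@ext.org"),
    ("2004-01-06T09:30", "b@inst.eu", "a@inst.eu"),
]

G = nx.DiGraph()
for _, sender, recipient in emails:
    # Directed edge i -> j iff i sent at least one message to j.
    G.add_edge(sender, recipient)

print(G.number_of_nodes(), G.number_of_edges())  # 3 3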
The Enron email communication network covers all the email communication within a dataset of around half a million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses, and if an address i sent at least one email to address j, the graph contains an undirected edge between i and j. Note that non-Enron email addresses act as sinks and sources in the network, as we only observe their communication with the Enron email addresses.
The Enron email data was originally released by William Cohen at CMU.
Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page, which she and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of Wikipedia page edit history (from January 3, 2008) we extracted all user talk page changes and created a network.
The network contains all the users and discussion from the inception of Wikipedia till January 2008. Nodes in the network represent Wikipedia users and a directed edge from node i to node j represents that user i at least once edited a talk page of user j.
The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5 to 8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.
The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).
Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on the nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
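A small sketch using SNAP's Python bindings (the snap-stanford package); the calls follow the library's documented API, though this is an illustration rather than an excerpt from SNAP's own materials:

import snap

# Generate a random directed graph with 100 nodes and 1,000 edges.
G = snap.GenRndGnm(snap.PNGraph, 100, 1000)

# Structural properties are computed by library routines.
print(G.GetNodes(), G.GetEdges())
print("clustering coefficient:", snap.GetClustCf(G))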
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general-purpose, STL (Standard Template Library)-like library, GLib, developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
Hydrology Graphs
This repository contains the code for the manuscript "A Graph Formulation for Tracing Hydrological Pollutant Transport in Surface Waters." There are three main folders containing code and data, outlined below. We call the framework for building a graph of these hydrological systems "Hydrology Graphs". Several of the data files for building this framework are large and cannot be stored on GitHub. To conserve space, the notebook get_and_unpack_data.ipynb or the script get_and_unpack_data.py can be used to download the data from the Watershed Boundary Dataset (WBD), the National Hydrography Dataset (NHDPlusV2), and the agricultural land dataset for the state of Wisconsin. The files WILakes.df and WIRivers.df mentioned in section 1 below are contained within the WI_lakes_rivers.zip folder, and the files of the 24k Hydro Waterbodies dataset are contained in a zip file under the directory DNR_data/Hydro_Waterbodies. These files can also be unpacked by running the corresponding cells in get_and_unpack_data.ipynb or get_and_unpack_data.py.
1. graph_construction
This folder contains the data and code for building a graph of the watershed-river-waterbody hydrological system. It uses data from the Watershed Boundary Dataset and the National Hydrography Dataset as a basis and builds a list of directed edges. We use NetworkX to build and visualize the list as a graph.
2. case_studies
This folder contains three .ipynb files for three separate case studies. These case studies focus on how "Hydrology Graphs" can be used to analyze pollutant impacts in surface waters. Details of these case studies can be found in the manuscript above.
3. DNR_data
This folder contains data from the Wisconsin Department of Natural Resources (DNR) on water quality in several Wisconsin lakes. The data was obtained using the file Web_scraping_script.py. The original downloaded reports are found in the folder original_lake_reports. These reports were then cleaned and reformatted using the script DNR_data_filter.ipynb; the resulting, cleaned reports are found in the Lakes folder. Each subfolder of the Lakes folder contains data for a single lake. The two .csv files, lake_index_WBIC.csv and lake_index_WBIC_COMID.csv, contain an index of which lake each numbered subfolder corresponds to. We added the corresponding COMID in lake_index_WBIC_COMID.csv by matching the NHDPlusV2 data to the Wisconsin DNR's 24k Hydro Waterbodies dataset. The DNR's reported data only matches lakes to a waterbody identification code (WBIC), so we use HYDROLakes (indexed by WBIC) to match to the COMID. This is also done in the DNR_data_filter.ipynb script.
Python Versions
The .py files in graph_construction/ were run using Python version 3.9.7 with the following packages and version numbers: geopandas (0.10.2), shapely (1.8.1.post1), tqdm (4.63.0), networkx (2.7.1), pandas (1.4.1), numpy (1.21.2).
This dataset is associated with the following publication: Cole, D.L., G.J. Ruiz-Mercado, and V.M. Zavala. A graph-based modeling framework for tracing hydrological pollutant transport in surface waters. Computers and Chemical Engineering, 179: 108457, (2023).
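A minimal sketch of the kind of downstream tracing the framework enables, using NetworkX on a hypothetical edge list (all identifiers are invented for illustration):

import networkx as nx

# Hypothetical directed edges (watershed -> river -> waterbody); the real
# framework derives such lists from the WBD and NHDPlusV2 datasets.
edges = [("HUC12_A", "river_1"), ("river_1", "river_2"), ("river_2", "lake_X")]

G = nx.DiGraph(edges)

# Every node reachable downstream of a pollutant source.
print(nx.descendants(G, "HUC12_A"))  # {'river_1', 'river_2', 'lake_X'}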
License: custom license, https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/YYULL2
Reimplementation of four KG factorization methods and six negative sampling methods.
Abstract: Knowledge graphs are large, useful, but incomplete knowledge repositories. They encode knowledge through entities and relations which define each other through the connective structure of the graph. This has inspired methods for the joint embedding of entities and relations in continuous low-dimensional vector spaces, which can be used to induce new edges in the graph, i.e., link prediction in knowledge graphs. Learning these representations relies on contrasting positive instances with negative ones. Knowledge graphs include only positive relation instances, leaving the door open for a variety of methods for selecting negative examples. In this paper we present an empirical study on the impact of negative sampling on the learned embeddings, assessed through the task of link prediction. We use state-of-the-art knowledge graph embeddings (RESCAL, TransE, DistMult and ComplEx) and evaluate on benchmark datasets (FB15k and WN18). We compare well-known methods for negative sampling and additionally propose embedding-based sampling methods. We note a marked difference in the impact of these sampling methods on the two datasets, with the "traditional" corrupting-positives method leading to the best results on WN18, while embedding-based methods benefit the task on FB15k.
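A minimal sketch of the classic corrupting-positives strategy mentioned above (toy entities and triples; this is not the repository's implementation):

import random

entities = ["Berlin", "Germany", "Paris", "France"]
positives = {("Berlin", "capital_of", "Germany"), ("Paris", "capital_of", "France")}

def corrupt(triple, entities, positives, rng=random):
    # Replace the head or the tail with a random entity, rejecting
    # corruptions that happen to be known positive triples.
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        neg = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if neg not in positives:
            return neg

print(corrupt(("Berlin", "capital_of", "Germany"), entities, positives))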
New York has reported the most COVID-19 cases of any state across the U.S. There have also been critiques regarding how much more unnoticed impact the flu has caused. My dataset allows us to compare whether or not this is true according to the most recent data.
This COVID-19 data is from Kaggle whereas the New York influenza data comes from the U.S. government health data website. I merged the two datasets by county and FIPS code and listed the most recent reports of 2020 COVID-19 cases and deaths alongside the 2019 known influenza cases for comparison.
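A minimal pandas sketch of such a merge (file and column names are assumptions, not the dataset's actual schema):

import pandas as pd

covid = pd.read_csv("covid19_ny_2020.csv")  # county, fips, cases, deaths
flu = pd.read_csv("influenza_ny_2019.csv")  # county, fips, flu_cases

# Merge on the shared county and FIPS keys, as described above.
merged = covid.merge(flu, on=["county", "fips"], how="inner")
merged["covid_to_flu_ratio"] = merged["cases"] / merged["flu_cases"]
print(merged.head())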
I am thankful to Kaggle and the U.S. government for making the data that made this possible openly available.
This data can be extended to answer the common misconceptions of the scale of the COVID-19 and common flu. My inspiration stems from supporting conclusions with data rather than simply intuition.
I would like my data to help answer how we can make U.S. citizens realize what diseases are most impactful.
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/ (license information was derived automatically)
This dataset contains complementary data to the paper "The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations" [1]. Here, we make available two sets of instances of the combinatorial optimization problem studied in that paper, which deals with the spread of information on social networks. We also provide the best known solutions and bounds obtained through computational experiments for each instance.
The first input set includes 300 synthetic instances composed of graphs that resemble real-world social networks. These graphs were produced with a generator proposed in [2]. The second set consists of 14 instances built from graphs obtained by crawling Twitter [3].
The directories "synthetic_instances" and "twitter_instances" contain files that describe both sets of instances, all of which follow the format: the first two lines correspond to:
where
where
where and
The directories "solutions_for_synthetic_instances" and "solutions_for_twitter_instances" contain files that describe the best known solutions for both sets of instances, all of which follow the format: the first line corresponds to:
where is the number of vertices in the solution. Each of the next lines contains:
where
where
Lastly, two files, namely, "bounds_for_synthetic_instances.csv" and "bounds_for_twitter_instances.csv", enumerate the values of the best known lower and upper bounds for both sets of instances.
This work was supported by grants from Santander Bank (Brazil), the Brazilian National Council for Scientific and Technological Development (CNPq), and the São Paulo Research Foundation (FAPESP), Brazil.
Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, or FAPESP.
References
[1] F. C. Pereira, P. J. de Rezende. The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations. Submitted. 2023.
[2] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed scale-free graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 132–139, 2003.
[3] C. Schweimer, C. Gfrerer, F. Lugstein, D. Pape, J. A. Velimsky, R. Elsässer, and B. C. Geiger. Generating simple directed social network graphs for information spreading. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 1475–1485, 2022.
The NP-complete Vertex Cover problem asks to cover all edges of a graph by a small (given) number of vertices. It is among the most prominent graph-algorithmic problems. Following a recent trend in studying temporal graphs (a sequence of graphs, so-called layers, over the same vertex set but, over time, changing edge sets), we initiate the study of Multistage Vertex Cover. Herein, given a temporal graph, the goal is to find for each layer of the temporal graph a small vertex cover and to guarantee that the vertex cover sets of every two consecutive layers differ not too much (specified by a given parameter). We show that, different from classic Vertex Cover and some other dynamic or temporal variants of it, Multistage Vertex Cover is computationally hard even in fairly restricted settings. On the positive side, however, we also spot several fixed-parameter tractability results based on some of the most natural parameterizations.
License: MIT License, https://opensource.org/licenses/MIT (license information was derived automatically)
Measuring the quality of Question Answering (QA) systems is a crucial task to validate the results of novel approaches. However, there are already indicators of a reproducibility crisis, as many published systems have used outdated datasets or subsets of QA benchmarks, making it hard to compare results. We identified the following core problems: there is no standard data format; instead, proprietary data representations are used by the different, partly inconsistent datasets. Additionally, the characteristics of datasets are typically not documented by the dataset maintainers nor by the system publishers. To overcome these problems, we established an ontology, the Question Answering Dataset Ontology (QADO), for representing the QA datasets in RDF. The following datasets were mapped into the ontology: the QALD series, LC-QuAD series, RuBQ series, ComplexWebQuestions, and Mintaka. Hence, the integrated data in QADO covers widely used datasets and multilinguality. Additionally, we performed intensive analyses of the datasets to identify their characteristics, making it easier for researchers to identify specific research questions and to select well-defined subsets. The provided resource will enable the research community to improve the quality of their research and support the reproducibility of experiments.
Here, the mapping results of the QADO process, the SPARQL queries for data analytics, and the archived analytics results file are provided.
Up-to-date statistics can be created automatically by the script provided at the corresponding QADO GitHub RDFizer repository.
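A small sketch of querying the provided RDF with rdflib (the file name is an assumption; the query is schema-agnostic):

from rdflib import Graph

g = Graph()
g.parse("qado-mapping.ttl")  # file name is an assumption

# Generic inspection: count instances per class in the QADO data.
q = """
SELECT ?cls (COUNT(?s) AS ?n)
WHERE { ?s a ?cls }
GROUP BY ?cls
ORDER BY DESC(?n)
"""
for cls, n in g.query(q):
    print(cls, n)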