14 datasets found

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...
zenodo.org
data.niaid.nih.gov
application/gzip, bin +1
Updated Jun 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lele Cao; Lele Cao; Vilhelm von Ehrenheim; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Mark Granroth-Wilding; Richard Anselmo Stahl; Richard Anselmo Stahl; Drew McCornack; Drew McCornack; Armin Catovic; Armin Catovic; Dhiana Deva Cavacanti Rocha; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315
Explore at:
application/gzip, bin, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11391315
Dataset updated
Jun 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lele Cao; Lele Cao; Vilhelm von Ehrenheim; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Mark Granroth-Wilding; Richard Anselmo Stahl; Richard Anselmo Stahl; Drew McCornack; Drew McCornack; Armin Catovic; Armin Catovic; Dhiana Deva Cavacanti Rocha; Dhiana Deva Cavacanti Rocha
Time period covered
May 29, 2024
Description
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. It contains 76 distinct target companies, each of which has 5.3 competitors annotated in average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.

Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

Background and Motivation

In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

While there is no universally agreed definition of company similarity, researchers and practitioners in PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customers' review, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2

Paper: to be published
A study on real graphs of fake news spreading on Twitter
zenodo.org
bin
Updated Aug 20, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amirhosein Bodaghi; Amirhosein Bodaghi (2021). A study on real graphs of fake news spreading on Twitter [Dataset]. http://doi.org/10.5281/zenodo.5225338
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5225338
Dataset updated
Aug 20, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Amirhosein Bodaghi; Amirhosein Bodaghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
*** Fake News on Twitter ***

These 5 datasets are the results of an empirical study on the spreading process of newly fake news on Twitter. Particularly, we have focused on those fake news which have given rise to a truth spreading simultaneously against them. The story of each fake news is as follow:

1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.

2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."

3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.

4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.

5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.

The data collection has been done in two stages that each provided a new dataset: 1- attaining Dataset of Diffusion (DD) that includes information of fake news/truth tweets and retweets 2- Query of neighbors for spreaders of tweets that provides us with Dataset of Graph (DG).

DD

DD for each fake news story is an excel file, named FNx_DD where x is the number of fake news, and has the following structure:

The structure of excel files for each dataset is as follow:

Each row belongs to one captured tweet/retweet related to the rumor, and each column of the dataset presents a specific information about the tweet/retweet. These columns from left to right present the following information about the tweet/retweet:

User ID (user who has posted the current tweet/retweet)

The number of published tweet/retweet by the user at the time of posting the current tweet/retweet

Language of the tweet/retweet

Number of followers

Number of followings (friends)

Date and time of posting the current tweet/retweet

Number of like (favorite) the current tweet had been acquired before crawling it

Number of times the current tweet had been retweeted before crawling it

Is there any other tweet inside of the current tweet/retweet (for example this happens when the current tweet is a quote or reply or retweet)

The source (OS) of device by which the current tweet/retweet was posted

Tweet/Retweet ID

Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)

Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)

Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)

Frequency of tweet occurrences which means the number of times the current tweet is repeated in the dataset (for example the number of times that a tweet exists in the dataset in the form of retweet posted by others)

State of the tweet which can be one of the following forms (achieved by an agreement between the annotators):

r : The tweet/retweet is a fake news post

a : The tweet/retweet is a truth post

q : The tweet/retweet is a question about the fake news, however neither confirm nor deny it

n : The tweet/retweet is not related to the fake news (even though it contains the queries related to the rumor, but does not refer to the given fake news)

DG

DG for each fake news contains two files:

A file in graph format (.graph) which includes the information of graph such as who is linked to whom. (This file named FNx_DG.graph, where x is the number of fake news)

A file in Jsonl format (.jsonl) which includes the real user IDs of nodes in the graph file. (This file named FNx_Labels.jsonl, where x is the number of fake news)

Because in the graph file, the label of each node is the number of its entrance in the graph. For example if node with user ID 12345637 be the first node which has been entered into the graph file then its label in the graph is 0 and its real ID (12345637) would be at the row number 1 (because the row number 0 belongs to column labels) in the jsonl file and so on other node IDs would be at the next rows of the file (each row corresponds to 1 user id). Therefore, if we want to know for example what the user id of node 200 (labeled 200 in the graph) is, then in jsonl file we should look at row number 202.

The user IDs of spreaders in DG (those who have had a post in DD) would be available in DD to get extra information about them and their tweet/retweet. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.
Data from: KGCW 2024 Challenge @ ESWC 2024
zenodo.org
investigacion.usc.gal
+1more
application/gzip
Updated Apr 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dylan Van Assche; Dylan Van Assche; David Chaves-Fraga; David Chaves-Fraga; Anastasia Dimou; Anastasia Dimou; Umutcan Serles; Umutcan Serles; Ana Iglesias; Ana Iglesias (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. http://doi.org/10.5281/zenodo.10973433
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10973433
Dataset updated
Apr 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dylan Van Assche; Dylan Van Assche; David Chaves-Fraga; David Chaves-Fraga; Anastasia Dimou; Anastasia Dimou; Umutcan Serles; Umutcan Serles; Ana Iglesias; Ana Iglesias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2024: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake
in the last decade from compliance to performance optimizations with respect
to execution time. Besides execution time as a metric for comparing knowledge
graph construction, other metrics e.g. CPU or memory usage are not considered.
This challenge aims at benchmarking systems to find which RDF graph
construction system optimizes for metrics e.g. execution time, CPU,
memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources
(CPU and memory usage) for the parameters listed in this challenge, compared
to the state-of-the-art of the existing tools and the baseline results provided
by this challenge. This challenge is not limited to execution times to create
the fastest pipeline, but also computing resources to achieve the most efficient
pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also
collects and aggregates the metrics such as execution time, CPU and memory
usage, necessary for this challenge as CSV files. Moreover, the information
about the hardware used during the execution of the pipeline is available as
well to allow fairly comparing different pipelines. Your pipeline should consist
of Docker images which can be executed on Linux to run the tool. The tool is
already tested with existing systems, relational databases e.g. MySQL and
PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso
which can be combined in any configuration. It is strongly encouraged to use
this tool for participating in this challenge. If you prefer to use a different
tool or our tool imposes technical requirements you cannot solve, please contact
us directly.

Track 1: Conformance

The set of new specification for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provide a set of test-cases for each module:

RML-Core

RML-IO

RML-CC

RML-FNML

RML-Star

These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.

Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

Track 2: Performance

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have more
insights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from the
public transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps
for each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format

The pipeline is executed 5 times from which the median execution time of each
step is calculated and reported. Each step with the median execution time is
then reported in the baseline results with all its measured metrics.
Knowledge graph construction timeout is set to 24 hours.
The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,
you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with the
following files:

Input dataset as CSV.

Mapping file as RML.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is being
evaluated, the number of rows and columns may differ. The first row is always
the header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row.
JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples
0 percent 2000000 triples
25 percent 1500020 triples
50 percent 1000020 triples
75 percent 500020 triples
100 percent 20 triples

Empty values

Scale Number of Triples
0 percent 2000000 triples
25 percent 1500000 triples
50 percent 1000000 triples
75 percent 500000 triples
100 percent 0 triples

Mappings

Scale Number of Triples
1TM + 15POM 1500000 triples
3TM + 5POM 1500000 triples
5TM + 3POM 1500000 triples
15TM + 1POM 1500000 triples

Properties

Scale Number of Triples
1M rows 1 column 1000000 triples
1M rows 10
o
Microsoft Academic Graph
explore.openaire.eu
zenodo.org
Updated Mar 14, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Microsoft Academic (2019). Microsoft Academic Graph [Dataset]. http://doi.org/10.5281/zenodo.2593153
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.2593153
Dataset updated
Mar 14, 2019
Authors
Microsoft Academic
Description
This is the Microsoft Academic Graph data from 2021-09-13. To get this, you'd normally jump through these hoops: https://docs.microsoft.com/en-us/academic-services/graph/get-started-setup-provisioning As required by ODC-BY, I acknowledge Microsoft Academic using the URI https://aka.ms/msracad. You can find out more about the data schema of the Microsoft Academic Graph at: https://web.archive.org/web/20220218202531/https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema Since Microsoft docs are covered by different licensing terms, the documentation cannot be provided along with the data. There were no changes to the files except compressing them with zstd (-T8 -19). This results in a smaller packed size, but still more data than the previous version. The compressed files will expand to the following sizes (output of zstd -l): Compressed Uncompressed Ratio Filename 1.39 MiB 5.30 MiB 3.822 Affiliations.txt.zst 4.45 MiB 15.7 MiB 3.518 AuthorExtendedAttributes.txt.zst 4.29 GiB 17.4 GiB 4.052 Authors.txt.zst 575 KiB 2.55 MiB 4.530 ConferenceInstances.txt.zst 126 KiB 453 KiB 3.598 ConferenceSeries.txt.zst 1.56 MiB 5.96 MiB 3.820 Journals.txt.zst 12.5 GiB 51.7 GiB 4.137 PaperAuthorAffiliations.txt.zst 687 MiB 2.76 GiB 4.116 PaperExtendedAttributes.txt.zst 7.32 GiB 40.5 GiB 5.530 PaperReferences.txt.zst 1.23 MiB 9.72 MiB 7.894 PaperResources.txt.zst 18.5 GiB 72.0 GiB 3.898 Papers.txt.zst 5.59 GiB 34.7 GiB 6.203 PaperUrls.txt.zst -------------------------------------------------- 48.9 GiB 219 GiB 4.484 XXH64 12 files This data is not the whole set what you get to download, there is much more (roughly 160GiB compressed), but the upload quota only permits this much. The additional data is retained and you may ask for it. The additional data is huge, so be prepared to provide sftp, rsync or similar access to drop the files in. If you want to donate an update but lack the bandwidth to download and repack the set, feel free to contact me (details via my ORCiD page), once you have gone through the provisioning steps. I'll either grab the set directly from the azure storage (you might have to give me access rights) or provide an sftp/rsync drop for you to dump the data in. The data for version 2021-09-13 was kindly contributed by Rudolf Siegel
Data from: NeSy4VRD: A Multifaceted Resource for Neurosymbolic AI Research...
zenodo.org
zip
Updated May 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Herron; David Herron; Ernesto Jimenez-Ruiz; Ernesto Jimenez-Ruiz; Giacomo Tarroni; Giacomo Tarroni; Tillman Weyde; Tillman Weyde (2023). NeSy4VRD: A Multifaceted Resource for Neurosymbolic AI Research using Knowledge Graphs in Visual Relationship Detection [Dataset]. http://doi.org/10.5281/zenodo.7931113
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7931113
Dataset updated
May 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
David Herron; David Herron; Ernesto Jimenez-Ruiz; Ernesto Jimenez-Ruiz; Giacomo Tarroni; Giacomo Tarroni; Tillman Weyde; Tillman Weyde
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NeSy4VRD

NeSy4VRD is a multifaceted, multipurpose resource designed to foster neurosymbolic AI (NeSy) research, particularly NeSy research using Semantic Web technologies such as OWL ontologies, OWL-based knowledge graphs and OWL-based reasoning as symbolic components. The NeSy4VRD research resource pertains to the computer vision field of AI and, within that field, to the application tasks of visual relationship detection (VRD) and scene graph generation.

Whilst the core motivation of the NeSy4VRD research resource is to foster computer vision-based NeSy research using Semantic Web technologies such as OWL ontologies and OWL-based knowledge graphs, AI researchers can readily use NeSy4VRD to either: 1) pursue computer vision-based NeSy research without involving Semantic Web technologies as symbolic components, or 2) pursue computer vision research without NeSy (i.e. pursue research that focuses purely on deep learning alone, without involving symbolic components of any kind). This is the sense in which we describe NeSy4VRD as being multipurpose: it can readily be used by diverse groups of computer vision-based AI researchers with diverse interests and objectives.

The NeSy4VRD research resource in its entirety is distributed across two locations: Zenodo and GitHub.

NeSy4VRD on Zenodo: the NeSy4VRD dataset package

This entry on Zenodo hosts the NeSy4VRD dataset package, which includes the NeSy4VRD dataset and its companion NeSy4VRD ontology, an OWL ontology called VRD-World.

The NeSy4VRD dataset consists of an image dataset with associated visual relationship annotations. The images of the NeSy4VRD dataset are the same as those that were once publicly available as part of the VRD dataset. The NeSy4VRD visual relationship annotations are a highly customised and quality-improved version of the original VRD visual relationship annotations. The NeSy4VRD dataset is designed for computer vision-based research that involves detecting objects in images and predicting relationships between ordered pairs of those objects. A visual relationship for an image of the NeSy4VRD dataset has the form <'subject', 'predicate', 'object'>, where the 'subject' and 'object' are two objects in the image, and the 'predicate' describes some relation between them. Both the 'subject' and 'object' objects are specified in terms of bounding boxes and object classes. For example, representative annotated visual relationships are <'person', 'ride', 'horse'>, <'hat', 'on', 'teddy bear'> and <'cat', 'under', 'pillow'>.

Visual relationship detection is pursued as a computer vision application task in its own right, and as a building block capability for the broader application task of scene graph generation. Scene graph generation, in turn, is commonly used as a precursor to a variety of enriched, downstream visual understanding and reasoning application tasks, such as image captioning, visual question answering, image retrieval, image generation and multimedia event processing.

The NeSy4VRD ontology, VRD-World, is a rich, well-aligned, companion OWL ontology engineered specifically for use with the NeSy4VRD dataset. It directly describes the domain of the NeSy4VRD dataset, as reflected in the NeSy4VRD visual relationship annotations. More specifically, all of the object classes that feature in the NeSy4VRD visual relationship annotations have corresponding classes within the VRD-World OWL class hierarchy, and all of the predicates that feature in the NeSy4VRD visual relationship annotations have corresponding properties within the VRD-World OWL object property hierarchy. The rich structure of the VRD-World class hierarchy and the rich characteristics and relationships of the VRD-World object properties together give the VRD-World OWL ontology rich inference semantics. These provide ample opportunity for OWL reasoning to be meaningfully exercised and exploited in NeSy research that uses OWL ontologies and OWL-based knowledge graphs as symbolic components. There is also ample potential for NeSy researchers to explore supplementing the OWL reasoning capabilities afforded by the VRD-World ontology with Datalog rules and reasoning.

Use of the NeSy4VRD ontology, VRD-World, in conjunction with the NeSy4VRD dataset is, of course, purely optional, however. Computer vision AI researchers who have no interest in NeSy, or NeSy researchers who have no interest in OWL ontologies and OWL-based knowledge graphs, can ignore the NeSy4VRD ontology and use the NeSy4VRD dataset by itself.

All computer vision-based AI research user groups can, if they wish, also avail themselves of the other components of the NeSy4VRD research resource available on GitHub.

NeSy4VRD on GitHub: open source infrastructure supporting extensibility, and sample code

The NeSy4VRD research resource incorporates additional components that are companions to the NeSy4VRD dataset package here on Zenodo. These companion components are available at NeSy4VRD on GitHub. These companion components consist of:

comprehensive open source Python-based infrastructure supporting the extensibility of the NeSy4VRD visual relationship annotations (and, thereby, the extensibility of the NeSy4VRD ontology, VRD-World, as well)

open source Python sample code showing how one can work with the NeSy4VRD visual relationship annotations in conjunction with the NeSy4VRD ontology, VRD-World, and RDF knowledge graphs.

The NeSy4VRD infrastructure supporting extensibility consists of:

open source Python code for conducting deep and comprehensive analyses of the NeSy4VRD dataset (the VRD images and their associated NeSy4VRD visual relationship annotations)

an open source, custom-designed NeSy4VRD protocol for specifying visual relationship annotation customisation instructions declaratively, in text files

an open source, custom-designed NeSy4VRD workflow, implemented using Python scripts and modules, for applying small or large volumes of customisations or extensions to the NeSy4VRD visual relationship annotations in a configurable, managed, automated and repeatable process.

The purpose behind providing comprehensive infrastructure to support extensibility of the NeSy4VRD visual relationship annotations is to make it easy for researchers to take the NeSy4VRD dataset in new directions, by further enriching the annotations, or by tailoring them to introduce new or more data conditions that better suit their particular research needs and interests. The option to use the NeSy4VRD extensibility infrastructure in this way applies equally well to each of the diverse potential NeSy4VRD user groups already mentioned.

The NeSy4VRD extensibility infrastructure, however, may be of particular interest to NeSy researchers interested in using the NeSy4VRD ontology, VRD-World, in conjunction with the NeSy4VRD dataset. These researchers can of course tailor the VRD-World ontology if they wish without needing to modify or extend the NeSy4VRD visual relationship annotations in any way. But their degrees of freedom for doing so will be limited by the need to maintain alignment with the NeSy4VRD visual relationship annotations and the particular set of object classes and predicates to which they refer. If NeSy researchers want full freedom to tailor the VRD-World ontology, they may well need to tailor the NeSy4VRD visual relationship annotations first, in order that alignment be maintained.

To illustrate our point, and to illustrate our vision of how the NeSy4VRD extensibility infrastructure can be used, let us consider a simple example. It is common in computer vision to distinguish between thing objects (that have well-defined shapes) and stuff objects (that are amorphous). Suppose a researcher wishes to have a greater number of stuff object classes with which to work. Water is such a stuff object. Many VRD images contain water but it is not currently one of the annotated object classes and hence is never referenced in any visual relationship annotations. So adding a Water class to the class hierarchy of the VRD-World ontology would be pointless because it would never acquire any instances (because an object detector would never detect any). However, our hypothetical researcher could choose to do the following:

use the analysis functionality of the NeSy4VRD extensibility infrastructure to find images containing water (by, say, searching for images whose visual relationships refer to object classes such as 'boat', 'surfboard', 'sand', 'umbrella', etc.);

use free image analysis software (such as GIMP, at gimp.org) to get bounding boxes for instances of water in these images;

use the NeSy4VRD protocol to specify new visual relationships for these images that refer to the new 'water' objects (e.g. <'boat', 'on', 'water'>);

use the NeSy4VRD workflow to introduce the new object class 'water' and to apply the specified new visual relationships to the sets of annotations for the affected images;

introduce class Water to the class hierarchy of the VRD-World ontology (using, say, the free Protege ontology editor);

continue experimenting, now with the added benefit of the additional stuff object class 'water';

contribute the enriched set of NeSy4VRD visual relationship
d
Data from: A consensus method for ancestral recombination graphs
datadryad.org
data.niaid.nih.gov
+1more
zip
Updated Mar 10, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mary K. Kuhner; Jon Yamato (2018). A consensus method for ancestral recombination graphs [Dataset]. http://doi.org/10.5061/dryad.9rt50
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.9rt50
Dataset updated
Mar 10, 2018
Dataset provided by
Dryad
Authors
Mary K. Kuhner; Jon Yamato
Time period covered
2018
Description
Software for computing the ARG consensusThis is a tar gzip archive of all programs needed to compute the ARG consensus as described in our paper, and a documentation file explaining how they are used. In order to use this archive you will need access to a Python interpreter and a C++ compiler. A shell script linking the steps is included, but is optional.argconsense_software.tgz
e
Multisource Field Plot Data for Studies of Vegetation Alliances:...
knb.ecoinformatics.org
Updated Jan 6, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NCEAS 3540: Jennings: The ecology of steppe and grassland vegetation alliances of the Columbia River Basin (Hosted by NCEAS); NCEAS 2180: Peet: An information infrastructure for vegetation science (Hosted by NCEAS); NCEAS 4340: Peet: Tools for vegetation classification and analysis; NCEAS 2840: Reichman: A Knowledge Network for Biocomplexity: Building and evaluating a metadata-based framework for integrating heterogeneous scientific data (Hosted by NCEAS); National Center for Ecological Analysis and Synthesis; Michael Jennings (2015). Multisource Field Plot Data for Studies of Vegetation Alliances: Northwestern USA [Dataset]. http://doi.org/10.5063/AA/nceas.286.1
Explore at:
Unique identifier
https://doi.org/10.5063/AA/nceas.286.1
Dataset updated
Jan 6, 2015
Dataset provided by
Knowledge Network for Biocomplexity
Authors
NCEAS 3540: Jennings: The ecology of steppe and grassland vegetation alliances of the Columbia River Basin (Hosted by NCEAS); NCEAS 2180: Peet: An information infrastructure for vegetation science (Hosted by NCEAS); NCEAS 4340: Peet: Tools for vegetation classification and analysis; NCEAS 2840: Reichman: A Knowledge Network for Biocomplexity: Building and evaluating a metadata-based framework for integrating heterogeneous scientific data (Hosted by NCEAS); National Center for Ecological Analysis and Synthesis; Michael Jennings
Time period covered
Jan 1, 1970 - Dec 31, 2000
Area covered

Description
Although vegetation alliances defined and described in the U.S. National Vegetation Classification are used as a fundamental unit of habitat for modeling species distributions and for conservation assessments, little is known about their ecological characteristics, either generally or individually. A major barrier to understanding alliances better is the lack of primary biotic and physical data about them. In particular, few alliance or association descriptions of the USNVC are based directly on original field plot data. Such data do not exist in the quantity or over the geographic extents necessary, and new field work to acquire such data is unlikely. This study attempts to learn about the efficacy of and limitations to developing the data needed by integrating existing information from multiple sources and themes across the Inland Northwest of the USA. Almost 40,000 field plot records from 11 different sources were integrated, sorted, and evaluated to generate a single standardized database from which plots were classified a priori as members of alliances. Additional data sets of climate, biomass productivity, and morphological traits of plant species were also integrated with the field plot data. The plot records were filtered for eight univariate parameters, species names were standardized, and multivariate outliers of species composition were identified and removed. Field plot records were extracted from the data sets with SQL statements based on existing descriptions of alliances, and these subsets were tested against a null model. Ultimately 21% of the field plots were classified to 49 vegetation alliances. Field plot classifications were corroborated with a nonmetric multidimensional scaling ordination. This study resulted in a large set of primary data critical for the study of vegetation alliances. It shows that it is possible to develop synthetic vegetation field plot data sets from existing multisource information.
Z
Data from: CLARA Knowledge Graph of licensed educational resources
data.niaid.nih.gov
zenodo.org
Updated Oct 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Serrano Alvarado, Patricia (2023). CLARA Knowledge Graph of licensed educational resources [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8403141
Explore at:
Dataset updated
Oct 20, 2023
Dataset provided by
Fakih, Ghinwa
Kieffer, Manoé
Serrano Alvarado, Patricia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
CLARAThis deposit is part of the CLARA project. The CLARA project aims to empower teachers in the task of creating new educational resources. And in particular with the task of handling the licenses of reused educational resources. The present deposit contains the RDF files created using an RDF mapping (RML) and a mapper (Morph-KGC). It also contains the files JSON used as input. The corresponding pipeline can be found on Gitlab. The data used in that pipeline originate from X5GON, a European project aiming to generate and gather open educational resources. Knowledge graph contentThe present Knowledge Graph contains information about 45K Educational Resources (ERs) and 135K subjects (extracted from DBpedia).That information contains

the author, its title and description the license, a URL to the resource itself, the language of the ER, its mimetype, and finally which subject it talks about, and to what extent. That extent is given by two scores: a PageRank score and a Cosinus score. A particularity of the knowledge graph is its heavy use of RDF reification, across large multi-valued properties.Thus four versions of the knowledge graph exist, using Standard reification, Singleton property, Named graphs, and RDF-star. The Knowledge Graph also contains categories originating from DBpedia. They help precise the subjects that are also extracted from DBpedia. The KG.zip files contain five types of files:

Authors_[X].nt - Those contain the authors' nodes, their type, and name. ER_[X].nt/nq/ttl - Those contain the ERs and their information using the respective RDF reification model. categories_skos_[X].ttl - Those contain the hierarchy of DBpedia categories. categories_labels.ttl - This file contains additional information about the categories. categories_article.ttl - This file contains the RDF triples that link the DBpedia subjects to the DBpedia categories.

JSON content The original dataset was cut into multiple JSON files in order to make its processing easier. DBpedia categories were extracted as RDF and aren't present in the JSON files.There are two types of files in the input-json.zip file:

authors_[X].json - Which lists the authors names ER_[X].json - Which lists the ERs and their related information.That information contains:

their title. their description. their language (and language_detected, only the first one is used in the pipeline here). their license. their mimetype. the authors. the date of creation of the resource. a url linking to the resource itself. the subjects (named concepts) associated with the resource. With the corresponding scores.

If you do use this dataset, you can cite this repository:

Kieffer, M., Fakih, G., & Serrano Alvarado, P. (2023). CLARA Knowledge Graph of licensed educational resources [Data set]. Semantics, Leipzig, Germany. Zenodo. https://doi.org/10.5281/zenodo.8403142 Or the corresponding paper

Kieffer, M., Fakih, G. & Serrano-Alvarado, P. (2023). Evaluating Reification with Multi-valued Properties in a Knowledge Graph of Licensed Educational Resources. Semantics, Leipzig, Germany.
Z
Dataset - Clustering Semantic Predicates in the Open Research Knowledge...
data.niaid.nih.gov
zenodo.org
Updated Aug 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arab Oghli, Omar (2022). Dataset - Clustering Semantic Predicates in the Open Research Knowledge Graph [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6513498
Explore at:
Dataset updated
Aug 8, 2022
Dataset authored and provided by
Arab Oghli, Omar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts research paper's title and abstracts as input and recommends existing predicates in the ORKG semantically relevant to the given paper.

The paper instances in the dataset are grouped by ORKG comparisons and therefore the data.json file is more comprehensive than training_set.json and test_set.json.

data.json

The main JSON object consists of a list of comparisons. Each comparisons object has an ID, label, list of papers and list of predicates, whereas each paper object has ID, label, DOI, research field, research problems and abstract. Each predicate object has an ID and a label. See an example instance below.

{ "comparisons": [ { "id": "R108331", "label": "Analysis of approaches based on required elements in way of modeling", "papers": [ { "id": "R108312", "label": "Rapid knowledge work visualization for organizations", "doi": "10.1108/13673270710762747", "research_field": { "id": "R134", "label": "Computer and Systems Architecture" }, "research_problems": [ { "id": "R108294", "label": "Enterprise engineering" } ], "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..." }, .... ], "predicates": [ { "id": "P37126", "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation" }, { "id": "P36081", "label": "approach name" }, .... ] }, .... ] }

training_set.json and test_set.json

The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.

Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, BUT training papers can be duplicated in a concatenation with different comparisons.

{ "instances": [ { "instance_id": "R108331xR108301", "comparison_id": "R108331", "paper_id": "R108301", "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models." }, ... ] }

Dataset Statistics:

- Papers Predicates Research Fields Research Problems Min/Comparison 2 2 1 0 Max/Comparison 202 112 5 23 Avg./Comparison 21,54 12,79 1,20 1,09 Total 4060 1816 46 178

Dataset Splits:

- Papers Comparisons Training Set 2857 214 Test Set 1203 180
UK Admiralty nautical chart series
edmed.seadatanet.org
bodc.ac.uk
nc
Updated Jan 15, 2010
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
United Kingdom Hydrographic Office (2010). UK Admiralty nautical chart series [Dataset]. https://edmed.seadatanet.org/report/570/
Explore at:
ncAvailable download formats
Dataset updated
Jan 15, 2010
Dataset provided by
UK Hydrographic Officehttps://www.gov.uk/ukho
Authors
United Kingdom Hydrographic Office
License
https://vocab.nerc.ac.uk/collection/L08/current/CC/https://vocab.nerc.ac.uk/collection/L08/current/CC/
Time period covered
1995 - Present
Area covered
World,
Description
A series of approximately 3250 navigational charts covering the world. The series is maintained by Admiralty Notices to Mariners issued every week. New editions or new charts are published as required. Two thirds of the series are now available in metric units.

In areas where the United Kingdom is, or until recently has been, the responsible hydrographic authority - i.e. Home Waters, some Commonwealth countries, British colonies, and certain areas like the Gulf, Red Sea and parts of the eastern Mediterranean - the Admiralty charts afford detailed cover of all waters, ports and harbours. These make up about 30 per cent of the total series. Modern charts in these areas usually have a source data diagram showing the sources from which the chart was compiled. The quantity and quality of the sources vary due to age and the part of the world the chart depicts. The other 70 per cent are derived from information on foreign charts, and the Admiralty versions are designed to provide charts for ocean passage and landfall, and approach and entry to the major ports.

The series contains charts on many different scales, but can be divided very broadly as follows:

Route planning 1:10 million Ocean planning 1:3.5 million Coast approach or landfall identification 1:1 million Coasting 1:300,000 to 1:200,000 Intricate or congested coastal waters 1:150,000 to 1:75,000 Port approach 1:50,000 or larger Terminal installation 1:12,500 or larger

Charts on scales smaller than 1:50,000, except in polar regions, are on Mercator projection. Since 1978 all charts on 1:50,000 and larger have been produced on Transverse Mercator projection. Prior to 1978 larger scale charts were on a modified polyconic projection referred to as 'gnomonic', not to be confused with the true Gnomonic projection.

Most of the detail shown on a chart consists of hydrographic information - soundings (selected spot depths) in metres (on older charts in fathoms or feet) reduced to a stated vertical datum; depth contours; dredged channels; and the nature of the seabed and foreshore. Features which present hazards to navigation, fishing and other marine operations are also shown. These include underwater rocks and reefs; wrecks and obstructions; submarine cables and pipelines and offshore installations. Shallow water areas are usually highlighted with pale blue tint(s). Also shown are aids established to assist the navigator - buoys, beacons, lights, fog signals and radio position finding and reporting services; and information about traffic separation schemes, anchorages, tides, tidal streams and magnetic variation. Outline coastal topography is shown especially objects of use as fixing marks. As a base for navigation the chart carries compass roses, scales, horizontal datum information, graduation (and sometimes land map grids), conversion tables and tables of tidal and tidal stream rates.
Monthly development Dow Jones Industrial Average Index 2018-2025
statista.com
Updated Mar 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Monthly development Dow Jones Industrial Average Index 2018-2025 [Dataset]. https://www.statista.com/statistics/261690/monthly-performance-of-djia-index/
Explore at:
Dataset updated
Mar 4, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Jan 2018 - Mar 2025
Area covered
United States
Description
The value of the DJIA index amounted to 43,191.24 at the end of March 2025, up from 21,917.16 at the end of March 2020. Global panic about the coronavirus epidemic caused the drop in March 2020, which was the worst drop since the collapse of Lehman Brothers in 2008. Dow Jones Industrial Average index – additional information The Dow Jones Industrial Average index is a price-weighted average of 30 of the largest American publicly traded companies on New York Stock Exchange and NASDAQ, and includes companies like Goldman Sachs, IBM and Walt Disney. This index is considered to be a barometer of the state of the American economy. DJIA index was created in 1986 by Charles Dow. Along with the NASDAQ 100 and S&P 500 indices, it is amongst the most well-known and used stock indexes in the world. The year that the 2018 financial crisis unfolded was one of the worst years of the Dow. It was also in 2008 that some of the largest ever recorded losses of the Dow Jones Index based on single-day points were registered. On September 29th of 2008, for instance, the Dow had a loss of 106.85 points, one of the largest single-day losses of all times. The best years in the history of the index still are 1915, when the index value increased by 81.66 percent in one year, and 1933, year when the index registered a growth of 63.74 percent.
Benchmark maps and assignments for multi-agent path finding
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 21, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tomáš Rybecký; Tomáš Rybecký; Miroslav Kulich; Miroslav Kulich (2021). Benchmark maps and assignments for multi-agent path finding [Dataset]. http://doi.org/10.5281/zenodo.4439404
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4439404
Dataset updated
Jan 21, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Tomáš Rybecký; Tomáš Rybecký; Miroslav Kulich; Miroslav Kulich
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset

The dataset is produced within the SafeLog project and it is used for benchmarking of multi-agent path planning algorithms. Specifically, the dataset consists of a set of 21 maps with increasing density and a set of 500 random assignments, each for a group of 100 agents for planning on each of the maps.

All of the maps, in the form of a graph G = {V, E}, are built on the same set of 400 vertices V. The sets of edges Ej, where j ∈ (0; 20), in the maps then form a set ranging from a spanning tree to a mostly 4-connected graph. These maps were created by generating a complete square graph with the size of 20*20 vertices. The graph was then simplified to a spanning tree, and, finally, approximately 50 random edges from the complete graph were added 20 times, to create the set of 21 maps of density ranging from 800 to 1500 edges in the graph.

Content and format

The following files are included in the dataset

test_nodes.txt - 400 nodes of a 20*20 square map in the form "id x y"
testAssignment.txt - 50499 random pairs of nodes ids from test_nodes.txt
test_edgesX.txt - pairs of adjacent nodes ids from test_nodes.txt forming edges
- X = 0 - tree
- X = 20 - full graph
- created starting at a full graph and repeatedly erasing edges until a tree remains

To illustrate the maps in the dataset, we provide three images (1008.png, 1190.png, and 1350.png) showing maps with 1008 (1190, 1350) edges.

Citation

If you use the dataset, please cite:

[1] Hvězda, J., Rybecký, T., Kulich, M., and Přeučil, L. (2018). Context-Aware Route Planning for Automated Warehouses. Proceedings of 2018 21st International Conference on Intelligent Transportation Systems (ITSC).

@inproceedings{Hvezda18itsc,
author = {Hvězda, Jakub and Rybecký, Tomáš and Kulich, Miroslav and Přeučil, Libor},
title = {Context-Aware Route Planning for Automated Warehouses},
booktitle = {Proceedings of 2018 21st International Conference on Intelligent Transportation Systems (ITSC)},
publisher = {IEEE Intelligent Transportation Systems Society},
address = {Maui},
year = {2018},
doi = {10.1109/ITSC.2018.8569712},
}

[2] Hvězda, J., Kulich, M., and Přeučil, L. (2019). On Randomized Searching for Multi-robot Coordination. In: Gusikhin O., Madani K. (eds) Informatics in Control, Automation and Robotics. ICINCO 2018. Lecture Notes in Electrical Engineering, vol 613. Springer, Cham.

@inbook{Hvezda19springer,
author = {Hvězda, Jakub and Kulich, Miroslav and Přeučil, Libor},
title = {On Randomized Searching for Multi-robot Coordination},
booktitle = {Informatics in Control, Automation and Robotics},
publisher = {Springer},
address = {Cham, CH},
year = {2019},
series = {Lecture Notes in Electrical Engineering},
language = {English},
url = {https://link.springer.com/chapter/10.1007/978-3-030-31993-9_18},
doi = {10.1007/978-3-030-31993-9},
}

[3] Hvězda, J., Kulich, M., and Přeučil, L. (2018). Improved Discrete RRT for Coordinated Multi-robot Planning. Proceedings of the 15th International Conference on Informatics in Control, Automation and Robotics - (Volume 2).

@inproceedings{Hvezda18icinco,
author = {Hvězda, Jakub and Kulich, Miroslav and Přeučil, Libor},
title = {Improved Discrete RRT for Coordinated Multi-robot Planning},
booktitle = {Proceedings of the 15th International Conference on Informatics in Control, Automation and Robotics - (Volume 2)},
publisher = {SciTePress},
address = {Madeira, PT},
year = {2018},
language = {English},
url = {http://www.scitepress.org/PublicationsDetail.aspx?ID=ppwUqsGaX18=\&t=1},
doi = {10.5220/0006865901710179},
access = {full}
}
Replication Package: Block-based or Graph-Based? Why Not Both? Designing a...
zenodo.org
zip
Updated Nov 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous; Anonymous (2024). Replication Package: Block-based or Graph-Based? Why Not Both? Designing a Hybrid Programming Environment for End-users [Dataset]. http://doi.org/10.5281/zenodo.14232315
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14232315
Dataset updated
Nov 27, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous; Anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Block-based or Graph-based? Why Not Both? Designing a Hybrid Programming Environment for End-users: Replication Package

This repository contains supplementary materials for the paper "Block-based or Graph-based? Why Not Both? Designing a Hybrid Programming Environment for End-users". We provide this data for transparency reasons and to support replications of our experiments.

Note: This package is anonymized for peer review purposes. We will provide contact information for the authors at a later date. We also plan to add interactive versions of our tasks and tutorials for an easier exploration of our study design.

Summary of files contained in this package

This package contains two parts:

The data-analysis/ folder contains the raw dataset we collected for our experiment in CSV format, as well as scripts we used for our analyses.

Column ID contains a unique 4-digit identifier for each participant that they were assigned throughout our study.

Column Group contains the group (Blocks/Graph) that participants were randomly assigned to.

Columns Task1Time and Task2Time contain the time participants spent to complete the two programming tasks of our study in minutes.

Columns Task1Success and Task2Success contain a boolean value indicating whether the participants successfully completed the given task. Note that participants had unlimited attempts until they timed out after a strict time limit of 30 minutes, so if a participant was unsuccessful the corresponding time value is 30.

Columns Task1Tests and Task2Tests contain the number of times a participant executed their code throughout a task, including their final submission if they were successful.

Columns LearnTask, ReadTask and WriteTask contain the scores that participants gave to the task editor component of their assigned programming environment. There are 3 scores for the categories "learnability", "readability" and "writability". Scores are on a 5-point scale from 1 (worst) to 5 (best).

Columns LearnTrig, ReadTrig and WriteTrig contain the scores that participants gave to the trigger editor component of their assigned programming environment. There are 3 scores for the categories "learnability", "readability" and "writability". Scores are on a 5-point scale from 1 (worst) to 5 (best).

Columns LearnComp, ReadComp and WriteComp contain the scores that participants gave to their assigned assigned programming environment in direct comparison to the other alternative. There are 3 scores for the categories "learnability", "readability" and "writability". Unlike in the paper, where scores are on a scale from -2 to 2, the raw scores here are on a 5-point scale from 1 (strong preference for other environment) to 5 (strong preference for own environment).

The script successplot.py was used to generate the success rate plot used in a figure in the paper

The script survival.py was used to perform the survival analysis presented in the paper and generate the related figure.

The script batplot.py was used to generate the 3x3 grid of ratings used in a figure in the paper.

The materials/ folder contains the tutorials and task descriptions we presented to study participants. It also contains the exact wording of pre-screening and post-experiemental survey questions.

The image pre-screening.png shows the three pre-screening questions we used to determine whether our participants could be included in our study.

The images tutorial1_instructions.png and tutorial1_sim.png contain the instructions and initial simulator state we provided to participants for the first programming tutorial. This tutorial did not provide starter code and was identical for both participant groups.

The images tutorial2_instructions.png and tutorial2_sim.png contain the instructions and initial simulator state we provided to participants for the second programming tutorial. This tutorial was identical for both participant groups and provided participants with starter code, which is shown in the images:

tutorial2_code_main.png for the main program in the left canvas

tutorial2_code_move.png for the definition of "Move box to the right".

The images tutorial3_instructions_blocks.png/tutorial3_instructions_graph.png and tutorial3_sim.png contain the instructions and initial simulator state we provided to participants for the third programming tutorial. This tutorial also provided participants with starter code, which is shown in the images:

tutorial3_code_main.png for the main program in the left canvas

tutorial3_code_pick.png for the definition of "Pick up box"

tutorial3_code_place.png for the definition of "Place box"

The images task1_instructions.png and task1_sim.png contain the instructions and initial simulator state we provided to participants for the first programming task. The task did not provide starter code and the instructions were identical for both participant groups.

The images task2_instructions.png and task2_sim.png contain the instructions and initial simulator state we provided to participants for the second programming task. The instructions were identical for both groups. This task also provided participants with starter code, which is shown in the images:

task2_code_main.png for the main program in the left canvas

task2_code_pick_prog.png for the definition of "Pick up block"

task2_code_load_trig_blocks.png/task2_code_load_trig_graph.png for the definition of the trigger "Ready to load machine"

task2_code_load_prog.png for the definition of "Load and activate machine"

task2_code_finished_trig_blocks.png/task2_code_finished_trig_graph.png for the definition of the trigger "Machine finished"

task2_code_finished_prog1.png for the definition of "Get block from machine"

task2_code_finished_prog2.png for the definition of "Place block in bin"

The document post_survey_full.pdf contains a raw export of the comprehension questions and post-experimental survey as they were presented to participants

The image usability.png shows the usability questions we used to determine a participant's rating of their assigned programming environment. The questions were identical for both participant groups.

The images comprehension_blocks_1.png and comprehension_blocks_2.png show the program comprehension questions we used to determine whether participants in the Blocks group could understand more complex triggers.

The images comprehension_graph_1.png and comprehension_graph_2.png show the program comprehension questions we used to determine whether participants in the Graph group could understand more complex triggers.

The images comparison_blocks.png and comparison_graph.png show the images of triggers in the alternative environment that we showed to our participants before choosing their preferred environment. The questions were identical for both participant groups.

The image comparison.png shows the questions we used to determine a participant's preference between the two programming environment alternatives.
Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...
zenodo.org
data.niaid.nih.gov
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7916716
Dataset updated
May 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

An example

An example test sentence:

Test Sentence: {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

An example of ontology:

Ontology: Music Ontology

Expected Output:

{ "id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", "triples": [ { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" },] }

The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

The structure of the repo is as the following.

Text2KGBench

src: the source code used for generation and evaluation, and baseline

benchmark the code used to generate the benchmark

evaluation evaluation scripts for calculating the results

baseline code for generating the baselines including prompts, sentence similarities, and LLM client.

data: the benchmark datasets and baseline data. There are two datasets: wikidata_tekgen and dbpedia_webnlg.

wikidata_tekgen Wikidata-TekGen Dataset

ontologies 10 ontologies used by this dataset

train training data

test test data

manually_verified_sentences ids of a subset of test cases manually validated

unseen_sentences new sentences that are added by the authors which are not part of Wikipedia

test unseen test unseen test sentences

ground_truth ground truth for unseen test sentences.

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

unseen prompts unseen prompts for the unseen test cases

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

unseen results results for the unseen test cases

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

dbpedia_webnlg DBpedia Dataset

ontologies 19 ontologies used by this dataset

train training data

test test data

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Scale	Number of Triples
0 percent	2000000 triples
25 percent	1500020 triples
50 percent	1000020 triples
75 percent	500020 triples
100 percent	20 triples

Scale	Number of Triples
0 percent	2000000 triples
25 percent	1500000 triples
50 percent	1000000 triples
75 percent	500000 triples
100 percent	0 triples

Scale	Number of Triples
1TM + 15POM	1500000 triples
3TM + 5POM	1500000 triples
5TM + 3POM	1500000 triples
15TM + 1POM	1500000 triples

Scale	Number of Triples
1M rows 1 column	1000000 triples
1M rows 10

Facebook

Twitter

Click to copy link

Link copied

Cite

Lele Cao; Lele Cao; Vilhelm von Ehrenheim; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Mark Granroth-Wilding; Richard Anselmo Stahl; Richard Anselmo Stahl; Drew McCornack; Drew McCornack; Armin Catovic; Armin Catovic; Dhiana Deva Cavacanti Rocha; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification

Explore at:

application/gzip, bin, txtAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.11391315

Dataset updated

Jun 4, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Time period covered

May 29, 2024

Description

CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.
Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. It contains 76 distinct target companies, each of which has 5.3 competitors annotated in average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.
Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.
Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

Background and Motivation

In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

While there is no universally agreed definition of company similarity, researchers and practitioners in PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customers' review, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2

Paper: to be published

Clear search

Close search

Google apps

Main menu

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

A study on real graphs of fake news spreading on Twitter

Data from: KGCW 2024 Challenge @ ESWC 2024

Knowledge Graph Construction Workshop 2024: challenge

Track 1: Conformance

Track 2: Performance

Microsoft Academic Graph

Data from: NeSy4VRD: A Multifaceted Resource for Neurosymbolic AI Research...

Data from: A consensus method for ancestral recombination graphs

Multisource Field Plot Data for Studies of Vegetation Alliances:...

Data from: CLARA Knowledge Graph of licensed educational resources

Dataset - Clustering Semantic Predicates in the Open Research Knowledge...

UK Admiralty nautical chart series

Monthly development Dow Jones Industrial Average Index 2018-2025

Benchmark maps and assignments for multi-agent path finding

Replication Package: Block-based or Graph-Based? Why Not Both? Designing a...

Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity QuantificationSee More Versions

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification