CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An automatically assembled dataset of raw semantic mappings produced by python -m semra.database
. This incorporates mappings from the following places:
This is a database of raw mapping without further processing. For processed mapping datasets, we suggest smaller domain-specific processing rules (see https://github.com/biopragmatics/semra/tree/main/notebooks/landscape for examples). It can be accessed directly via:
mappings.sssom.tsv.gz
- loadable through any tools supporting SSSOMmappings.jsonl.gz
- loadable through SeMRA using semra.from_jsonl
sh run_on_docker.sh
from the command lineMappings are licensed according to their primary resources. These are explicitly annotated in the SSSOM file on each row (when available) and on the mapping set level in the Neo4j graph database artifacts.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for paper: "A Recommender System of Buggy App Checkers for App Store Moderators", published on the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.
Dataset Collection We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store on January and March 2014. Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we updated the dataset twice. The dataset D1 contains available apps in the Google Play Store in January 2014. Then, we created a new snapshot (D2) of the Google Play Store in March 2014.
The apps belong to the 27 different categories defined by Google (at the time of writing the paper), and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g. tools-free, tools-paid, sports-new_free, etc.), we collected a maximum of 500 samples, resulting in a median number of 1.978 apps per category.
For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star counting, and the set of permission requests.
In addition, for each app, we collected up to a maximum of the latest 500 reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, thus several reviews lack some of these details. From all the reviews attached to an app, we only considered the reviews associated with the latest version of the app —i.e., we discarded unversioned and old-versioned reviews. Thus, resulting in a corpus of 1,402,717 reviews (2014 Jan.).
Dataset Stats Some stats about the datasets:
D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.
D2 (Mar. 2014) contains 46,644 apps and 9,319 different permission requests, and 1,361,319 user reviews.
Additional stats about the datasets are available here.
Dataset Description To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because the graph visualization helps to identify connections among data (e.g., clusters of apps sharing similar sets of permission requests).
In particular, our dataset graph contains six types of nodes: - APP nodes containing metadata of each app, - PERMISSION nodes describing permission types, - CATEGORY nodes describing app categories, - SUBCATEGORY nodes describing app subcategories, - USER_REVIEW nodes storing user reviews. - TOPIC topics mined from user reviews (using LDA).
Furthermore, there are five types of relationships between APP nodes and each of the remaining nodes:
Dataset Files Info
Neo4j 2.0 Databases
googlePlayDB1-Jan2014_neo4j_2_0.rar
googlePlayDB2-Mar2014_neo4j_2_0.rar We provide two Neo4j databases containing the 2 snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. The databases were created with Neo4j 2.0. In particular with the tool version 'Neo4j 2.0.0-M06 Community Edition' (latest version available at the time of implementing the paper in 2014).
Neo4j 3.5 Databases
googlePlayDB1-Jan2014_neo4j_3_5_28.rar
googlePlayDB2-Mar2014_neo4j_3_5_28.rar Currently, the version Neo4j 2.0 is deprecated and it is not available for download in the official Neo4j Download Center. We have migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28. The databases can be opened with the tool version: 'Neo4j Community Edition 3.5.28'. The tool can be downloaded from the official Neo4j Donwload page.
In order to open the databases with more recent versions of Neo4j, the databases must be first migrated to the corresponding version. Instructions about the migration process can be found in the Neo4j Migration Guide.
First time the Neo4j database is connected, it could request credentials. The username and pasword are: neo4j/neo4j
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic dataset created with GPT-4o
Synthetic dataset of text2cypher over 16 different graph schemas. Questions were generated using GPT-4-turbo, and the corresponding Cypher statements with gpt-4o using Chain of Thought. Here, there are only questions that return results when queried against the database. For more information visit: https://github.com/neo4j-labs/text2cypher/tree/main/datasets/synthetic_gpt4o_demodbs Dataset is available as train.csv. Columns are the following:… See the full description on the dataset page: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was constructed as a small example of a graph mimicking an e-commerce site where people can also follow each others.
The files here reported can be used to build a property graph in Neo4J:
This data was used as motivating example dataset in the paper "MINE GRAPH RULE: A New GQL Operator for Mining Association Rules in Property Graph Databases".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work presents a threat modelling approach to represent changes to the attack paths through an Internet of Things (IoT) environment when the environment changes dynamically, that is, when new devices are added or removed from the system or when whole sub-systems join or leave. The proposed approach investigates the propagation of threats using attack graphs, a popular attack modelling method. However, traditional attack-graph approaches have been applied in static environments that do not continuously change, such as enterprise networks, leading to static and usually very large attack graphs. In contrast, IoT environments are often characterised by dynamic change and interconnections; different topologies for different systems may interconnect with each other dynamically and outside the operator’s control. Such new interconnections lead to changes in the reachability amongst devices according to which their corresponding attack graphs change. This requires dynamic topology and attack graphs for threat and risk analysis. This article introduces an example scenario based on healthcare systems to motivate the work and illustrate the proposed approach. The proposed approach is implemented using a graph database management tool (GDBM), Neo4j, which is a popular tool for mapping, visualising, and querying the graphs of highly connected data. It is efficient in providing a rapid threat modelling mechanism, making it suitable for capturing security changes in the dynamic IoT environment. Our results show that our developed threat modelling approach copes with dynamic system changes that may occur in IoT environments and enables identifying attack paths, whilst allowing for system dynamics. The developed dynamic topology and attack graphs can cope with the changes in the IoT environment efficiently and rapidly by maintaining their associated graphs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs
Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command, see also the neo4j documentation: https://neo4j.com/docs/
/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph
Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")
:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log
:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationship - placeholder for any structural relationship between two :Entity nodes
The concepts a further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552
Data Contents
-------------
neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1
This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data: (1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based flag and the Goods Receipt flags set to true). (2) 3-way matching, invoice before goods receipt: Purchase Items that do require a goods receipt message, while they do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flags set to true). For such purchase items, invoices can be entered before the goods are receipt, but they are blocked until goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches with the invoice and the value at creation of the item. (3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based flag and the Goods Receipt flags set to false). (4)Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here we see GR indicator is set to true but the GR IV flag is set to false and also we know by item type (consignment) that we do not expect an invoice against this item. Unfortunately, the complexity of the data goes further than just this division in four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices which are subsequently paid. Consider for example the process of paying rent. There is a Purchase Document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable) and the invoices have to match for the process to be compliant. Of course, the log is anonymized, but some semantics are left in the data, for example: The resources are split between batch users and normal users indicated by their name. The batch users are automated processes executed by different systems. The normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e. addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them to business insights about real customers and real purchase documents.
The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded: concept:name: A combination of the purchase document id and the item id, Purchasing Document: The purchasing document ID, Item: The item ID, Item Type: The type of the item, GR-Based Inv. Verif.: Flag indicating if GR-based invoicing is required (see above), Goods Receipt: Flag indicating if 3-way matching is required (see above), Source: The source system of this item, Doc. Category name: The name of the category of the purchasing document, Company: The subsidiary of the company from where the purchase originated, Spend classification text: A text explaining the class of purchase item, Spend area text: A text explaining the area for the purchase item, Sub spend area text: Another text explaining the area for the purchase item, Vendor: The vendor to which the purchase document was sent, Name: The name of the vendor, Document Type: The document type, Item Category: The category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment).
The data contains the following entities and their events
- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased
Data Size
---------
BPIC19, nodes: 1926651, relationships: 15082099
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousands of defect reports over a period of 18 years (1999-2017) to capture the inter-relationships among duplicate defects.
File Descriptions
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Knowledge Graph (KG) is designed with a focus on IMO number, ship name, and ship type as central identifiers, ensuring that every node representing a vessel is anchored by these unique values. This design choice allows for a clear and consistent visual representation where each ship is easily identifiable, and all associated data is linked directly to these identifiers.
In the Neo4j graph, the visualization is structured so that nodes and relationships form an interconnected web of information. The graph includes nodes for CO₂ emissions, fuel usage, KPIs, time information, and basic registry details. The right-hand panel in the Neo4j interface is particularly useful; it displays detailed properties of any selected node. For example, when a user selects a node representing CO₂ emissions, the panel highlights specific data points such as annual CO₂ emissions per distance traveled and per unit of work performed. This detailed view is essential for understanding the environmental performance of a vessel. It provides stakeholders with an at-a-glance summary of key performance metrics and allows them to drill down into the data for further analysis.
The visual structure of the Knowledge Graph also reinforces the interconnected nature of the data. The graph not only displays nodes but also shows the relationships between them, which represent how different aspects of a vessel’s performance are linked. For instance, a ship node (BasicInfo) is visually connected to multiple other nodes, such as Fuel, CO₂, TimeInfo, KPI, and CargoTransport. Since IMO number, ship name, and ship type serve as the primary unique identifiers, they act as the backbone that links all related data tables and relationships within the graph. This interconnected layout makes it immediately apparent which vessels have similar performance profiles, and which operational metrics are closely correlated.
By leveraging Neo4j’s powerful visualization capabilities, the KG transforms static data into a dynamic and interactive tool for real-time decision-making.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Translated Text2Cypher'24 Training Set - Sampled & Multilingual
This dataset provides a sampled and translated training set based on the Neo4j Text2Cypher '24 dataset. It is designed to support research on multilingual natural language to Cypher query generation. We offer two versions of the training set:
1. Multilingual Version (multilang)
Total examples: ~36,000
Languages: English (en), Spanish (es), Turkish (tr)
Samples per language: ~12,000
Translation… See the full description on the dataset page: https://huggingface.co/datasets/mgoNeo4j/translated_text2cypher24_trainset_sampled.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains all the datasets used for the performance evaluation of the MINE GRAPH RULE operator proposed in the paper "MINE GRAPH RULE: A New GQL Operator for Mining Association Rules in Property Graph Databases".
Each folder contains the following files used to create a property graph in Neo4j with a fixed schema mimicking an e-commerce site.
The folders contain various graph instances with differing dimensions, and each folder is named to reflect its defining features. The features in the name are given in this order:
(Example: the folder 10000_0.5_0.0005_0.1_0.0005_dataset contains files of a graph with 10000 nodes, of which half of them are Person nodes, 0.0005 is the probability of having a relationship FOLLOW between two Person nodes, 0.1 is the probability of having a relationship BUY between a Person node and an Item node, and 0.0005 is the probability of having a relationship RECOMMEND between a Person node and an Item node).
Not seeing a result you expected?
Learn how you can add new datasets to our index.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An automatically assembled dataset of raw semantic mappings produced by python -m semra.database
. This incorporates mappings from the following places:
This is a database of raw mapping without further processing. For processed mapping datasets, we suggest smaller domain-specific processing rules (see https://github.com/biopragmatics/semra/tree/main/notebooks/landscape for examples). It can be accessed directly via:
mappings.sssom.tsv.gz
- loadable through any tools supporting SSSOMmappings.jsonl.gz
- loadable through SeMRA using semra.from_jsonl
sh run_on_docker.sh
from the command lineMappings are licensed according to their primary resources. These are explicitly annotated in the SSSOM file on each row (when available) and on the mapping set level in the Neo4j graph database artifacts.