9 datasets found

Neo4j open measurment
A graph database with 193 million synonyms
kaggle.com
zip
Updated Feb 15, 2023
Tom Nijhof-Verhees (2023). Neo4j open measurment [Dataset]. https://www.kaggle.com/datasets/wagenrace/neo4j-open-measurment
Available download formats: zip (29854808766 bytes)
Dataset updated
Feb 15, 2023
Authors
Tom Nijhof-Verhees
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Kickstart a chemical graph database

I have spent some time scraping and shaping PubChem data into a Neo4j graph database. The whole process took weeks, mainly spent downloading the data and loading it into Neo4j. If you want to build your own, I will show you how to download mine and set it up in less than an hour (most of the time you’ll just have to wait). How this dataset was created is described in the following blogs:
- https://medium.com/@nijhof.dns/exploring-neodash-for-197m-chemical-full-text-graph-e3baed9615b8
- https://medium.com/neo4j/combining-3-biochemical-datasets-in-a-graph-database-8e9aafbb5788
- https://medium.com/p/d9ee9779dfbe

What do you get?

The full database is a merge of 3 datasets: PubChem (compounds + synonyms), NCI60 (GI50), and ChEMBL (cell lines). It contains 6 node types of interest:
● Compound: This is related to a compound of PubChem. It has 1 property.
  ○ pubChemCompId: The id within PubChem. So “compound:cid162366967” links to https://pubchem.ncbi.nlm.nih.gov/compound/162366967. This number can be used with both PubChem RDF and PUG.
● Synonym: A name found in the literature. This name can refer to zero, one, or more compounds. This helps find relations between natural language names and the absolute compounds they are related to.
  ○ Name: Natural language name. Can contain letters, spaces, numbers, and any other Unicode character.
  ○ pubChemSynId: PubChem synonym id as used within the RDF.
● CellLine: These are the ChEMBL cell lines. They hold a lot of information.
  ○ Name: The name of the cell line.
  ○ Uri: A unique URI for every element within the ChEMBL RDF.
  ○ cellosaurusId: The id to connect it to the Cellosaurus dataset. This is one of the most extensive cell line datasets out there.
● Measurement: A measurement you can do within a biomedical experiment. Currently, only GI50 (the concentration needed for Growth Inhibition of 50%) is added.
  ○ Name: Name of the measurement.
● Condition: A single condition of an experiment. A condition is part of an experiment. Examples are: an individual of the control group, a sample with drug A, or a sample with more CO2.
● Experiment: A collection of multiple conditions all done at the same time with the same bias, meaning we assume all uncontrolled variables are the same.
  ○ Name: Name of experiment.
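The mapping from a pubChemCompId to its PubChem page, as described for the Compound node above, can be sketched in a few lines of Python. This is only an illustration: the function name `pubchem_url` is hypothetical, and it assumes every id follows the “compound:cid<number>” pattern shown in the example.

```python
import re

# URL pattern taken from the example in the dataset description.
PUBCHEM_COMPOUND_URL = "https://pubchem.ncbi.nlm.nih.gov/compound/{cid}"


def pubchem_url(pub_chem_comp_id: str) -> str:
    """Turn an id such as 'compound:cid162366967' into its PubChem page URL.

    Assumption: ids always look like 'compound:cid' followed by digits.
    """
    match = re.fullmatch(r"compound:cid(\d+)", pub_chem_comp_id.strip())
    if match is None:
        raise ValueError(f"unexpected id format: {pub_chem_comp_id!r}")
    return PUBCHEM_COMPOUND_URL.format(cid=match.group(1))


if __name__ == "__main__":
    print(pubchem_url("compound:cid162366967"))
```

Raising an error on unexpected input, rather than guessing, keeps malformed ids from silently producing broken links.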

Figure: Overview of the graph design

How to download it

Warning: you need 120 GB of free disk space. The compressed file you download is already 30 GB, the uncompressed file is another 30 GB, and the database afterward is 60 GB. So 60 GB is needed only for temporary files; the other 60 GB is for the database. If you do this on an HDD it will be slow.

If you load this into Neo4j Desktop as a local database (like I do), it will scream and yell at you; just ignore this. We are pushing it much further than it was designed for, but it will still work.

Download the file

Go to this Kaggle dataset and download the dump file. Unzip the file, then delete the zipped file. This part needs 60 GB but only takes 30 GB by the end of it.

Create a database

Open the Neo4j Desktop app and click “Reveal files in File Explorer”. Move the .dump file you downloaded into this folder.

Click on the ... next to the .dump file and click Create new DBMS from dump. This database is a dump from Neo4j V4, so your database also needs to be V4.x.x!

It will now create the database. This will take a long time; it might even say it has timed out. Do not believe this lie! In the background, it is still running. Every time you start it, it will time out. Just let it run and press start again later. The second time it will start up directly.

Every time I start it up I get the timed-out error. After waiting 10 minutes and clicking start again, the database, and with it more than 200 million nodes, is ready. And you are done! Good luck, and let me know what you build with it.
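The preparation steps above (checking for 120 GB of free space, unzipping the download, then deleting the zip) could be scripted roughly as follows. This is a sketch, not part of the original instructions: the file name `archive.zip` and the helper names are placeholders, and the Neo4j Desktop steps still have to be done by hand.

```python
import shutil
import zipfile
from pathlib import Path

GB = 1000 ** 3  # decimal gigabytes


def has_free_space(folder: Path, required_gb: float) -> bool:
    """Return True if the disk holding `folder` has at least `required_gb` free."""
    return shutil.disk_usage(folder).free >= required_gb * GB


def unzip_and_delete(zip_path: Path, target_dir: Path) -> list:
    """Extract the archive into `target_dir`, then delete the zip to reclaim space."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    zip_path.unlink()
    return names


if __name__ == "__main__":
    download = Path("archive.zip")  # placeholder name for the Kaggle download
    workdir = Path(".")
    if not has_free_space(workdir, 120):
        print("Less than 120 GB free; the import will probably fail.")
    elif download.exists():
        print("Extracted:", unzip_and_delete(download, workdir))
```

Deleting the zip only after a successful extraction means a failed unzip does not cost you the 30 GB download.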
Dataset used for "A Recommender System of Buggy App Checkers for App Store...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jun 28, 2021
Maria Gomez; Romain Rouvoy; Martin Monperrus; Lionel Seinturier (2021). Dataset used for "A Recommender System of Buggy App Checkers for App Store Moderators" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5034291
Dataset updated
Jun 28, 2021
Dataset provided by
University of Lille / Inria
Authors
Maria Gomez; Romain Rouvoy; Martin Monperrus; Lionel Seinturier
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for the paper "A Recommender System of Buggy App Checkers for App Store Moderators", published at the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.

Dataset Collection

We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store in January and March 2014. Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we updated the dataset twice. The dataset D1 contains the apps available in the Google Play Store in January 2014. Then, we created a new snapshot (D2) of the Google Play Store in March 2014.

The apps belong to the 27 different categories defined by Google (at the time of writing the paper), and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g. tools-free, tools-paid, sports-new_free, etc.), we collected a maximum of 500 samples, resulting in a median of 1,978 apps per category.

For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star counting, and the set of permission requests.

In addition, for each app, we collected up to the latest 500 reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, so several reviews lack some of these details. From all the reviews attached to an app, we only considered the reviews associated with the latest version of the app, i.e., we discarded unversioned and old-versioned reviews. This resulted in a corpus of 1,402,717 reviews (January 2014).

Dataset Stats

Some stats about the datasets:

D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.

D2 (Mar. 2014) contains 46,644 apps and 9,319 different permission requests, and 1,361,319 user reviews.

Additional stats about the datasets are available here.

Dataset Description

To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because the graph visualization helps to identify connections among data (e.g., clusters of apps sharing similar sets of permission requests).

In particular, our dataset graph contains six types of nodes:
- APP nodes containing metadata of each app,
- PERMISSION nodes describing permission types,
- CATEGORY nodes describing app categories,
- SUBCATEGORY nodes describing app subcategories,
- USER_REVIEW nodes storing user reviews,
- TOPIC nodes describing topics mined from user reviews (using LDA).

Furthermore, there are five types of relationships between these nodes:
- USES_PERMISSION between APP and PERMISSION nodes
- HAS_REVIEW between APP and USER_REVIEW nodes
- HAS_TOPIC between USER_REVIEW and TOPIC nodes
- BELONGS_TO_CATEGORY between APP and CATEGORY nodes
- BELONGS_TO_SUBCATEGORY between APP and SUBCATEGORY nodes
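To make the schema above concrete, here is a small Python sketch that builds a Cypher query counting how many apps use each permission. The labels and the relationship type come from the description; the relationship direction (APP to PERMISSION), the helper name `top_permissions_query`, and the idea of ranking permissions are assumptions made for illustration, and no property names are used because the description does not list them.

```python
def top_permissions_query(limit: int = 10) -> str:
    """Build a Cypher query listing the permissions requested by the most apps.

    Assumption: USES_PERMISSION points from APP to PERMISSION.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "MATCH (a:APP)-[:USES_PERMISSION]->(p:PERMISSION)\n"
        "RETURN p, count(a) AS apps\n"
        "ORDER BY apps DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Paste the printed query into the Neo4j browser, or run it through a driver.
    print(top_permissions_query(5))
```

If the relationship turns out to be stored in the other direction, dropping the arrowhead (`-[:USES_PERMISSION]-`) makes the pattern direction-agnostic.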

Dataset Files Info

Neo4j 2.0 Databases

googlePlayDB1-Jan2014_neo4j_2_0.rar

googlePlayDB2-Mar2014_neo4j_2_0.rar

We provide two Neo4j databases containing the 2 snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. The databases were created with Neo4j 2.0, specifically 'Neo4j 2.0.0-M06 Community Edition' (the latest version available at the time of implementing the paper in 2014).

Neo4j 3.5 Databases

googlePlayDB1-Jan2014_neo4j_3_5_28.rar

googlePlayDB2-Mar2014_neo4j_3_5_28.rar

Currently, Neo4j 2.0 is deprecated and is not available for download in the official Neo4j Download Center. We have migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28. The databases can be opened with 'Neo4j Community Edition 3.5.28', which can be downloaded from the official Neo4j Download page.

In order to open the databases with more recent versions of Neo4j, the databases must first be migrated to the corresponding version. Instructions for the migration process can be found in the Neo4j Migration Guide.

The first time you connect to the Neo4j database, it may request credentials. The username and password are: neo4j/neo4j
Twitter Graph Example v2 43
kaggle.com
zip
Updated Jun 29, 2022
Mathias Weiß (2022). Twitter Graph Example v2 43 [Dataset]. https://www.kaggle.com/datasets/weissmedia/twitter-graph-example-v2-43
Available download formats: zip (17943518 bytes)
Dataset updated
Jun 29, 2022
Authors
Mathias Weiß
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This project is inspired by https://github.com/neo4j-graph-examples/twitter-v2.

Twitter Graph

Show data from your personal Twitter account

The Graph Your Network application inserts your Twitter activity into Neo4j.

Figure: Twitter data model (twitter-data-model.svg)

Content

~10 MB of graph data (CSV)

43,325 nodes with the labels:
- Hashtag
- Link
- Me
- Source
- Tweet
- User

57,896 relationships of the types:
- AMPLIFIES
- CONTAINS
- FOLLOWS
- INTERACTS_WITH
- MENTIONS
- POSTS
- REPLY_TO
- RETWEETS
- RT_MENTIONS
- SIMILAR_TO
- TAGS
- USING
Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and...
data.niaid.nih.gov
Updated Aug 3, 2024
Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V. (2024). Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and KDE [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_400614
Dataset updated
Aug 3, 2024
Dataset provided by
Ryerson University
Authors
Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V.
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousand defect reports over a period of 18 years (1999-2017) and capture the inter-relationships among duplicate defects.

File Descriptions

apache.csv - Apache Defect Rediscovery dataset

eclipse.csv - Eclipse Defect Rediscovery dataset

kde.csv - KDE Defect Rediscovery dataset

apache.relations.csv - Inter-relations of rediscovered defects of Apache

eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse

kde.relations.csv - Inter-relations of rediscovered defects of KDE

create_and_populate_neo4j_objects.cypher - Populates Neo4j graphDB by importing all the data from the CSV files. Note that you have to set dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping

create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files

rediscovery_db_mysql.zip - For your convenience, we also provide full backup of the MySQL database

neo4j_examples.txt - Sample Neo4j queries

mysql_examples.txt - Sample MySQL queries

rediscovery_eclipse_6325.png - Output of Neo4j example #1

distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
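The configuration step mentioned for create_and_populate_neo4j_objects.cypher (setting dbms.import.csv.legacy_quote_escaping to false) can be applied by editing neo4j.conf by hand. As a sketch only, the helper below does the same edit on the text of a key=value configuration file; the function name `set_conf_option` is hypothetical, and locating and rewriting the actual neo4j.conf file is left to the reader.

```python
def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Return `conf_text` with `key=value` set exactly once.

    An existing line for the key (commented out with '#' or not) is replaced;
    if the key is absent, the setting is appended at the end.
    """
    new_line = f"{key}={value}"
    out = []
    found = False
    for line in conf_text.splitlines():
        candidate = line.lstrip("#").strip()
        if candidate.split("=", 1)[0].strip() == key:
            if not found:
                out.append(new_line)
                found = True
            continue  # drop any further duplicates of the same key
        out.append(line)
    if not found:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    original = "# example config\n#dbms.import.csv.legacy_quote_escaping=true\n"
    print(set_conf_option(original, "dbms.import.csv.legacy_quote_escaping", "false"))
```

Neo4j has to be restarted after the configuration file changes for the setting to take effect.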
CIS Graph Database and Model
figshare.com
pdf
Updated Sep 6, 2023
Stanislava Gardasevic (2023). CIS Graph Database and Model [Dataset]. http://doi.org/10.6084/m9.figshare.21663401.v4
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.21663401.v4
Dataset updated
Sep 6, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Stanislava Gardasevic
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is based on the model developed with the Ph.D. students of the Communication and Information Sciences Ph.D. program at the University of Hawaii at Manoa, intended to help new students get relevant information. The model was first presented at the iConference 2023, in the paper "Community Design of a Knowledge Graph to Support Interdisciplinary Ph.D. Students" by Stanislava Gardasevic and Rich Gazan (available at: https://scholarspace.manoa.hawaii.edu/server/api/core/bitstreams/9eebcea7-06fd-4db3-b420-347883e6379e/content).

The database is created in Neo4j, and the .dump file can be imported to the cloud instance of this software. The dataset (.dump) contains publicly available data collected from multiple web locations and indexes of a sample of publications from the people in this domain. In addition, it contains my (the first author's) personal graph demonstrating progress through a student's program in this degree, and the activities they have done while in the program. This dataset was made possible with the huge help of my collaborator, Petar Popovic, who ingested the data into the database.

The model and dataset were developed with the end users involved in the design and are based on the actual information needs of a population. The dataset is intended to allow researchers to investigate multigraph visualization of the data modeled by this model.

The knowledge graph was evaluated with the CIS student population, and the study results show that it is very helpful for decision-making, information discovery, and identification of people in one's surroundings who might be good collaborators or information points. We provide the .json file containing the Neo4j Bloom perspective with the styling and queries used in these evaluation sessions.
Managed Neo4j Services Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 23, 2025
Growth Market Reports (2025). Managed Neo4j Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/managed-neo4j-services-market
Available download formats: pdf, csv, pptx
Dataset updated
Aug 23, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Managed Neo4j Services Market Outlook

According to our latest research, the global managed Neo4j services market size reached USD 423 million in 2024, reflecting robust demand for graph database solutions across diverse industries. The market is projected to expand at a CAGR of 20.1% from 2025 to 2033, reaching a forecasted value of USD 2.23 billion by 2033. This remarkable growth trajectory is driven by the increasing adoption of connected data analytics, rising digital transformation initiatives, and the need for scalable, flexible, and managed database solutions across enterprises worldwide.
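The forecast above follows the usual compound-growth formula, value = start × (1 + CAGR)^years. The sketch below recomputes it as a sanity check, assuming nine compounding years from the 2024 base to 2033; the report does not state its exact convention, and the function name is illustrative. Under that assumption the result is roughly USD 2.2 billion, in line with the quoted USD 2.23 billion allowing for rounding of the growth rate.

```python
def project_value(start: float, cagr: float, years: int) -> float:
    """Compound `start` forward by `years` periods at annual growth rate `cagr`."""
    return start * (1.0 + cagr) ** years


if __name__ == "__main__":
    base_2024_musd = 423.0   # USD million, from the report summary
    cagr = 0.201             # 20.1% per year, from the report summary
    years = 2033 - 2024      # assumption: nine compounding periods
    projected = project_value(base_2024_musd, cagr, years)
    print(f"Projected 2033 market size: USD {projected / 1000:.2f} billion")
```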

One of the primary growth factors fueling the managed Neo4j services market is the exponential rise in data complexity and interconnectedness within enterprise environments. Organizations are increasingly recognizing the limitations of traditional relational databases in handling highly connected data, such as social networks, fraud detection, recommendation engines, and supply chain management. Managed Neo4j services, leveraging the power of graph databases, enable businesses to model, store, and analyze complex relationships efficiently. The growing need for real-time insights, enhanced customer experiences, and advanced analytics capabilities is pushing enterprises to adopt managed Neo4j solutions, as these services offer seamless integration, scalability, and expert support for mission-critical applications.

Another significant driver for the managed Neo4j services market is the widespread shift towards cloud-based and hybrid IT infrastructures. As organizations migrate their workloads to the cloud, managed services become essential for ensuring optimal performance, security, and cost-effectiveness. Managed Neo4j providers offer end-to-end solutions, including consulting, implementation, support, and training, which alleviate the burden on internal IT teams and accelerate time-to-value. The increasing prevalence of multi-cloud strategies, combined with the need for high availability and disaster recovery, further enhances the appeal of managed Neo4j services. Enterprises are also prioritizing compliance and data governance, and managed service providers are well-positioned to deliver solutions that meet regulatory requirements while enabling innovation.

The managed Neo4j services market is also benefiting from the surge in artificial intelligence, machine learning, and big data analytics initiatives across industries. Graph databases like Neo4j are uniquely suited to support advanced analytics use cases, such as knowledge graphs, identity and access management, and network analysis. As organizations seek to unlock the value of their data assets, managed Neo4j services provide the expertise, tools, and ongoing support needed to deploy and scale graph-based applications. The rise of digital ecosystems, IoT integration, and API-driven architectures is further expanding the addressable market for managed Neo4j services, as enterprises aim to stay competitive in a rapidly evolving digital landscape.

From a regional perspective, North America continues to dominate the managed Neo4j services market, accounting for the largest share in 2024, driven by early technology adoption, a mature IT services sector, and strong investments in data-driven initiatives. However, Asia Pacific is emerging as the fastest-growing region, with a projected CAGR exceeding 24% during the forecast period, fueled by rapid digitalization, expanding cloud adoption, and government-led innovation programs. Europe, Latin America, and the Middle East & Africa are also witnessing increased demand for managed Neo4j solutions, as enterprises across these regions embrace graph databases to enhance operational efficiency, customer engagement, and compliance.

Service Type Analysis

The managed Neo4j services market is segmented by service type into consulting, implementation, support & maintenance, and training. Consulting services represent a critical entry point for organizations embarking on their
DataSheet1_Threat modelling in Internet of Things (IoT) environments using...
frontiersin.figshare.com
zip
Updated May 30, 2024
Marwa Salayma (2024). DataSheet1_Threat modelling in Internet of Things (IoT) environments using dynamic attack graphs.ZIP [Dataset]. http://doi.org/10.3389/friot.2024.1306465.s001
Available download formats: zip
Unique identifier
https://doi.org/10.3389/friot.2024.1306465.s001
Dataset updated
May 30, 2024
Dataset provided by
Frontiers
Authors
Marwa Salayma
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This work presents a threat modelling approach to represent changes to the attack paths through an Internet of Things (IoT) environment when the environment changes dynamically, that is, when new devices are added or removed from the system or when whole sub-systems join or leave. The proposed approach investigates the propagation of threats using attack graphs, a popular attack modelling method. However, traditional attack-graph approaches have been applied in static environments that do not continuously change, such as enterprise networks, leading to static and usually very large attack graphs. In contrast, IoT environments are often characterised by dynamic change and interconnections; different topologies for different systems may interconnect with each other dynamically and outside the operator’s control. Such new interconnections lead to changes in the reachability amongst devices according to which their corresponding attack graphs change. This requires dynamic topology and attack graphs for threat and risk analysis. This article introduces an example scenario based on healthcare systems to motivate the work and illustrate the proposed approach. The proposed approach is implemented using a graph database management tool (GDBM), Neo4j, which is a popular tool for mapping, visualising, and querying the graphs of highly connected data. It is efficient in providing a rapid threat modelling mechanism, making it suitable for capturing security changes in the dynamic IoT environment. Our results show that our developed threat modelling approach copes with dynamic system changes that may occur in IoT environments and enables identifying attack paths, whilst allowing for system dynamics. The developed dynamic topology and attack graphs can cope with the changes in the IoT environment efficiently and rapidly by maintaining their associated graphs.
GPU Database Market by Deployment and Geography - Forecast and Analysis...
technavio.com
pdf
Updated Oct 19, 2021
Technavio (2021). GPU Database Market by Deployment and Geography - Forecast and Analysis 2021-2025 [Dataset]. https://www.technavio.com/report/gpu-database-market-industry-analysis
Available download formats: pdf
Dataset updated
Oct 19, 2021
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2020 - 2025
Description
The GPU database market share should rise by USD 361.56 million from 2021 to 2025 at a CAGR of 17.82%.

This GPU database market research report provides valuable insights on the post COVID-19 impact on the market, which will help companies evaluate their business approaches. Furthermore, this report extensively covers market segmentation by deployment (on-premise and cloud) and geography (North America, Europe, APAC, South America, and MEA). The GPU database market report also offers information on several market vendors, including BlazingSQL Inc., Brytlyt Ltd., Hetero DB Co. Ltd., Jedox GmbH, Kinetica DB Inc., Neo4j Inc., NVIDIA Corp., OmniSci Inc., SQream Technologies Ltd., and Zilliz among others.

GPU Database Market: Key Drivers and Trends

The massive data generation across various industries supporting the adoption of GPU accelerated tools is notably driving the GPU database market growth, although factors such as unavailability of enough technical expertise and domain knowledge may impede market growth. Our research analysts have studied the historical data and deduced the key market drivers and the COVID-19 pandemic impact on the GPU database industry. The holistic analysis of the drivers will help in predicting end goals and refining marketing strategies to gain a competitive edge.

This GPU database market analysis report also provides detailed information on other upcoming trends and challenges that will have a far-reaching effect on the market growth. The actionable insights on the trends and challenges will help companies evaluate and develop growth strategies for 2021-2025.

Who are the Major GPU Database Market Vendors?

The report analyzes the market’s competitive landscape and offers information on several market vendors, including:

BlazingSQL Inc.
Brytlyt Ltd.
Hetero DB Co. Ltd.
Jedox GmbH
Kinetica DB Inc.
Neo4j Inc.
NVIDIA Corp.
OmniSci Inc.
SQream Technologies Ltd.
Zilliz

The vendor landscape of the GPU database market entails successful business strategies deployed by the vendors. The GPU database market is fragmented and the vendors are deploying various organic and inorganic growth strategies to compete in the market.

To make the most of the opportunities and recover from post COVID-19 impact, market vendors should focus more on the growth prospects in the fast-growing segments, while maintaining their positions in the slow-growing segments.

Download a free sample of the GPU database market forecast report for insights on complete key vendor profiles. The profiles include information on the production, sustainability, and prospects of the leading companies.

Which are the Key Regions for GPU Database Market?

48% of the market’s growth will originate from North America during the forecast period. The US is the key market for GPU databases in North America.

The report offers an up-to-date analysis of the geographical composition of the market, competitive intelligence, and regional opportunities in store for vendors. North America has been recording a significant growth rate and is expected to offer several growth opportunities to market vendors during the forecast period. The growing demand for artificial intelligence (AI) will facilitate the GPU database market growth in North America over the forecast period.

What are the Revenue-generating Deployment Segments in the GPU Database Market?

The GPU database market share growth by the on-premise segment has been significant. This report provides insights on the impact of the unprecedented outbreak of COVID-19 on market segments. Through these insights, you can safely deduce transformation patterns in consumer behavior, which is crucial to gauge segment-wise revenue growth during 2021-2025 and embrace technologies to improve business efficiency.

This report provides an accurate prediction of the contribution of all the segments to the growth of the GPU database market size. Furthermore, our analysts have indicated actionable market insights on post COVID-19 impact on each segment, which is crucial to predict change in consumer demand.

GPU Database Market Scope

Report Coverage: Details
Page number: 120
Base year: 2020
Forecast period: 2021-2025
Growth momentum & CAGR: Accelerate at a CAGR of 17.82%
Market growth 2021-2025: $ 361.56 million
translated_text2cypher24_trainset_sampled
huggingface.co
Updated Nov 27, 2025
MGO (2025). translated_text2cypher24_trainset_sampled [Dataset]. https://huggingface.co/datasets/mgoNeo4j/translated_text2cypher24_trainset_sampled
Dataset updated
Nov 27, 2025
Authors
MGO
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Translated Text2Cypher'24 Training Set - Sampled & Multilingual

This dataset provides a sampled and translated training set based on the Neo4j Text2Cypher '24 dataset. It is designed to support research on multilingual natural language to Cypher query generation. We offer two versions of the training set:

1. Multilingual Version (multilang)

Total examples: ~36,000
Languages: English (en), Spanish (es), Turkish (tr)
Samples per language: ~12,000
Translation… See the full description on the dataset page: https://huggingface.co/datasets/mgoNeo4j/translated_text2cypher24_trainset_sampled.