License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically.
We distribute here the datasets used in the tests for the paper:
«Beyond Macrobenchmarks: Microbenchmark-based Graph Database Evaluation.»
by Lissandrini, Matteo; Brugnara, Martin; and Velegrakis, Yannis.
In PVLDB, 12(4):390-403, 2018.
From the official webpage: https://graphbenchmark.com/
The original files were stored on Google Drive, which is being discontinued.
The datasets used in the tests are stored in GraphSON format for the versions of the engines supporting TinkerPop 3. Systems using TinkerPop 2 use GraphSON 1.0 instead. Our datasets can easily be converted to a newer or older version; for an example, see our Docker image.
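A downloaded file can be sanity-checked without installing a graph engine. The Python sketch below is a minimal example; it assumes the file is the newline-delimited adjacency GraphSON written by TinkerPop's GraphSONWriter (one vertex per line, with its incident edges inlined), and the file name is only a placeholder.

```python
import gzip
import json

def unwrap(obj):
    # GraphSON 3.0 wraps typed values as {"@type": ..., "@value": ...};
    # strip the wrapper when present so the same code also handles GraphSON 1.0.
    if isinstance(obj, dict) and "@value" in obj:
        return obj["@value"]
    return obj

nodes = 0
edges = 0
# The file name is a placeholder; any of the .json.gz dumps listed below should work.
with gzip.open("yeast.json.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        vertex = unwrap(json.loads(line))
        nodes += 1
        # Each vertex line inlines its outgoing edges under "outE", keyed by edge label.
        for edge_list in unwrap(vertex.get("outE", {})).values():
            edges += len(unwrap(edge_list))

print(f"{nodes} vertices, {edges} edges (each edge counted once, on its out-vertex)")
```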
The MiCo Dataset comes from the authors of GraMi
For more details, you can read:
«GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph»
by Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis.
In PVLDB, 7(7):517-528, 2014.
The Yeast dataset has been converted from the version transformed into Pajek format by V. Batagelj. The original dataset comes from:
«Topological structure analysis of the protein-protein interaction network in budding yeast»
by Shiwei Sun, Lunjiang Ling, Nan Zhang, Guojie Li and Runsheng Chen.
In Nucleic Acids Research, 2003, Vol. 31, No. 9, 2443-2450.
Moreover, you can read about the details of our Freebase ExQ datasets, or you can use our Docker image to generate the LDBC synthetic dataset.
| Name | Files | Size (uncompressed / gzipped) | Graph Size (Nodes / Edges) |
|---|---|---|---|
| Yeast | yeast.json, yeast.json.gz | 1.5M / 180K | 2.3K / 7.1K |
| MiCo | mico.json, mico.json.gz | 84M / 12M | 0.1M / 1.1M |
| Frb-O | freebase_org.json, freebase_org.json.gz | 584M / 81M | 1.9M / 4.3M |
| Frb-S | freebase_small.json, freebase_small.json.gz | 87M / 12M | 0.5M / 0.3M |
| Frb-M | freebase_medium.json, freebase_medium.json.gz | 816M / 117M | 4M / 3.1M |
| Frb-L | freebase_large.json, freebase_large.json.gz | 6.3G / 616M | 28.4M / 31.2M |
| LDBC | ldbc.json, ldbc.json.gz | 144M / 13M | 0.18M / 1.5M |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically.
Structured data characterizing selected avian conservation aspects of North Carolina's Wildlife Action Plans were already encoded in a Semantic MediaWiki database (http://wiki.ncpif.org/). That database was created, and is maintained by, the North Carolina Partners in Flight (NC PIF) program, which is a program of the North Carolina Wildlife Resources Commission. The NC PIF wiki database was ported into a Neo4j labeled property graph database for an experiment in linking avian species, organizations, geographies, and management plans. This JSON file is an export from that Neo4j database.
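The exact layout of the export is not described here. If it is the newline-delimited JSON produced by Neo4j's apoc.export.json procedures, a minimal Python sketch for splitting it back into nodes and relationships could look like the following; the file name and the apoc-style layout are assumptions.

```python
import json
from collections import Counter

# A minimal sketch, assuming the file is the newline-delimited JSON produced by
# Neo4j's apoc.export.json procedures (one object per line, each carrying a
# "type" field of "node" or "relationship"). The file name is a placeholder.
nodes, rels = [], []
with open("ncpif_export.json", encoding="utf-8") as fh:
    for line in fh:
        obj = json.loads(line)
        (nodes if obj.get("type") == "node" else rels).append(obj)

print(len(nodes), "nodes,", len(rels), "relationships")
# Distribution of node labels (e.g. species, organizations, geographies, plans)
print(Counter(label for n in nodes for label in n.get("labels", [])))
```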
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
A Hadith is a report describing the words, actions, intentions or habits of the last Prophet and Messenger Muhammed (Peace Be Upon Him). The term literally means report, account or narrative.
Ṣaḥīḥ al-Bukhārī (صحيح البخاري) is one of the six major hadith collections. It was collected by the Muslim scholar Imam Muhammad al-Bukhari, after being transmitted orally for generations. There are 7,658 full hadiths in this collection, narrated by 1,755 narrators/transmitters.
Imam Bukhari finished his work around 846 A.D.
The two main sources of data regarding hadith are works of hadith and works containing biographical information about hadith narrators. The dataset contains 7,658 hadiths in Arabic and the names of 1,755 transmitters. Imam Bukhari applied the following criteria to include a hadith in this book.
Quality and soundness of the chain of narrators - the lifetime of a narrator should overlap with the lifetime of the authority from whom he narrates.
Verifiable - it should be verifiable that narrators have met with their source persons. They should also expressly state that they obtained the narrative from these authorities.
Piousness – he accepted narratives only from those who, according to his knowledge, not only believed in Islam but also practiced its teachings.
More information on Hadith and Sahih Bukhari can be found at this link: Hadith Books
Here are some ideas worth exploring:
The traditional criteria for determining whether a hadith is Sahih (authentic) require an uninterrupted chain of narrators, that all those narrators be highly reliable, and that there be no hidden defects. Can we make a social network graph of all the narrators and then timestamp it with their age and era to see who overlaps with whom? (A minimal sketch of this idea appears after this list.)
The name of a transmitter mentioned in a given hadith is not the full name, and many transmitters have similar names. So identifying the transmitter of a given hadith based on the names mentioned in the text might be a good problem to tackle.
Can we analyze the chains of transmitters for entire collections using Neo4j or some other graph database?
There exist different chains that report the same hadith with slight variations in wording; can you identify those?
Can you link the text with other external data sources?
Can we produce the word cloud for each chapter of the book?
Can we train a neural network to authenticate if the hadith is real or not?
Can we find out the specific style or vocabulary of each narrator?
Can we develop a system for comparing variant wordings of the same hadith to identify how reliable a given transmitter is?
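As a concrete starting point for the narrator-network idea above, here is a minimal Python sketch. It assumes a hypothetical CSV export with one row per hadith and a semicolon-separated chain of transmitters; the file and column names are illustrative only, not the actual schema of this dataset.

```python
import csv
import networkx as nx

# Hypothetical input: one row per hadith, with a 'chain' column listing the
# transmitters in narration order, separated by semicolons.
G = nx.DiGraph()
with open("sahih_bukhari.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        narrators = [n.strip() for n in row["chain"].split(";") if n.strip()]
        for src, dst in zip(narrators, narrators[1:]):
            G.add_edge(src, dst)

print(G.number_of_nodes(), "narrators,", G.number_of_edges(), "transmission links")
# Transmitters that the largest number of chains pass through
print(sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10])
```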
Please also help me extend this dataset. If you have any other hadith book in CSV or text format, please send me a message and I will add it.
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically.
YAGO 3 combines the information from the Wikipedias in multiple languages with WordNet, GeoNames, and other data sources. YAGO 3 taps into the multilingual resources of Wikipedia, getting to know more local entities and facts. This version has been extracted from 10 different Wikipedia versions (English, German, French, Dutch, Italian, Spanish, Polish, Romanian, Persian, and Arabic). YAGO 3 is special in several ways:
* YAGO 3 combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
* YAGO 3 is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
* In addition to a taxonomy, YAGO has thematic domains such as “music” or “science” from WordNet Domains.
* YAGO 3 extracts and combines entities and facts from 10 Wikipedias in different languages.
* YAGO 3 contains canonical representations of entities appearing in different Wikipedia language editions.
* YAGO 3 integrates all non-English entities into the rich type taxonomy of YAGO.
* YAGO 3 provides a mapping between non-English infobox attributes and YAGO relations.

YAGO 3 knows more than 17 million entities (like persons, organizations, cities, etc.) and contains more than 150 million facts about these entities. As with all major releases, the accuracy of YAGO 3 has been manually evaluated, confirming an accuracy of 95%. Every relation is annotated with its confidence value.
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically.
YAGO 4 is a version of the YAGO knowledge base that is based on Wikidata, the largest public general-purpose knowledge base. YAGO refines the data as follows:
* All entity identifiers and property identifiers are human-readable.
* The top-level classes come from schema.org, a standard repertoire of classes and properties maintained by Google and others, combined with bioschemas.org. The lower-level classes are a selection of Wikidata classes.
* The properties come from schema.org.
* YAGO 4 contains semantic constraints in the form of SHACL. These constraints keep the data clean and allow for logical reasoning on YAGO.

YAGO contains more than 50 million entities and 2 billion facts.
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically.
YAGO 2 is an improved version of the original YAGO knowledge base:
* YAGO 2 is anchored in time and space. YAGO 2 attaches a temporal dimension and a spatial dimension to many of its facts and entities.
* YAGO 2 is particularly suited for disambiguation purposes, as it contains a large number of names for entities. It also knows the gender of people.

As with all major releases, the accuracy of YAGO 2 has been manually evaluated, proving an accuracy of 95% with respect to Wikipedia. Every relation is annotated with its confidence value.
Terms: Reddit API terms, https://www.reddit.com/wiki/api
This dataset aims to build a graph of subreddit links based on how they reference each other. The original database dump can be found here.
name (str): name of the subreddit.
type (str): type of the subreddit.
title (str): title of the subreddit.
description (str): short description of the subreddit.
subscribers (int?): amount of subscribers at the moment.
nsfw (bool?): indicator if it is flagged as not safe for work 🔞.
quarantined (bool?): indicator if it has been quarantined 😷.
color (str): key color of the subreddit.
img_banner (str?): url of the image used as the banner.
img_icon (str?): url of the image used as the icon (snoo).
created_at (datetime): utc timestamp of when the subreddit was created.
updated_at (datetime): utc timestamp of when the information of the subreddit was last updated.
note: the '?' indicates that the value can be null under certain conditions.
| TYPE | AMOUNT |
|---|---|
| TOTAL | 127800 |
| public | 59227 |
| banned | 31473 |
| restricted | 14601 |
| public [nsfw] | 14244 |
| private | 5139 |
| restricted [nsfw] | 3014 |
| public [quarantined] | 29 |
| restricted [quarantined] | 21 |
| archived | 17 |
| premium | 12 |
| public [nsfw] [quarantined] | 11 |
| user [nsfw] | 6 |
| user | 4 |
| restricted [nsfw] [quarantined] | 1 |
| employees | 1 |
source (str): name of the subreddit where the link was found.
target (str): name of the linked subreddit.
type (str): place where the reference from source to target was found.
updated_at (datetime): utc timestamp of when the information of the link was last updated.

| TYPE | AMOUNT |
|---|---|
| TOTAL | 349744 |
| wiki | 214206 |
| sidebar | 123650 |
| topbar | 7291 |
| description | 4597 |
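To turn the node and edge lists described above into an actual graph, a minimal Python sketch is shown below. The file names are assumptions about how the dump is packaged; the column names follow the field descriptions above.

```python
import networkx as nx
import pandas as pd

# File names are assumptions; the columns follow the node and edge field lists above.
subreddits = pd.read_csv("subreddits.csv")
links = pd.read_csv("links.csv")

G = nx.MultiDiGraph()  # multigraph: a pair of subreddits can be linked from several places
for _, row in subreddits.iterrows():
    G.add_node(row["name"], type=row["type"], nsfw=row["nsfw"],
               subscribers=row["subscribers"])
for _, row in links.iterrows():
    G.add_edge(row["source"], row["target"], link_type=row["type"])

print(G.number_of_nodes(), "subreddits,", G.number_of_edges(), "links")
```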
License: GNU Free Documentation License, https://www.gnu.org/copyleft/fdl.html
This is the 2008 version of YAGO. It knows more than 2 million entities (like persons, organizations, cities, etc.) and 20 million facts about these entities. This version of YAGO includes the data extracted from the categories and infoboxes of Wikipedia, combined with the taxonomy of WordNet. YAGO 1 was manually evaluated and found to have an accuracy of 95% with respect to the extraction source.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically.
SynthCypher Dataset Repository
Overview
This repository hosts SynthCypher, a novel synthetic dataset designed to bridge the gap in Text-to-Cypher (Text2Cypher) tasks. SynthCypher leverages state-of-the-art large language models (LLMs) to automatically generate and validate high-quality data for training and evaluating models that convert natural language questions into Cypher queries for graph databases like Neo4j. Our dataset and pipeline contribute significantly to… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow-AI/SynthCypher.
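For a first look at the data, the repository can be loaded with the Hugging Face datasets library. This is a minimal sketch under the assumption that the dataset loads with the standard API; split and field names may differ, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Repository id taken from the link above; splits and per-example fields are assumptions.
ds = load_dataset("ServiceNow-AI/SynthCypher")
print(ds)                       # available splits and features

first_split = next(iter(ds))    # e.g. "train" (assumed)
print(ds[first_split][0])       # one natural-language question / Cypher query pair
```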