Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DBPedia Ontology Classification Dataset, Version 2, updated 09/09/2015.

LICENSE: The DBpedia datasets are licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. For more information, please refer to http://dbpedia.org. For a recent overview paper about DBpedia, please refer to: Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, Vol. 6, No. 2, pp. 167–195, 2015.

The DBPedia ontology classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) and is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION: The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014; they are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples, so the training set contains 560,000 samples and the testing set 70,000.

The files train.csv and test.csv contain all the samples as comma-separated values. There are 3 columns: class index (1 to 14), title, and content. The title and content are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). There are no newlines in the title or content.

Classes: Company, EducationalInstitution, Artist, Athlete, OfficeHolder, MeanOfTransportation, Building, NaturalPlace, Village, Animal, Plant, Album, Film, WrittenWork.
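A minimal loading sketch for these splits, assuming the three-column, headerless CSV layout described above (pandas' default CSV dialect already handles the doubled-quote escaping):

```python
# Sketch: load train.csv / test.csv as described (class index 1-14, title, content).
import pandas as pd

columns = ["class_index", "title", "content"]
train = pd.read_csv("train.csv", header=None, names=columns)
test = pd.read_csv("test.csv", header=None, names=columns)

print(train.shape)                     # expected: (560000, 3)
print(train["class_index"].nunique())  # expected: 14
```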
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in Wikipedia. This is an extract of the data (after cleaning, kernel included) that provides taxonomic, hierarchical categories ("classes") for 342,782 Wikipedia articles. There are 3 levels, with 9, 70 and 219 classes respectively. A version of this dataset is a popular baseline for NLP/text classification tasks. This version of the dataset is much tougher, especially if the level 2 or level 3 classes are used as the targets.
This is an excellent benchmark for hierarchical multiclass/multilabel text classification. Some example approaches are included as code snippets.
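As one hedged example of such an approach (not necessarily the snippets bundled with the dataset), a per-level baseline can be built with TF-IDF features and logistic regression, trained independently for each of the three taxonomy levels. The file and column names below ("text", "l1", "l2", "l3") are assumptions and should be adjusted to the actual CSV layout:

```python
# Per-level baseline: TF-IDF + logistic regression, one classifier per taxonomy level.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("DBPEDIA_train.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df[["l1", "l2", "l3"]], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

for level in ["l1", "l2", "l3"]:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(Xtr, y_train[level])
    print(level, "accuracy:", round(accuracy_score(y_test[level], clf.predict(Xte)), 3))
```

A stronger hierarchical variant would condition the L2 and L3 classifiers on the predicted parent class, but the flat per-level setup above is a reasonable starting baseline.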
DBpedia dataset with multiple levels of hierarchy/classes, as a multiclass dataset.
Original DBpedia ontology (triples data): https://wiki.dbpedia.org/develop/datasets
Listing of the class tree/taxonomy: http://mappings.dbpedia.org/server/ontology/classes/
Thanks to the Wikimedia Foundation for creating Wikipedia, DBpedia and associated open-data goodness!
Thanks to my colleagues at SparkBeyond (https://www.sparkbeyond.com) for pointing me towards the taxonomical version of this dataset (as opposed to the classic 14-class version).
Try different NLP models.
Compare to the SOTA in text classification on DBpedia: https://paperswithcode.com/sota/text-classification-on-dbpedia
Traffic analytics, rankings, and competitive metrics for dbpedia.org as of September 2025
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Semantic Annotation for Tabular Data with DBpedia: Adapted SemTab 2019 with DBpedia 2016-10
CEA (Cell-Entity Annotation):
Keep only entities that are valid in DBpedia 2016-10
Resolve percentage encoding in entity URIs (see the sketch after this list)
Add missing redirect entities
CTA (Column-Type Annotation):
Keep only valid types
Resolve transitive types (parents and equivalent types of the specific type) with the DBpedia ontology 2016-10
CPA (Columns-Property Annotation):
Add equivalent properties
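The percentage-encoding step amounts to decoding percent-escapes in DBpedia entity URIs so that they match the canonical 2016-10 IRIs; a minimal sketch of the idea (not the authors' exact code):

```python
# Decode percent-escaped DBpedia URIs, e.g.
# http://dbpedia.org/resource/S%C3%A3o_Paulo -> http://dbpedia.org/resource/São_Paulo
from urllib.parse import unquote

def normalize_dbpedia_uri(uri: str) -> str:
    return unquote(uri)

print(normalize_dbpedia_uri("http://dbpedia.org/resource/S%C3%A3o_Paulo"))
```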
Statistics of the adapted SemTab 2019 tabular data (original vs. adapted target counts per round):

| Round | CEA original | CEA adapted | CEA change | CPA original | CPA adapted | CPA change | CTA original | CTA adapted | CTA change |
|---|---|---|---|---|---|---|---|---|---|
| Round 1 | 8418 | 8406 | -0.14% | 116 | 116 | 0.00% | 120 | 120 | 0.00% |
| Round 2 | 463796 | 457567 | -1.34% | 6762 | 6762 | 0.00% | 14780 | 14333 | -3.02% |
| Round 3 | 406827 | 406820 | 0.00% | 7575 | 7575 | 0.00% | 5762 | 5673 | -1.54% |
| Round 4 | 107352 | 107351 | 0.00% | 2747 | 2747 | 0.00% | 1732 | 1717 | -0.87% |
DBpedia 2016-10 extra resources (original dataset: http://downloads.dbpedia.org/2016-10/)
File: _dbpedia_classes_2016-10.csv
Information: DBpedia classes and their parent classes (the abstract types Agent and Thing are removed)
Total: 759 classes
Structure: class, parents (separated by spaces)
Example: "City","Location Place PopulatedPlace Settlement"
File: _dbpedia_properties_2016-10.csv
Information: DBpedia properties and their equivalent properties
Total: 2865 properties
Structure: property, its equivalent properties
Example: "restingDate","deathDate"
File: _dbpedia_domains_2016-10.csv
Information: DBpedia properties and their domain types
Total: 2421 properties (that have types as their domain)
Structure: property, type (domain)
Example: "deathDate","Person"
File: _dbpedia_entities_2016-10.jsonl.bz2
Information: DBpedia entity dump
Format: bz2-compressed JSON lines (.jsonl.bz2)
Source: DBpedia dump 2016-10 core
Total: 5,289,577 entities (no disambiguation entities)
Structure: each entity is stored as a dictionary. Example ("Tokyo"):

{
  'wd': 'Q1322032',            # Wikidata ID (string)
  'wp': 'Tokyo',               # Wikipedia ID; prepend https://en.wikipedia.org/wiki/ to get the Wikipedia URL (string)
  'dp': 'Tokyo',               # DBpedia ID; prepend http://dbpedia.org/resource/ to get the DBpedia URL (string)
  'label': 'Tokyo',            # entity label (string)
  'aliases': ['To-kyo', 'Tôkyô Prefecture', ...],            # other entity names (list)
  'aliases_multilingual': ['东京小子', 'طوكيو', ...],          # other entity names in other languages (list)
  'types_specific': 'City',    # direct entity type (string)
  'types_transitive': ['Human settlement', 'City', 'PopulatedPlace', 'Location', 'Place', 'Settlement'],  # transitive entity types (list)
  'claims_entity': {           # entity statements (dictionary): property -> list of tail entities
    'governingBody': ['Tokyo Metropolitan Government'],
    'subdivision': ['Honshu', 'Kantō region'],
    ...
  },
  'claims_literal': {
    'string': {                # string literals (dictionary): property -> list of values
      'postalCode': ['JP-13'],
      'utcOffset': ['+09:00', '+9'],
      ...
    },
    'time': {                  # time literals (dictionary): property -> list of dates
      'populationAsOf': ['2016-07-31'],
      ...
    },
    'quantity': {              # numerical literals (dictionary): property -> list of values
      'populationDensity': [6224.66, 6349.0],
      'maximumElevation': [2017],
      ...
    }
  },
  'pagerank': 2.2167366040153352e-06  # entity PageRank score computed on the DBpedia graph
}
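A hedged sketch for reading the dump, assuming one JSON object per line inside the bz2 archive (as the .jsonl.bz2 name suggests), streamed without decompressing to disk:

```python
# Minimal sketch: stream entities from the bz2-compressed JSON-lines dump.
# Assumes one JSON object per line with the fields described above
# ('dp', 'label', 'types_specific', 'pagerank', ...).
import bz2
import json

def iter_entities(path="_dbpedia_entities_2016-10.jsonl.bz2"):
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for i, entity in enumerate(iter_entities()):
    print(entity["dp"], entity.get("types_specific"), entity.get("pagerank"))
    if i >= 4:  # only peek at the first few entities
        break
```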
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains information for each of the countries from DBpedia, a large general-purpose KG that is created from Wikipedia.
Let’s take a look at how the KG looks in the neighbourhood of a specific country: 🇧🇪 Belgium 🇧🇪. This process is analogous to going to its corresponding DBpedia page and then recursively clicking on all the links on that page.
[Figure: kg_creation.png, illustrating how the KG is built up around Belgium by recursively expanding its DBpedia page]
We created a custom dataset with country information using the DBpedia SPARQL endpoint. We retrieved a list of countries from the University of Mannheim “Semantic Web for Machine Learning” repository. Each country carries information about its inflation and academic output; this information is binarized into “high” and “low”, giving two binary classification tasks. Moreover, for each country we retrieved its continent (Europe, Asia, Americas, Africa or Oceania), which gives a 5-class classification task. The KG with the information about these countries is a subset of DBpedia: for each country, we retrieved all information by expanding the KG three hops outward. Due to a rate limit on the SPARQL endpoint, only a maximum of 10,000 nodes at depth 3 (and their parents) are included.
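For illustration, country information of this kind can be pulled from the public DBpedia SPARQL endpoint roughly as follows; this is only a sketch of the endpoint usage (the dbo:Country / dbo:continent query below is an assumption, not the authors' exact extraction query):

```python
# Hedged sketch: query the public DBpedia SPARQL endpoint for countries and
# their continents, in the spirit of how this dataset was assembled.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT DISTINCT ?country ?continent WHERE {
        ?country a dbo:Country ;
                 dbo:continent ?continent .
    }
    LIMIT 20
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["country"]["value"], "->", row["continent"]["value"])
```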
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
QBLink-KG is a modified version of QBLink, a high-quality benchmark for evaluating conversational understanding of Wikipedia content. QBLink consists of sequences of up to three hand-crafted queries, with responses being single named entities that match the titles of Wikipedia articles.

For QBLink-KG, the English subset of the DBpedia snapshot from September 2021 was used as the target knowledge graph. QBLink answers given as the titles of Wikipedia infoboxes can be mapped to DBpedia entity URIs (when the corresponding entities are present in DBpedia), since DBpedia is constructed by extracting information from Wikipedia infoboxes.

QBLink, in its original form, is not directly applicable to Conversational Entity Retrieval from a Knowledge Graph (CER-KG) because knowledge graphs contain considerably less information than Wikipedia: a named entity serving as an answer to a QBLink query may not be present as an entity in DBpedia. To adapt QBLink for CER over DBpedia, we implemented two filtering steps: 1) we removed all queries for which the wiki_page field is empty, or whose answer cannot be mapped to a DBpedia entity or does not match a Wikipedia page; 2) for the evaluation of a model with specific techniques for entity linking and candidate selection, we excluded queries whose answers do not belong to the set of candidate entities derived using that model.

The original QBLink dataset files before filtering are:
QBLink-train.json
QBLink-dev.json
QBLink-test.json

The final QBLink-KG files after filtering are:
QBLink-Filtered-train.json
QBLink-Filtered-dev.json
QBLink-Filtered-test.json

We used the following references to construct QBLink-KG:
Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. A dataset and baselines for sequential open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1077–1083, Brussels, Belgium. Association for Computational Linguistics.
https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-09
Lehmann, Jens, et al. "DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia." Semantic Web Journal, vol. 6, no. 2, 2015, pp. 167–195.

For more details about QBLink-KG, please read our research paper: Zamiri, Mona, et al. "Benchmark and Neural Architecture for Conversational Entity Retrieval from a Knowledge Graph", The Web Conference 2024.
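As a concrete illustration of the answer-to-entity mapping described above: a Wikipedia article title maps to a candidate DBpedia IRI by prefixing the resource namespace and replacing spaces with underscores. Whether the IRI actually exists in the DBpedia 2021-09 snapshot still has to be checked; this is a hedged sketch of the idea, not the exact filtering code used for QBLink-KG:

```python
# Map a Wikipedia article title to its candidate DBpedia entity IRI.
# Existence of the IRI in the target snapshot must be verified separately.
def wikipedia_title_to_dbpedia_iri(title: str) -> str:
    return "http://dbpedia.org/resource/" + title.strip().replace(" ", "_")

print(wikipedia_title_to_dbpedia_iri("Tokyo Tower"))
# -> http://dbpedia.org/resource/Tokyo_Tower
```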
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The uploaded datasets contain automatically extracted SHACL shapes for the DBpedia [1], YAGO-4 [2] and LUBM [3] (with scale factor 500) datasets.
These validating SHACL shapes are generated by a program that parses the corresponding RDF files (in `.nt` format).
The shapes encode various SHACL constraints, e.g., literal types or RDF types. For each shape, we encode its coverage as the number of entities satisfying that shape; this information is encoded using the void:entities predicate.
We have provided, as an executable JAR file, the program we developed to extract these SHACL shapes.
More details about the datasets used to extract these shapes and how to run the Jar are available on our GitHub repository https://github.com/Kashif-Rabbani/validatingshapes.
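A hedged sketch of how such a shapes file can be inspected with rdflib, listing each node shape with its target class and void:entities coverage (the file name below is a placeholder; see the GitHub repository for the actual per-dataset files):

```python
# Minimal sketch: load an extracted SHACL shapes file and print, for each
# node shape, its target class and the void:entities coverage count.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

SH = Namespace("http://www.w3.org/ns/shacl#")
VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.parse("dbpedia_shapes.ttl", format="turtle")  # placeholder file name

for shape in g.subjects(RDF.type, SH.NodeShape):
    target = g.value(shape, SH.targetClass)
    entities = g.value(shape, VOID.entities)
    print(shape, target, entities)
```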
[1] Auer, Sören, et al. "DBpedia: A nucleus for a web of open data." The Semantic Web. Springer, Berlin, Heidelberg, 2007. 722-735.
[2] Pellissier Tanon, Thomas, Gerhard Weikum, and Fabian Suchanek. "YAGO 4: A reason-able knowledge base." European Semantic Web Conference. Springer, Cham, 2020.
[3] Guo, Yuanbo, Zhengxiang Pan, and Jeff Heflin. "LUBM: A benchmark for OWL knowledge base systems." Journal of Web Semantics 3.2-3 (2005): 158-182.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This chart provides a detailed overview of the number of French online retailers by monthly views. Most French stores have fewer than 100 monthly views: 85.88K stores, or 44.47% of the total. In second place, 58.67K stores (30.38% of the total) have 100 to 1K monthly views, while 34.4K stores (17.81% of the total) have 1K to 10K monthly views. This breakdown provides a comprehensive picture of the distribution, performance and efficiency of French online retailers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The number of intersections for DBpedia-Yago.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Previous studies have shown impaired memory for faces following restricted sleep.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the SAXS data obtained at the European Synchrotron Radiation Facility beamline ID02. All normalised SAXS data files and processed fits, including visualisation data via MATLAB, are included. All data relevant to the figures and table presented in the article is contained within this archive.
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
LC-QuAD 2.0 is a dataset designed to advance the state of intelligent querying. By providing a collection of 30,000 pairs of questions and their respective SPARQL queries, it presents an enormous opportunity for anyone looking to unlock the power of knowledge with smart querying techniques.
These questions have been carefully devised to target recent versions of Wikidata and DBpedia, granting tech-savvy individuals an access key to an information repository far beyond what was once thought imaginable. The dataset consists not just of natural language questions but also their solutions in the form of SPARQL queries. With LC-QuAD 2.0, you have more than thirty thousand question-query pairs at your fingertips. Unlocking knowledge has never been easier!
Using the LC-QuAD 2.0 dataset can be a great way to power up your intelligent systems with smarter querying. Whether you want to build a question-answering system or create new knowledge graphs and search systems, utilizing this dataset can certainly be helpful. Here is a guide on how to use this dataset:
Understand the structure of the data: LC-QuAD 2.0 consists of 30,000 pairs of questions and their corresponding SPARQL queries, split into two files: train (used for training an intelligent system) and test (used for testing it). The columns present for each pair are NNQT_question (the natural language question), subgraph (subgraph information for the question), sparql_dbpedia18 (the SPARQL query for DBpedia 2018) and template (the template from which the SPARQL query was generated).
Read up on SPARQL: before using this dataset, it is important to understand what SPARQL is and how it works, as it is used throughout the data. This will make the content easier and quicker to understand.
Start exploring: look at each pair in detail, relate its natural language question and subgraph information to the corresponding SPARQL query, and try running the queries yourself against the Wikidata or DBpedia endpoints to see what they return (a minimal sketch follows these steps).
Use your own data: once you are familiar with the available pairs and understand their structure, consider extending the dataset with more complex questions and associated attributes suited to your domain, and evaluate whether feature selection or a different classifier is needed to reduce overfitting and improve generalization.
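A minimal sketch of the exploration step above: load one pair from the train split and run its DBpedia query against the public endpoint. The file name and JSON layout are assumptions based on the column description, and queries written for DBpedia 2018 may not return identical results on the live endpoint:

```python
# Hedged sketch: read one (question, SPARQL) pair and execute the query
# against the public DBpedia endpoint.
import json
from SPARQLWrapper import SPARQLWrapper, JSON

with open("train.json", encoding="utf-8") as f:   # assumed file name
    examples = json.load(f)

example = examples[0]
print("Question:", example["NNQT_question"])
print("Query   :", example["sparql_dbpedia18"])

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery(example["sparql_dbpedia18"])
endpoint.setReturnFormat(JSON)
print(endpoint.query().convert())
```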
- Incorporating the LC-QUAD 2.0 dataset into Intelligent systems such as Chatbots, Question Answering Systems, and Document Summarization programs to allow them to retrieve the required information by transforming natural language questions into SPARQL queries.
- Utilizing this dataset in Semantic Scholar Search Engines and Academic Digital Libraries which can use natural language queries instead of keywords in order to perform more sophisticated searches and provide more accurate results for researchers in diverse areas.
- Applying this dataset for building knowledge graphs that store entities along with their attributes, categories and relations, thereby allowing a better understanding of complex relationships between entities and further advancing the development of AI agents that can answer specific questions or provide personalized recommendations in various contexts or tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Biology
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The columns are individual-id, sex and population.
Data and MATLAB files to reproduce Figures 8 and 9.
This repository contains lattice-gas Monte Carlo simulation data and supporting code for the paper "Lattice-Geometry Effects in Garnet Solid Electrolytes: A Lattice-Gas Monte Carlo Simulation Study".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains microscopy images collected to demonstrate imaging capabilities of the OpenFlexure Microscope. Images for bright-field transmission and reflection, polarisation contrast, and fluorescence imaging are provided. A set of images obtained from a large tile scan are provided, along with the Microsoft Image Composite Editor file used for tiling.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This script is for analyzing the empirical data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The pie chart showcases the distribution of app/software spending by store category in France, providing insights into how eCommerce stores allocate their resources to the apps and software they use. Among the store categories, Apparel exhibits the highest spending, with a total expenditure of $10.65M, representing 12.44% of the overall spending. Following closely behind is Beauty & Fitness with a spend of $4.80M, comprising 5.61% of the total. Home & Garden also contributes significantly with a spend of $2.64M, accounting for 3.08% of the overall app/software spending. This data sheds light on the investment patterns of eCommerce stores within each category, reflecting their priorities and resource allocation towards app or software solutions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
flight trajectories.TXT