Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
WikiDBs is an open-source corpus of 100,000 relational databases, intended to support research on tabular representation learning over multi-table data. The corpus is built from Wikidata and is designed to reflect characteristics of real-world databases.
WikiDBs was published as a spotlight paper in the Datasets & Benchmarks track at NeurIPS 2024.
WikiDBs contains the database schemas as well as the table contents. The database tables are provided as CSV files and each database schema as a JSON file. The 100,000 databases are available in five splits of 20k databases each. In total, around 165 GB of disk space is required for the full corpus. We also provide a script to convert the databases into SQLite.
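As a rough illustration of the layout described above, the sketch below loads one database folder (CSV tables plus a JSON schema) into SQLite. The folder path and the file name schema.json are assumptions, and the conversion script shipped with the corpus should be preferred for faithful results.

```python
import json
import sqlite3
from pathlib import Path

import pandas as pd


def load_wikidb_to_sqlite(db_dir: str, sqlite_path: str) -> None:
    """Load one WikiDBs database (a folder of CSV tables plus a JSON schema)
    into a single SQLite file. File names are assumptions; the corpus ships
    its own conversion script, which should be preferred."""
    db_dir = Path(db_dir)
    conn = sqlite3.connect(sqlite_path)

    # Hypothetical schema file name and key; adjust to the actual corpus layout.
    schema_file = db_dir / "schema.json"
    if schema_file.exists():
        schema = json.loads(schema_file.read_text(encoding="utf-8"))
        print("Database name:", schema.get("database_name", db_dir.name))

    # Treat each CSV file as one table, named after the file stem.
    for csv_file in sorted(db_dir.glob("*.csv")):
        table = pd.read_csv(csv_file)
        table.to_sql(csv_file.stem, conn, if_exists="replace", index=False)

    conn.commit()
    conn.close()


# Example usage (paths are placeholders):
# load_wikidb_to_sqlite("wikidbs/split_00/db_00001", "db_00001.sqlite")
```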
A collection of SQLite database files containing all the data retrieved from the Wikidata JSON dump of 2022-05-02 and the Wikipedia SQL dumps of 2022-05-01, in the context of analyzing the consistency between Wikipedia and Wikidata categories. Detailed information can be found on the GitHub page.
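For a quick orientation, a minimal sketch such as the following can be used to inspect one of the SQLite files; the file name is a placeholder, and the actual table layout is documented on the GitHub page.

```python
import sqlite3

# Placeholder path; point it at one of the SQLite files from this collection.
conn = sqlite3.connect("wikidata_categories.db")

# List all tables via the built-in sqlite_master catalog.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
for (name,) in tables:
    # Show each table's column names as reported by PRAGMA table_info.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({name})")]
    print(name, cols)

conn.close()
```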
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
WikiDBs-10k (https://wikidbs.github.io/) is a corpus of relational databases built from Wikidata (https://www.wikidata.org/). This is the preliminary 10k version; the newer version with 100k databases (https://zenodo.org/records/11559814) includes more coherent databases and more diverse table and column names.
The WikiDBs-10k corpus consists of 10,000 databases; for more details, see our paper: https://ceur-ws.org/Vol-3462/TADA3.pdf (TaDA@VLDB'23).
Each database is saved in a sub-folder; the tables are provided as CSV files and the database schema as a JSON file.
We thank Till Döhmen and Madelon Hulsebos for generously providing the table statistics from their GitSchemas dataset and Jan-Micha Bodensohn for converting the dataset to SQLite files. This work has been supported by the BMBF and the state of Hesse as part of the NHR Program and the BMBF project KompAKI (grant number 02L19C150), as well as the HMWK cluster project 3AI. Finally, we want to thank hessian.AI, and DFKI Darmstadt for their support.
CC0-1.0: https://bioregistry.io/spdx:CC0-1.0
Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain types of data which can be used by Wikimedia projects such as Wikipedia. Wikidata functions as a document-oriented database, centred on individual items. Items represent topics, for which basic information is stored that identifies each topic.
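To make the item-centric structure concrete, the sketch below fetches the JSON document for a single item through Wikidata's public Special:EntityData endpoint; the item Q42 is only an illustrative example.

```python
import json
import urllib.request

# Fetch the JSON document for one Wikidata item (Q42, Douglas Adams) via the
# public Special:EntityData endpoint.
URL = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"

with urllib.request.urlopen(URL) as resp:
    doc = json.load(resp)

entity = doc["entities"]["Q42"]
# Each item document bundles labels, descriptions, aliases, and claims.
print(entity["labels"]["en"]["value"])
print(entity["descriptions"]["en"]["value"])
print(len(entity.get("claims", {})), "properties with statements")
```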
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset lists the organisations on data.gouv.fr that are linked to the Wikidata.org database. The data is still being consolidated and should be used with caution. As an indication, reconciling the list of data.gouv.fr organisations with Wikidata can be used to:
* analyse the nature of the organisations (public administration, joint authority, company, association, etc.);
* obtain additional information about these organisations (e.g. GitHub or Twitter accounts);
* obtain alternative labels to the data.gouv.fr labels;
* display links to an organisation's data.gouv.fr page from the corresponding Wikipedia article;
* get a view of the hierarchy of organisations (knowing that one organisation is a subsidiary of another).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
EventWiki is a knowledge base of major events throughout human history. It contains 21,275 events of 95 types. Details of the event entries can be found in our paper submission and the documentation file. The data in the knowledge base is mainly harvested from Wikipedia. Like Wikipedia, this resource can be distributed and shared under the CC BY 3.0 license.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the Wikidata dump of Feb 23, 2015, codified in four alternative schemes for our work Reifying RDF: What Works Well With Wikidata?, presented at the International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), Bethlehem, Pennsylvania, Oct 11, 2015.
WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset was created to develop and evaluate a cataloging system that assigns appropriate metadata to image records for database management in digital libraries. It targets an evaluation task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags.
A main characteristic of the dataset is that it includes ambiguous tags, so the visual content of an image is not uniquely determined by its tags. For example, it includes the tag 'mouse', which can refer to a computer pointing device rather than the mammal. The annotations are the Wikipedia articles judged by humans to be the correct entities for the tags.
The dataset offers both the data and the programs needed to reproduce experiments for the above-mentioned task. The data consist of image sources and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides a Jupyter notebook (scripts.ipynb) to run a series of experiments with some baseline methods for the designated task and to evaluate the results.
data directory
1.1. image_URL.txt: lists the URLs of the image files.
1.2. rels.txt: lists the correct Wikipedia pages for each topic in topics.txt.
1.3. topics.txt: lists the target pairs, each called a topic in this dataset, of an image and a tag to be disambiguated.
1.4. enwiki_20171001.xml: text extracted from the titles and bodies of English Wikipedia articles as of 1 October 2017. This is a modified version of the Wikipedia dump data (https://archive.org/download/enwiki-20171001).
img directory: a placeholder directory into which the image files are downloaded.
results directory: a placeholder directory for storing result files for evaluation. It ships with the results of three baseline methods in sub-directories; each JSON file is the result for one topic, and they can be evaluated with the evaluation scripts in scripts.ipynb as a reference for both usage and performance.
scripts.ipynb: a Jupyter notebook containing the scripts for running the baseline methods and the evaluation.
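As a rough illustration of how the distributed files fit together, the sketch below reads topics.txt and rels.txt and scores a dummy prediction. The tab-separated layouts assumed here are guesses, and the bundled scripts.ipynb remains the authoritative reference for parsing and evaluation.

```python
import csv
from collections import defaultdict
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into rows (the delimiter is an assumption;
    see the bundled scripts.ipynb for the authoritative parsing logic)."""
    with open(path, encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="\t") if row]


# Assumed layouts: topics.txt holds (topic_id, image, tag) records and
# rels.txt holds (topic_id, wikipedia_page) records, one per line.
topics = read_tsv(Path("data") / "topics.txt")
rels = defaultdict(set)
for row in read_tsv(Path("data") / "rels.txt"):
    rels[row[0]].add(row[1])

# A prediction maps topic_id -> predicted Wikipedia page; here a dummy
# baseline that always predicts the empty string.
predictions = {row[0]: "" for row in topics}

# Accuracy: fraction of topics whose predicted page is among the judged pages.
correct = sum(1 for t, page in predictions.items() if page in rels[t])
print(f"accuracy: {correct / max(len(predictions), 1):.3f}")
```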
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Wikidata offers a wide range of general data about our universe as well as links to other databases. The data is published under the CC0 "Public domain dedication" license. It can be edited by anyone and is maintained by Wikidata's editor community.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a dump of Wikidata in JSON format, produced on January 4, 2016 by the Wikimedia Foundation. Wikidata historical dumps are not preserved by the Wikimedia Foundation, so this dump is distributed here to keep our experiments repeatable over time.
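Wikidata JSON dumps are a single large JSON array with, in practice, one entity per line, so they can be streamed entity by entity. The sketch below assumes a gzip-compressed dump file; the file name is a placeholder for the archive distributed with this record.

```python
import gzip
import json

# Placeholder file name for the archived dump distributed with this record.
DUMP = "wikidata-20160104-all.json.gz"

# The dump is one large JSON array, but each entity sits on its own line,
# so it can be streamed without loading the whole file into memory.
with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        entity = json.loads(line)
        # Each entity has an 'id' (e.g. Q42) and multilingual 'labels'.
        label = entity.get("labels", {}).get("en", {}).get("value", "")
        print(entity["id"], label)
        break  # remove to process the full dump
```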
Terms of use: https://whoisdatacenter.com/terms-of-use/
.WIKI Whois Database: discover comprehensive ownership details, registration dates, and more for the .WIKI TLD with Whois Data Center.
Back-end Oracle database to house NetQOS network performance and network characterization data for historical analysis and reporting.
Wikipedia revision metadata for every edit to every page in seven major language versions of Wikipedia. Files in this collection are database dumps originally from https://dumps.wikimedia.org and were created by the Wikimedia Foundation, which deletes dumps after approximately 6 months. They are being uploaded here to preserve computational reproducibility of research projects based on these specific dumps. Files are in the format [language]wiki-20170420-stub-meta-history[part].xml.gz for English (en), German (de), Spanish (es), French (fr), Japanese (ja), Portuguese (pt), and Chinese Mandarin (zh) Wikipedias. Material found within is copyright Wikipedia contributors and is freely licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
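Each stub-meta-history file is a gzip-compressed MediaWiki XML export containing revision metadata but no revision text, so it can be streamed with a pull parser as sketched below. The export namespace version and the part suffix in the file name are assumptions that may differ between dumps.

```python
import gzip
import xml.etree.ElementTree as ET

# File name pattern from the description; the part suffix is a placeholder.
DUMP = "enwiki-20170420-stub-meta-history1.xml.gz"

# MediaWiki export namespace; the exact version may differ per dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

with gzip.open(DUMP, "rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            revisions = elem.findall(NS + "revision")
            # Stub dumps carry revision metadata (ids, timestamps) but no text.
            first_ts = revisions[0].findtext(NS + "timestamp") if revisions else None
            print(title, len(revisions), first_ts)
            elem.clear()  # release the page subtree while streaming
```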
WikiSQL consists of a corpus of 87,726 hand-annotated pairs of SQL queries and natural language questions. These pairs are split into training (61,297 examples), development (9,145 examples) and test (17,284 examples) sets. It can be used for natural language tasks over relational databases, such as translating questions into SQL queries.
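For orientation, the sketch below reads the JSON-lines files from the original WikiSQL release and prints one question together with its structured SQL annotation; the file paths and field names follow the public release layout and should be treated as assumptions if the data was obtained from another mirror.

```python
import json


def read_jsonl(path):
    """Read a JSON-lines file into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# File names follow the original WikiSQL release layout (assumed here).
train = read_jsonl("data/train.jsonl")
tables = {t["id"]: t for t in read_jsonl("data/train.tables.jsonl")}

example = train[0]
table = tables[example["table_id"]]
# Each example pairs a natural language question with a structured SQL sketch
# (selected column, aggregation operator, and conditions) over one table.
print(example["question"])
print(example["sql"])
print(table["header"])
```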
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A global reference dataset on cropland was collected through a crowdsourcing campaign implemented using Geo-Wiki. The reference dataset is based on a systematic sample at latitude and longitude intersections, enhanced in locations where the cropland probability varies between 25% and 75% for a better representation of cropland globally. Over a three-week period, around 36,000 samples of cropland were collected. For the purpose of quality assessment, additional datasets are provided. One is a control dataset of 1,793 sample locations validated by students trained in image interpretation; it was used to assess the quality of the crowd validations as the campaign progressed. Another contains 60 expert or gold-standard validations for further evaluation of participant quality. Each of these three datasets has two parts: one showing cropland only, and one compiled per location and user. The reference dataset will be used to validate and compare medium- and high-resolution cropland maps generated using remote sensing. It can also be used to train classification algorithms for developing new maps of land cover and cropland extent.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Datasets that can be used together with the code at https://github.com/JanKalo/RuleAlign.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This resource includes complementary documentation that aims to make the experiments described in the paper Querying Wikidata: Comparing SPARQL, Relational and Graph Databases (by Daniel Hernández, Aidan Hogan, Cristian Riveros, Carlos Rojas and Enzo Zerega) repeatable.
First release of a BridgeDb pathway identifier mapping database. It currently supports Wikidata and WikiPathways identifiers. Released under CC0 (CCZero).
[INFO]: Database finished.
INFO: old database is Wikidata 1.0.0 (build: 20211211)
INFO: new database is Wikidata 1.0.0 (build: 20211211)
INFO: Number of ids in Wd (Wikidata): 905 (unchanged)
INFO: Number of ids in Wp (WikiPathways): 900 (unchanged)
INFO: new size is 2 Mb (changed +0.0%)
INFO: total number of identifiers is 1805
INFO: total number of mappings is 1810
The data store supporting processes developed by the Division of Database Support (DDBS), such as the commit checkpoint routine.