CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This product consists of the XML schema and the JSON schema aligned with the Scholix Metadata Guidelines Version 4. It contains the .json and .xsd files together with examples of compatible metadata records. Changes from the previous update are backward compatible and include the following:
- The schema admits, for the field type (typology of source/target objects), terms of the following vocabulary: publications, datasets, software, other research types (version 3.0 included only literature and dataset).
- The schema includes a new optional field subtype, which holds the specific sub-type of the objects according to the OpenAIRE classification of publications, datasets, software, and other products (for more).
- The schema admits multiple entries for the field Identifier in both the source and target objects; this is to specify the list of PIDs resulting from deduplication on OpenAIRE (i.e. the same publication may have been collected from Crossref and from EuropePMC, thus including both PIDs).
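A minimal, non-authoritative sketch of how the bundled JSON schema could be used to check one of the example records in Python; the file names scholix-v4-schema.json and example-record.json are placeholders for the actual file names in this product:
import json
from jsonschema import validate  # third-party package: jsonschema

# Load the bundled JSON schema and one example record (file names are assumptions).
with open("scholix-v4-schema.json") as f:
    schema = json.load(f)
with open("example-record.json") as f:
    record = json.load(f)

# Raises jsonschema.ValidationError if the record is not compatible with the schema.
validate(instance=record, schema=schema)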
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset contains a preprint version of the conference paper (.pdf), presentation slides (.pptx), and the dataset(s) and validation schema(s) for the IDCC 2019 (Melbourne) conference paper: The Red Queen in the Repository: metadata quality in an ever-changing environment. Datasets and schemas are in .xml, .xsd, Excel (.xlsx) and .csv (two files representing two different sheets in the .xlsx file). The validationSchemas.zip holds the additional validation schemas (.xsd) that were not found in the schemaLocations of the metadata xml-files to be validated. The schemas must all be placed in the same folder, and are to be used for validating the Dataverse dcterms records (with metadataDCT.xsd) and the Zenodo oai_datacite feeds (with schema.datacite.org_oai_oai-1.0_oai.xsd), respectively. In the latter case, a simpler way of doing it might be to replace the incorrect URL "http://schema.datacite.org/oai/oai-1.0/ oai_datacite.xsd" in the schemaLocation of these xml-files with the correct one: schemaLocation="http://schema.datacite.org/oai/oai-1.0/ http://schema.datacite.org/oai/oai-1.0/oai.xsd", as has already been done in the sample files here. The sample file folders testDVNcoll.zip (Dataverse), testFigColl.zip (Figshare) and testZenColl.zip (Zenodo) contain all the metadata files tested and validated that are registered in the spreadsheet with objectIDs.
In the case of Zenodo, one original file feed,
zen2018oai_datacite3orig-https%20_zenodo.org_oai2d%20verb=ListRecords%26metadata
Prefix=oai_datacite%26from=2018-11-29%26until=2018-11-30.xml ,
is also supplied to show what was necessary to change in order to perform validation as indicated in the paper.
For Dataverse, a corrected version of a file,
dvn2014ddi-27595Corr_https%20_dataverse.harvard.edu_api_datasets_export%20
exporter=ddi%26persistentId=doi%253A10.7910_DVN_27595Corr.xml ,
is also supplied in order to show the changes it would take to make the file validate without error.
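A rough sketch of the validation step described above, using Python's lxml with all schemas placed in the working directory; the record file name below is a hypothetical placeholder:
from lxml import etree

# metadataDCT.xsd and its companion schemas are assumed to sit in the current folder.
schema = etree.XMLSchema(etree.parse("metadataDCT.xsd"))
record = etree.parse("example_dataverse_dcterms_record.xml")  # hypothetical file name

if not schema.validate(record):
    for error in schema.error_log:
        print(error.line, error.message)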
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.
Some top-level terms may contain a null value if the corresponding field was missing in the original metadata.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
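A minimal sketch (not part of the dataset) for streaming one of the gzipped JSON-lines files in Python; the date in the file name is an example placeholder following the export pattern above:
import gzip
import json

# Stream the records export line by line without decompressing it to disk.
with gzip.open("zenodo_open_metadata_2021-01-01.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record.get("spam"):
            continue  # skip records marked as spam by Zenodo staff
        print(record["recid"], record.get("title"))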
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is related to the manuscript "An empirical meta-analysis of the life sciences linked open data on the web" published at Nature Scientific Data. If you use the dataset, please cite the manuscript as follows: Kamdar, M.R., Musen, M.A. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 8, 24 (2021). https://doi.org/10.1038/s41597-021-00797-y
We have extracted schemas from more than 80 publicly available biomedical linked data graphs in the Life Sciences Linked Open Data (LSLOD) cloud into an LSLOD schema graph and conducted an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. The dataset published here contains the following files:
- The set of Linked Data Graphs from the LSLOD cloud from which schemas are extracted.
- Refined sets of extracted classes, object properties, data properties, and datatypes shared across the Linked Data Graphs on the LSLOD cloud. Where a schema element is reused from a Linked Open Vocabulary or an ontology, it is explicitly indicated.
- The LSLOD Schema Graph, which contains all the above extracted schema elements interlinked with each other based on the underlying content. Sample instances and sample assertions are also provided along with broad-level characteristics of the modeled content. The LSLOD Schema Graph is saved as a JSON Pickle file. To read the JSON object in this Pickle file, use the following Python commands:
import pickle
with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
    x = pickle.load(infile, encoding='iso-8859-1')
Check the Referenced Link for more details on this research, raw data files, and code references.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vocabulary to include sample code in a schema. Can work with XSLT (http://purl.org/net/ns/ns-schema.xsl) to present a schema as an XHTML list with examples.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary
GitTables 1M (https://gittables.github.io) is a corpus of, currently, 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license permitting distribution. We aim to grow this to at least 10M tables.
Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.
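A hedged sketch of how a single table and its attached metadata could be inspected with pyarrow; the file name is a placeholder, and the exact metadata key names should be taken from the GitTables documentation rather than from this example:
import pyarrow.parquet as pq

# Read one table from a topic subset (file name is an assumption).
table = pq.read_table("some_table.parquet")
print(table.num_rows, table.num_columns)

# Column annotations and other metadata are stored as key/value pairs
# in the parquet file metadata; keys and values are bytes.
metadata = table.schema.metadata or {}
for key, value in metadata.items():
    print(key, value[:120])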
We believe GitTables can facilitate many use-cases, among which:
Data integration, search and validation.
Data visualization and analysis recommendation.
Schema analysis and completion for e.g. database or knowledge base design.
If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).
Dataset contents
The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper: https://arxiv.org/abs/2106.07258v4.
In summary, this dataset can be characterized as follows:
Statistic | Value |
---|---|
# tables | 1M |
average # columns | 12 |
average # rows | 142 |
# annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
# unique semantic types | 835 (DBpedia), 677 (Schema.org) |
How to download
The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).
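A minimal, non-authoritative sketch of a full download via Zenodo's REST API with requests; the record id is a placeholder to be replaced by this dataset's Zenodo record id, and the response field names follow Zenodo's REST API at the time of writing:
import requests

RECORD_ID = "1234567"  # placeholder: replace with the GitTables Zenodo record id
record = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}").json()

# Download every zipped topic subset attached to the record.
for f in record.get("files", []):
    url = f["links"]["self"]
    name = f["key"]
    print("downloading", name)
    with requests.get(url, stream=True) as r, open(name, "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)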
Future releases
Future releases will include the following:
Increased number of tables (expected at least 10M)
Associated datasets
- GitTables benchmark - column type detection: https://zenodo.org/record/5706316
- GitTables 1M - CSV files: https://zenodo.org/record/6515973
ClaimsKG is a knowledge graph of metadata information for fact-checked claims scraped from popular fact-checking sites. In addition to providing a single dataset of claims and associated metadata, ClaimsKG harmonizes truth ratings and provides additional information for each claim, e.g., about mentioned entities. Please see https://data.gesis.org/claimskg/ for further details about the data model, query examples and statistics.
The dataset facilitates structured queries about claims, their truth values, involved entities, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia/Wikipedia, and lifts all data to RDF using established vocabularies (such as schema.org).
The latest release of ClaimsKG covers 74,066 claims and 72,127 claim reviews. This is the fourth release of the dataset: data was scraped up to Jan 31, 2023 and contains claims published between 1996 and 2023 from 13 fact-checking websites: Fullfact, Politifact, TruthOrFiction, Checkyourfact, Vishvanews, AFP (French), AFP, Polygraph, EU factcheck, Factograph, Fatabyyano, Snopes and Africacheck. The claim-review (fact-checking) period for claims ranges from 1996 to 2023. As in the previous release, the Entity Fishing python client (https://github.com/hirmeos/entity-fishing-client-python) has been used for entity linking and disambiguation. Improvements have been made in the web scraping and data preprocessing pipeline to extract more entities from both claims and claim reviews. Currently, ClaimsKG contains 3,408,386 entities detected and referenced with DBpedia.
This latest release of ClaimsKG supersedes the previous versions, as it contains all the claims from the previous versions together with new claims, as well as improved entity annotation resulting in a higher number of entities.
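As an illustrative, hedged example of the structured queries mentioned above, the knowledge graph can be queried over SPARQL with SPARQLWrapper; the endpoint URL and the use of the schema.org ClaimReview class are assumptions based on the documentation site linked above and should be checked there:
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL is an assumption; see https://data.gesis.org/claimskg/ for the current one.
sparql = SPARQLWrapper("https://data.gesis.org/claimskg/sparql")
sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT (COUNT(?review) AS ?reviews) WHERE {
  ?review a schema:ClaimReview .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results["results"]["bindings"][0]["reviews"]["value"])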
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the Yelp Collaborative Knowledge Graph (YCKG), a transformation of the Yelp Open Dataset into RDF format using Y2KG.
Paper Abstract
The Yelp Open Dataset (YOD) contains data about businesses, reviews, and users from the Yelp website and is available for research purposes. This dataset has been widely used to develop and test Recommender Systems (RS), especially those using Knowledge Graphs (KGs), e.g., integrating taxonomies, product categories, business locations, and social network information. Unfortunately, researchers applied naive or wrong mappings while converting YOD in KGs, consequently obtaining unrealistic results. Among the various issues, the conversion processes usually do not follow state-of-the-art methodologies, fail to properly link to other KGs and reuse existing vocabularies. In this work, we overcome these issues by introducing Y2KG, a utility to convert the Yelp dataset into a KG. Y2KG consists of two components. The first is a dataset including (1) a vocabulary that extends Schema.org with properties to describe the concepts in YOD and (2) mappings between the Yelp entities and Wikidata. The second component is a set of scripts to transform YOD in RDF and obtain the Yelp Collaborative Knowledge Graph (YCKG). The design of Y2KG was driven by 16 core competency questions. YCKG includes 150k businesses and 16.9M reviews from 1.9M distinct real users, resulting in over 244 million triples (with 144 distinct predicates) for about 72 million resources, with an average in-degree and out-degree of 3.3 and 12.2, respectively.
Links
Latest GitHub release: https://github.com/MadsCorfixen/The-Yelp-Collaborative-Knowledge-Graph/releases/latest
PURL domain: https://purl.archive.org/domain/yckg
Files
yelp_schema_mappings.nt.gz: contains the mappings from Yelp categories to Schema things.
schema_hierarchy.nt.gz: contains the full hierarchy of the mapped Schema things.
yelp_wiki_mappings.nt.gz: contains the mappings from Yelp categories to Wikidata entities.
wikidata_location_mappings.nt.gz: contains the mappings from Yelp locations to Wikidata entities.
yelp_categories.ttl: contains metadata for all Yelp categories.
yelp_entities.ttl: contains metadata regarding the dataset.
yelp_vocabulary.ttl: contains metadata on the created Yelp vocabulary and properties.
yelp_category_schema_mappings.csv: contains the 310 mappings from Yelp categories to Schema types. These mappings have been manually verified to be correct.
yelp_predicate_schema_mappings.csv: contains the 14 mappings from Yelp attributes to Schema properties. These mappings were found manually.
ground_truth_yelp_category_schema_mappings.csv: contains the ground truth, based on 200 manually verified mappings from Yelp categories to Schema things. The ground truth mappings were used to calculate precision and recall for the semantic mappings.
manually_split_categories.csv: contains all Yelp categories containing either a & or /, and their manually split versions. The split versions have been used in the semantic mappings to Schema things.
This dataset is provided as an example of XML metadata that can be used to create records in ServCat for GIS datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]).
DataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a gzip-compressed tar data archive of 837.3 MB.
The archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve.
To facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files. These files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f. This file can also be found in the data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data. For more information about concepts and format, please see [2].
To extract all EML files from the included Preston archive, first extract the hashes associated with EML files using:
cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep "hash://" | sort | uniq > eml-hashes.txt
extract data.tar.gz using:
~/preston-archive$ tar xzf data.tar.gz
then use Preston to extract each hash using something like:
~/preston-archive$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa
Alternatively, without using Preston, you can extract the data using the naming convention:
data/[x]/[y]/[z]/[hash]/data
where x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.
For example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].
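The same naming convention can be expressed as a small helper (a sketch, not part of the archive):
# Derive the archive path for a given sha256 content hash using the convention above.
def archive_path(content_hash: str) -> str:
    h = content_hash.split("/")[-1]  # strip the hash://sha256/ prefix
    return f"data/{h[0:2]}/{h[2:4]}/{h[4:6]}/{h}/data"

print(archive_path("hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa"))
# prints: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data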
The intended use of this archive is to facilitate meta-analysis of the DataONE dataset network.
[1] DataONE, https://www.dataone.org
[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543. DataONE was crawled via Preston with "preston update -u https://dataone.org".
[3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
[4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep "hash://" | sort | uniq | wc -l
[5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains, for each included article, the text of the article and also all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected good articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits.
You can find more details in "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
where pageN is the title of the N-th Wikipedia page and contains all information about that page: text.json holds the text of the page, the img folder holds the page's images, and meta.json holds metadata for those images. Images are saved in jpg format with the width of each image set to 600px; the name of each image is the md5 hashcode of the original image title. Below you see an example of how data is stored in text.json:
{
"title": "Naval Battle of Guadalcanal",
"id": 405411,
"url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
"text": "The Naval Battle of Guadalcanal, sometimes referred to as... ",
}
where title is the page title, id is the unique page id, url is the url of the page on Wikipedia, and text is the plain text content of the article. Below is an example of how data is stored in meta.json:
{
"img_meta": [
{
"filename": "d681a3776d93663fc2788e7e469b27d7.jpg",
"title": "Metallica Damaged Justice Tour.jpg",
"description": "Metallica en concert",
"url": "https://en.wikipedia.org/wiki/File%3AMetallica_Damaged_Justice_Tour.jpg",
"features": [123.23, 10.21, ..., 24.17],
},
]
}
where filename is the unique image id (the md5 hashcode of the original image title), title and description are the image title and description, url is the url of the image on Wikipedia, and features is an image feature vector computed from the image downloaded in jpeg format with a fixed width of 600px; practically, it is a list of floats with len = 2048. Entries with ORIGINAL in the filename lack the features key. Raw images are available in the complete version of the dataset on Google Drive. Data was collected by fetching the articles' text & image content with the pywikibot library.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From 28 to 30 October 2024, ZB MED Information Centre for Life Sciences organized a hackathon in Cologne within the scope of NFDI4DataScience with the purpose of identifying core elements for a machine-actionable Data Management Plan (maDMP) across the German National Research Data Infrastructure (NFDI). We used the maDMP application profile from the Research Data Alliance DMP Commons Working Group as a starting/reference point. An additional element, ManagementPlan, was also included; ManagementPlan originally comes from DataCite v4.4, and its representation as an extension, proposed by the ZB MED/NFDI4DataScience machine-actionable Software Management Plan metadata schema, was used.
This dataset comprises crosswalks corresponding to the following elements from the RDA maDMP: DMP, Project, Dataset, Distribution. The crosswalks were created following a template prepared by the organizers. Participants freely selected a metadata schema or platform to be mapped against the attributes in the RDA maDMP application profile. Crosswalks were created for DMP, Project, Dataset, and Distribution, while comments were provided for DMP and Dataset.
The list of files in this dataset is as follows:
More information about the activities carried out during the hackathon and the analysis of the crosswalks will be available soon.
The activities and discussion reported here were carried out during a hackathon organized by the Semantic Technologies team at ZB MED Information Centre from 28 to 30 October 2024 in Cologne, Germany, and sponsored by the NFDI4DataScience consortium. NFDI4DataScience is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft – DFG) under grant No. 460234259.
The DMP4NFDI team acknowledges the support of DFG - German Research Foundation - through the coordination fund (project number 521466146).
David Wallace and Jürgen Windeck would like to thank the Federal Government and the Heads of Government of the Länder, as well as the Joint Science Conference (GWK), for their funding and support within the framework of the NFDI4ING consortium. Funded by the German Research Foundation (DFG) - project number 442146713.
Marco Reidelbach is supported by MaRDI, funded by the Deutsche Forschungsgemeinschaft (DFG), project number 460135501, NFDI 29/1 “MaRDI – Mathematische Forschungsdateninitiative”.
Dataset
As reference dataset, 2,218 public draft or complete genome assemblies and available metadata of Escherichia coli have been downloaded from EnteroBase in April 2017. Genomes have been selected on the basis of the ribosomal ST (rST) classification available in EnteroBase: from the same rST, genomes have been randomly selected and downloaded. The number of samples for each rST in the final dataset is proportional to those available in EnteroBase in April 2017. The dataset also includes 119 Shiga toxin-producing E. coli genomes assembled with INNUca v3.1 belonging to the INNUENDO Sequence Dataset (PRJEB27020). File 'Metadata/Ecoli_metadata.txt' contains metadata for each strain, including source classification, taxa of the hosts, country and year of isolation, serotype, pathotype, classical pubMLST 7-gene ST classification, assembly source/method and EnteroBase barcode. The directory 'Genomes' contains the 119 INNUca v3.1 assemblies of the strains listed in 'Metadata/Ecoli_metadata.txt'. EnteroBase assemblies can be downloaded from http://enterobase.warwick.ac.uk/species/ecoli/search_strains using 'barcode'.
Schema creation and validation
The wgMLST schema from EnteroBase has been downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci has been assessed using chewBBACA Schema Evaluation, and loci with single alleles, those with high length variability (i.e. if more than 1 allele is outside the mode +/- 0.05 size) and those present in less than 0.5% of the Escherichia genomes in EnteroBase at the date of the analysis (April 2017) have been removed. The wgMLST schema has been further curated, excluding all loci detected as "Repeated Loci" and loci annotated as "non-informative paralogous hit (NIPH/NIPHEM)" or "Allele Larger/Smaller than length mode (ALM/ASM)" by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 2,337 Escherichia coli genomes.
File 'Schema/Ecoli_wgMLST_7601_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 7,601 loci. File 'Schema/Ecoli_cgMLST_2360_listGenes.txt' contains the list of genes from the wgMLST schema which defines the cgMLST schema. The cgMLST schema consists of 2,360 loci and has been defined as the loci present in at least 99% of the 2,337 Escherichia coli genomes; these genomes have no more than 2% of missing loci. File 'Allele_Profles/Ecoli_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profiles of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci follow the annotation of the chewBBACA Allele Calling software. File 'Allele_Profles/Ecoli_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profiles of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci are indicated with a zero.
Additional citations
The schemas are prepared to be used with chewBBACA. When using the schema in this repository, please also cite: Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166
The Escherichia coli schema is a derivation of the EnteroBase E. coli wgMLST schema. When using the schema in this repository, please also cite: Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14(4):e1007261. https://doi.org/10.1371/journal.pgen.1007261
The isolates' genome raw sequence data produced within the activity of the INNUENDO project were submitted to the European Nucleotide Archive (ENA) database and are publicly available under the project accession number PRJEB27020. When using the schemas, the assemblies or the allele profiles, please include the project number in your publication. The research from the INNUENDO project has received funding from the European Food Safety Authority (EFSA), grant agreement GP/EFSA/AFSCO/2015/01/CT2 (New approaches in identifying and characterizing microbial and chemical hazards), and from the Government of the Basque Country. The conclusions, findings, and opinions expressed in this repository reflect only the view of the INNUENDO consortium members and not the official position of EFSA nor of the Government of the Basque Country. EFSA and the Government of the Basque Country are not responsible for any use that may be made of the information included in this repository. The INNUENDO consortium thanks the Austrian Agency for Health and Food Safety Limited for participating in the project by providing strains. The consortium thanks all the researchers and the authorities worldwide which are contributing by submitting the raw sequences of the bacterial strains in public repositories. The project wa...
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains, for each included article, the text of the article and also all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus they represent the best quality that human editors on Wikipedia can offer.
You can find more details in "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
label | description |
---|---|
pageN | is the title of N-th Wikipedia page and contains all information about the page |
text.json | text of the page saved as JSON. Please refer to the details of JSON schema below. |
meta.json | a collection of all images of the page. Please refer to the details of JSON schema below. |
imageN | is the N-th image of an article, saved in jpg format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title. |
Below you see an example of how data is stored:
{
"title": "Naval Battle of Guadalcanal",
"id": 405411,
"url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
"html": "...
...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }
key | description |
---|---|
title | page title |
id | unique page id |
url | url of a page on Wikipedia |
html | HTML content of the article |
wikitext | wikitext content of the article |
Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
{
"img_meta": [
{
"filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
"title": "IronbottomSound.jpg",
"parsed_title": "ironbottom sound",
"url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
"is_icon": False,
"on_commons": True,
"description": "A U.S. destroyer steams up what later became known as ...",
"caption": "Ironbottom Sound. The majority of the warship surface ...",
"headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
"features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
},
...
]
}
key | description |
---|---|
filename | unique image id, md5 hashcode of original image title |
title | image title retrieved from Commons, if applicable |
parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
url | url of an image on Wikipedia |
is_icon | True if image is an icon, e.g. category icon. We assume that image is an icon if you cannot load a preview on Wikipedia after clicking on it |
on_commons | True if image is available from Wikimedia Commons dataset |
description | description of an image parsed from Wikimedia Commons page, if available |
caption | caption of an image parsed from Wikipedia article, if available |
headings | list of all nested headings of location where article is placed in Wikipedia article. The first element is top-most heading |
features | output of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape (2048,). Features taken from original images downloaded in jpeg format with fixed width of 600px. Practically, it is a list of floats with len = 2048 |
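A minimal sketch (assuming one folder per page title and valid JSON files, as described above) for reading a single page in Python:
import json
import os

page_dir = "Naval Battle of Guadalcanal"  # assumed: folder named after the page title

with open(os.path.join(page_dir, "text.json"), encoding="utf-8") as f:
    page = json.load(f)
with open(os.path.join(page_dir, "meta.json"), encoding="utf-8") as f:
    meta = json.load(f)

print(page["title"], page["url"])
for img in meta["img_meta"]:
    # features is a 2048-dimensional vector; icons are flagged with is_icon
    print(img["filename"], img.get("is_icon"), len(img.get("features", [])))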
Data was collected by fetching featured articles text&image content with pywikibot library and then parsing out a lot of additional metadata from HTML pages from Wikipedia and Commons.
This dataset contains a list of sales and movement data by item and department, appended monthly. Update frequency: monthly.
This resource is a metadata compilation for Michigan geothermal related data in data exchange models submitted to the AASG National Geothermal Data System project to fulfill Year 3 data deliverables by the Michigan Geological Survey, Western Michigan University. Descriptions, links, and contact information for the ESRI Map Services created with Michigan data are also available here, including borehole temperature data, drill stem test data, lithology interval data, heat pump installations, physical samples, and well header data for the state of Michigan. The data and associated services were provided by the Michigan Geological Survey, Western Michigan University. The compilation is published as an Excel workbook containing header features including title, description, author, citation, originator, distributor, and resource URL links to scanned maps for download. The Excel workbook contains 6 worksheets, including information about the template, notes related to revisions of the template, resource provider information, the metadata, a field list (data mapping view) and vocabularies (data valid terms) used to populate the data worksheet. This metadata compilation was provided by the Michigan Geological Survey at Western Michigan University and made available for distribution through the National Geothermal Data System.
This shapefile contains points that describe the location of gas samples collected in Afghanistan and adjacent areas and the results of organic geochemical analysis.
ClaimsKG is a knowledge graph of metadata information for thousands of fact-checked claims which facilitates structured queries about their truth values, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia, and lifts all data to RDF using an RDF/S model that makes use of established vocabularies (such as schema.org).
ClaimsKG does NOT contain the text of the reviews from the fact-checking web sites; it only contains structured metadata information and links to the reviews.
More information, such as statistics, query examples and a user friendly interface to explore the knowledge graph, is available at: https://data.gesis.org/claimskg/site
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.
The XML file conforms to the published XSD schema created for descriptions of relational databases for specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.
The ZIP file contains an excerpt of the files hosted at https://plbd.org/ at the moment of submission of the PLBD database to the Scientific Data journal, and is provided to conform to the journal's policies. The current data and schemas should be fetched from the published URIs:
https://plbd.org/
https://plbd.org/doc/db/schemas
https://plbd.org/doc/xml/schemas
Software that is used to generate SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing the databases generated from the XML description on the Web, are available at public Subversion repositories:
svn://www.crystallography.net/solsa-database-scripts
svn://saulius-grazulis.lt/restfuldb
Unpacking the ZIP file will create the "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with the GNU Make package installed, the validity of the XML file can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:
unzip main-database-description.zip
cd db/release/v0.10.0/tables/
sh -x dependencies/Linuxmint-20.1/install.sh
make distclean
make
If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. As of the moment of writing, Debian-10 and Linuxmint-20.1 OSes are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts need to run the package installation command with system administrator privileges, but they use only the standard system package manager, so they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.
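The same validation can be sketched in Python with lxml; the file locations are assumptions based on the tree layout below:
from lxml import etree

# Assumed locations: the XSD files live under schema/, next to the XML description.
schema = etree.XMLSchema(etree.parse("schema/relational-database_v2.0.0-rc.18.xsd"))
doc = etree.parse("main-database-description.xml")

print("valid" if schema.validate(doc) else schema.error_log)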
The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked and should indicate the validity of the provided XML file against the referenced schemas.
db/
└── release
└── v0.10.0
└── tables
├── Makeconfig-validate-xml
├── Makefile
├── Makelocal-validate-xml
├── dependencies
├── main-database-description.xml
├── outputs
└── schema
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Apollo 16 Coarse Fines (4-10 mm): Sample Classification, Description, and Inventory by U. Marvin