CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This product consists of the XML schema and the JSON schema aligned with the Scholix Metadata Guidelines Version 4. It contains the .json and .xsd files together with examples of compatible metadata records. Changes from the previous update are backward compatible and include the following:
- The schema admits, for the field type (typology of source/target objects), terms of the following vocabulary: publications, datasets, software, other research types (version 3.0 included only literature and dataset).
- The schema includes a new optional field subtype, which holds the specific sub-type of the objects according to the OpenAIRE classification of publications, datasets, software, and other products (for more).
- The schema admits multiple entries for the field Identifier in both the source and target objects; this is to specify the list of PIDs resulting from deduplication on OpenAIRE (i.e. the same publication may have been collected from Crossref and from EuropePMC, thus including both PIDs).
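A minimal, non-authoritative sketch of how the bundled JSON schema could be used to check one of the example records in Python; the file names scholix-v4-schema.json and example-record.json are placeholders for the actual file names in this product:
import json
from jsonschema import validate  # third-party package: jsonschema

# Load the bundled JSON schema and one example record (file names are assumptions).
with open("scholix-v4-schema.json") as f:
    schema = json.load(f)
with open("example-record.json") as f:
    record = json.load(f)

# Raises jsonschema.ValidationError if the record is not compatible with the schema.
validate(instance=record, schema=schema)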
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset contains a preprint version of the conference paper (.pdf), presentation slides (.pptx), and the dataset(s) and validation schema(s) for the IDCC 2019 (Melbourne) conference paper: The Red Queen in the Repository: metadata quality in an ever-changing environment. Datasets and schemas are in .xml, .xsd, Excel (.xlsx) and .csv (two files representing two different sheets in the .xlsx file). The validationSchemas.zip holds the additional validation schemas (.xsd) that were not found in the schemaLocations of the metadata xml-files to be validated. The schemas must all be placed in the same folder, and are to be used for validating the Dataverse dcterms records (with metadataDCT.xsd) and the Zenodo oai_datacite feeds (with schema.datacite.org_oai_oai-1.0_oai.xsd), respectively. In the latter case, a simpler way of doing it might be to replace the incorrect URL "http://schema.datacite.org/oai/oai-1.0/ oai_datacite.xsd" in the schemaLocation of these xml-files with the correct one: schemaLocation="http://schema.datacite.org/oai/oai-1.0/ http://schema.datacite.org/oai/oai-1.0/oai.xsd", as has already been done in the sample files here. The sample file folders testDVNcoll.zip (Dataverse), testFigColl.zip (Figshare) and testZenColl.zip (Zenodo) contain all the metadata files tested and validated that are registered in the spreadsheet with objectIDs.
In the case of Zenodo, one original file feed,
zen2018oai_datacite3orig-https%20_zenodo.org_oai2d%20verb=ListRecords%26metadata
Prefix=oai_datacite%26from=2018-11-29%26until=2018-11-30.xml ,
is also supplied to show what was necessary to change in order to perform validation as indicated in the paper.
For Dataverse, a corrected version of a file,
dvn2014ddi-27595Corr_https%20_dataverse.harvard.edu_api_datasets_export%20
exporter=ddi%26persistentId=doi%253A10.7910_DVN_27595Corr.xml ,
is also supplied in order to show the changes it would take to make the file validate without error.
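A rough sketch of the validation step described above, using Python's lxml with all schemas placed in the working directory; the record file name below is a hypothetical placeholder:
from lxml import etree

# metadataDCT.xsd and its companion schemas are assumed to sit in the current folder.
schema = etree.XMLSchema(etree.parse("metadataDCT.xsd"))
record = etree.parse("example_dataverse_dcterms_record.xml")  # hypothetical file name

if not schema.validate(record):
    for error in schema.error_log:
        print(error.line, error.message)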
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.
Some top-level terms may contain a null value if the corresponding field was missing in the original metadata.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
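A minimal sketch (not part of the dataset) for streaming one of the gzipped JSON-lines files in Python; the date in the file name is an example placeholder following the export pattern above:
import gzip
import json

# Stream the records export line by line without decompressing it to disk.
with gzip.open("zenodo_open_metadata_2021-01-01.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record.get("spam"):
            continue  # skip records marked as spam by Zenodo staff
        print(record["recid"], record.get("title"))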
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is related to the manuscript "An empirical meta-analysis of the life sciences linked open data on the web" published at Nature Scientific Data. If you use the dataset, please cite the manuscript as follows: Kamdar, M.R., Musen, M.A. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 8, 24 (2021). https://doi.org/10.1038/s41597-021-00797-y
We have extracted schemas from more than 80 publicly available biomedical linked data graphs in the Life Sciences Linked Open Data (LSLOD) cloud into an LSLOD schema graph and conducted an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. The dataset published here contains the following files:
- The set of Linked Data Graphs from the LSLOD cloud from which schemas are extracted.
- Refined sets of extracted classes, object properties, data properties, and datatypes shared across the Linked Data Graphs on the LSLOD cloud. Where a schema element is reused from a Linked Open Vocabulary or an ontology, it is explicitly indicated.
- The LSLOD Schema Graph, which contains all the above extracted schema elements interlinked with each other based on the underlying content. Sample instances and sample assertions are also provided along with broad-level characteristics of the modeled content. The LSLOD Schema Graph is saved as a JSON Pickle file. To read the JSON object in this Pickle file, use the following Python commands:
import pickle
with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
    x = pickle.load(infile, encoding='iso-8859-1')
Check the Referenced Link for more details on this research, raw data files, and code references.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vocabulary to include sample code in a schema. Can work with XSLT (http://purl.org/net/ns/ns-schema.xsl) to present a schema as an XHTML list with examples.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary
GitTables 1M (https://gittables.github.io) is a corpus of, currently, 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license permitting distribution. We aim to grow this to at least 10M tables.
Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.
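A hedged sketch of how a single table and its attached metadata could be inspected with pyarrow; the file name is a placeholder, and the exact metadata key names should be taken from the GitTables documentation rather than from this example:
import pyarrow.parquet as pq

# Read one table from a topic subset (file name is an assumption).
table = pq.read_table("some_table.parquet")
print(table.num_rows, table.num_columns)

# Column annotations and other metadata are stored as key/value pairs
# in the parquet file metadata; keys and values are bytes.
metadata = table.schema.metadata or {}
for key, value in metadata.items():
    print(key, value[:120])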
We believe GitTables can facilitate many use-cases, among which:
Data integration, search and validation.
Data visualization and analysis recommendation.
Schema analysis and completion for e.g. database or knowledge base design.
If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).
Dataset contents
The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper: https://arxiv.org/abs/2106.07258v4.
In summary, this dataset can be characterized as follows:
Statistic | Value |
---|---|
# tables | 1M |
average # columns | 12 |
average # rows | 142 |
# annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
# unique semantic types | 835 (DBpedia), 677 (Schema.org) |
How to download
The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).
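A minimal, non-authoritative sketch of a full download via Zenodo's REST API with requests; the record id is a placeholder to be replaced by this dataset's Zenodo record id, and the response field names follow Zenodo's REST API at the time of writing:
import requests

RECORD_ID = "1234567"  # placeholder: replace with the GitTables Zenodo record id
record = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}").json()

# Download every zipped topic subset attached to the record.
for f in record.get("files", []):
    url = f["links"]["self"]
    name = f["key"]
    print("downloading", name)
    with requests.get(url, stream=True) as r, open(name, "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)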
Future releases
Future releases will include the following:
Increased number of tables (expected at least 10M)
Associated datasets
- GitTables benchmark - column type detection: https://zenodo.org/record/5706316
- GitTables 1M - CSV files: https://zenodo.org/record/6515973
ClaimsKG is a knowledge graph of metadata information for fact-checked claims scraped from popular fact-checking sites. In addition to providing a single dataset of claims and associated metadata, ClaimsKG harmonizes truth ratings and provides additional information for each claim, e.g., about mentioned entities. Please see https://data.gesis.org/claimskg/ for further details about the data model, query examples and statistics.
The dataset facilitates structured queries about claims, their truth values, involved entities, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia/Wikipedia, and lifts all data to RDF using established vocabularies (such as schema.org).
The latest release of ClaimsKG covers 74,066 claims and 72,127 claim reviews. This is the fourth release of the dataset: data was scraped up to Jan 31, 2023 and contains claims published between 1996 and 2023 from 13 fact-checking websites: Fullfact, Politifact, TruthOrFiction, Checkyourfact, Vishvanews, AFP (French), AFP, Polygraph, EU factcheck, Factograph, Fatabyyano, Snopes and Africacheck. The claim-review (fact-checking) period for claims ranges from 1996 to 2023. As in the previous release, the Entity Fishing python client (https://github.com/hirmeos/entity-fishing-client-python) has been used for entity linking and disambiguation. Improvements have been made in the web scraping and data preprocessing pipeline to extract more entities from both claims and claim reviews. Currently, ClaimsKG contains 3,408,386 entities detected and referenced with DBpedia.
This latest release of ClaimsKG supersedes the previous versions, as it contains all the claims from the previous versions together with new claims, as well as improved entity annotation resulting in a higher number of entities.
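As an illustrative, hedged example of the structured queries mentioned above, the knowledge graph can be queried over SPARQL with SPARQLWrapper; the endpoint URL and the use of the schema.org ClaimReview class are assumptions based on the documentation site linked above and should be checked there:
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL is an assumption; see https://data.gesis.org/claimskg/ for the current one.
sparql = SPARQLWrapper("https://data.gesis.org/claimskg/sparql")
sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT (COUNT(?review) AS ?reviews) WHERE {
  ?review a schema:ClaimReview .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results["results"]["bindings"][0]["reviews"]["value"])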
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the Yelp Collaborative Knowledge Graph (YCKG), a transformation of the Yelp Open Dataset into RDF format using Y2KG.
Paper Abstract
The Yelp Open Dataset (YOD) contains data about businesses, reviews, and users from the Yelp website and is available for research purposes. This dataset has been widely used to develop and test Recommender Systems (RS), especially those using Knowledge Graphs (KGs), e.g., integrating taxonomies, product categories, business locations, and social network information. Unfortunately, researchers applied naive or wrong mappings while converting YOD in KGs, consequently obtaining unrealistic results. Among the various issues, the conversion processes usually do not follow state-of-the-art methodologies, fail to properly link to other KGs and reuse existing vocabularies. In this work, we overcome these issues by introducing Y2KG, a utility to convert the Yelp dataset into a KG. Y2KG consists of two components. The first is a dataset including (1) a vocabulary that extends Schema.org with properties to describe the concepts in YOD and (2) mappings between the Yelp entities and Wikidata. The second component is a set of scripts to transform YOD in RDF and obtain the Yelp Collaborative Knowledge Graph (YCKG). The design of Y2KG was driven by 16 core competency questions. YCKG includes 150k businesses and 16.9M reviews from 1.9M distinct real users, resulting in over 244 million triples (with 144 distinct predicates) for about 72 million resources, with an average in-degree and out-degree of 3.3 and 12.2, respectively.
Links
Latest GitHub release: https://github.com/MadsCorfixen/The-Yelp-Collaborative-Knowledge-Graph/releases/latest
PURL domain: https://purl.archive.org/domain/yckg
Files
yelp_schema_mappings.nt.gz: contains the mappings from Yelp categories to Schema things.
schema_hierarchy.nt.gz: contains the full hierarchy of the mapped Schema things.
yelp_wiki_mappings.nt.gz: contains the mappings from Yelp categories to Wikidata entities.
wikidata_location_mappings.nt.gz: contains the mappings from Yelp locations to Wikidata entities.
yelp_categories.ttl: contains metadata for all Yelp categories.
yelp_entities.ttl: contains metadata regarding the dataset.
yelp_vocabulary.ttl: contains metadata on the created Yelp vocabulary and properties.
yelp_category_schema_mappings.csv: contains the 310 mappings from Yelp categories to Schema types. These mappings have been manually verified to be correct.
yelp_predicate_schema_mappings.csv: contains the 14 mappings from Yelp attributes to Schema properties. These mappings were found manually.
ground_truth_yelp_category_schema_mappings.csv: contains the ground truth, based on 200 manually verified mappings from Yelp categories to Schema things. The ground truth mappings were used to calculate precision and recall for the semantic mappings.
manually_split_categories.csv: contains all Yelp categories containing either a & or /, and their manually split versions. The split versions have been used in the semantic mappings to Schema things.
This dataset is provided as an example of XML metadata that can be used to create records in ServCat for GIS datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]).
DataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a gzip-compressed tar data archive of 837.3 MB.
The archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve.
To facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files. These files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f. This file can also be found in the data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data. For more information about concepts and format, please see [2].
To extract all EML files from the included Preston archive, first extract the hashes associated with EML files using:
cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep "hash://" | sort | uniq > eml-hashes.txt
extract data.tar.gz using:
~/preston-archive$ tar xzf data.tar.gz
then use Preston to extract each hash using something like:
~/preston-archive$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa
Alternatively, without using Preston, you can extract the data using the naming convention:
data/[x]/[y]/[z]/[hash]/data
where x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.
For example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].
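The same naming convention can be expressed as a small helper (a sketch, not part of the archive):
# Derive the archive path for a given sha256 content hash using the convention above.
def archive_path(content_hash: str) -> str:
    h = content_hash.split("/")[-1]  # strip the hash://sha256/ prefix
    return f"data/{h[0:2]}/{h[2:4]}/{h[4:6]}/{h}/data"

print(archive_path("hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa"))
# prints: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data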
The intended use of this archive is to facilitate meta-analysis of the DataONE dataset network.
[1] DataONE, https://www.dataone.org
[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543. DataONE was crawled via Preston with "preston update -u https://dataone.org".
[3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
[4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep "hash://" | sort | uniq | wc -l
[5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains, for each included article, the text of the article and also all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected good articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits.
You can find more details in "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
where pageN is the title of the N-th Wikipedia page and contains all information about that page: text.json holds the text of the page, the img folder holds the page's images, and meta.json holds metadata for those images. Images are saved in jpg format with the width of each image set to 600px; the name of each image is the md5 hashcode of the original image title. Below you see an example of how data is stored in text.json:
{
"title": "Naval Battle of Guadalcanal",
"id": 405411,
"url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
"text": "The Naval Battle of Guadalcanal, sometimes referred to as... ",
}
where title is the page title, id is the unique page id, url is the url of the page on Wikipedia, and text is the plain text content of the article. Below is an example of how data is stored in meta.json:
{
"img_meta": [
{
"filename": "d681a3776d93663fc2788e7e469b27d7.jpg",
"title": "Metallica Damaged Justice Tour.jpg",
"description": "Metallica en concert",
"url": "https://en.wikipedia.org/wiki/File%3AMetallica_Damaged_Justice_Tour.jpg",
"features": [123.23, 10.21, ..., 24.17],
},
]
}
where filename is the unique image id (the md5 hashcode of the original image title), title and description are the image title and description, url is the url of the image on Wikipedia, and features is an image feature vector computed from the image downloaded in jpeg format with a fixed width of 600px; practically, it is a list of floats with len = 2048. Entries with ORIGINAL in the filename lack the features key. Raw images are available in the complete version of the dataset on Google Drive. Data was collected by fetching the articles' text & image content with the pywikibot library.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From 28 to 30 October 2024, ZB MED Information Centre for Life Sciences organized a hackathon in Cologne within the scope of NFDI4DataScience with the purpose of identifying core elements for a machine-actionable Data Management Plan (maDMP) across the German National Research Data Infrastructure (NFDI). We used the maDMP application profile from the Research Data Alliance DMP Commons Working Group as a starting/reference point. An additional element, ManagementPlan, was also included; ManagementPlan originally comes from DataCite v4.4, and its representation as an extension, proposed by the ZB MED/NFDI4DataScience machine-actionable Software Management Plan metadata schema, was used.
This dataset comprises crosswalks corresponding to the following elements from the RDA maDMP: DMP, Project, Dataset, Distribution. The crosswalks were created following a template prepared by the organizers. Participants freely selected a metadata schema or platform to be mapped against the attributes in the RDA maDMP application profile. Crosswalks were created for DMP, Project, Dataset, and Distribution, while comments were provided for DMP and Dataset.
The list of files in this dataset is as follows:
More information about the activities carried out during the hackathon and the analysis of the crosswalks will be available soon.
The activities and discussion reported here were carried out during a hackathon organized by the Semantic Technologies team at ZB MED Information Centre from 28 to 30 October 2024 in Cologne, Germany, and sponsored by the NFDI4DataScience consortium. NFDI4DataScience is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft – DFG) under grant No. 460234259.
The DMP4NFDI team acknowledges the support of DFG - German Research Foundation - through the coordination fund (project number 521466146).
David Wallace and Jürgen Windeck would like to thank the Federal Government and the Heads of Government of the Länder, as well as the Joint Science Conference (GWK), for their funding and support within the framework of the NFDI4ING consortium. Funded by the German Research Foundation (DFG) - project number 442146713.
Marco Reidelbach is supported by MaRDI, funded by the Deutsche Forschungsgemeinschaft (DFG), project number 460135501, NFDI 29/1 “MaRDI – Mathematische Forschungsdateninitiative”.
Dataset
As reference dataset, 2,218 public draft or complete genome assemblies and available metadata of Escherichia coli have been downloaded from EnteroBase in April 2017. Genomes have been selected on the basis of the ribosomal ST (rST) classification available in EnteroBase: from the same rST, genomes have been randomly selected and downloaded. The number of samples for each rST in the final dataset is proportional to those available in EnteroBase in April 2017. The dataset also includes 119 Shiga toxin-producing E. coli genomes assembled with INNUca v3.1 belonging to the INNUENDO Sequence Dataset (PRJEB27020). File 'Metadata/Ecoli_metadata.txt' contains metadata for each strain, including source classification, taxa of the hosts, country and year of isolation, serotype, pathotype, classical pubMLST 7-gene ST classification, assembly source/method and EnteroBase barcode. The directory 'Genomes' contains the 119 INNUca v3.1 assemblies of the strains listed in 'Metadata/Ecoli_metadata.txt'. EnteroBase assemblies can be downloaded from http://enterobase.warwick.ac.uk/species/ecoli/search_strains using 'barcode'.
Schema creation and validation
The wgMLST schema from EnteroBase has been downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci has been assessed using chewBBACA Schema Evaluation, and loci with single alleles, those with high length variability (i.e. if more than 1 allele is outside the mode +/- 0.05 size) and those present in less than 0.5% of the Escherichia genomes in EnteroBase at the date of the analysis (April 2017) have been removed. The wgMLST schema has been further curated, excluding all loci detected as "Repeated Loci" and loci annotated as "non-informative paralogous hit (NIPH/NIPHEM)" or "Allele Larger/Smaller than length mode (ALM/ASM)" by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 2,337 Escherichia coli genomes.
File 'Schema/Ecoli_wgMLST_7601_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 7,601 loci. File 'Schema/Ecoli_cgMLST_2360_listGenes.txt' contains the list of genes from the wgMLST schema which defines the cgMLST schema. The cgMLST schema consists of 2,360 loci and has been defined as the loci present in at least 99% of the 2,337 Escherichia coli genomes; these genomes have no more than 2% of missing loci. File 'Allele_Profles/Ecoli_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profiles of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci follow the annotation of the chewBBACA Allele Calling software. File 'Allele_Profles/Ecoli_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profiles of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci are indicated with a zero.
Additional citations
The schemas are prepared to be used with chewBBACA. When using the schema in this repository, please also cite: Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166
The Escherichia coli schema is a derivation of the EnteroBase E. coli wgMLST schema. When using the schema in this repository, please also cite: Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14(4):e1007261. https://doi.org/10.1371/journal.pgen.1007261
The isolates' genome raw sequence data produced within the activity of the INNUENDO project were submitted to the European Nucleotide Archive (ENA) database and are publicly available under the project accession number PRJEB27020. When using the schemas, the assemblies or the allele profiles, please include the project number in your publication. The research from the INNUENDO project has received funding from the European Food Safety Authority (EFSA), grant agreement GP/EFSA/AFSCO/2015/01/CT2 (New approaches in identifying and characterizing microbial and chemical hazards), and from the Government of the Basque Country. The conclusions, findings, and opinions expressed in this repository reflect only the view of the INNUENDO consortium members and not the official position of EFSA nor of the Government of the Basque Country. EFSA and the Government of the Basque Country are not responsible for any use that may be made of the information included in this repository. The INNUENDO consortium thanks the Austrian Agency for Health and Food Safety Limited for participating in the project by providing strains. The consortium thanks all the researchers and the authorities worldwide which are contributing by submitting the raw sequences of the bacterial strains in public repositories. The project wa...
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains, for each included article, the text of the article and also all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus they represent the best quality that human editors on Wikipedia can offer.
You can find more details in "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
label | description |
---|---|
pageN | is the title of N-th Wikipedia page and contains all information about the page |
text.json | text of the page saved as JSON. Please refer to the details of JSON schema below. |
meta.json | a collection of all images of the page. Please refer to the details of JSON schema below. |
imageN | is the N-th image of an article, saved in jpg format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title. |
Below you see an example of how data is stored:
{
"title": "Naval Battle of Guadalcanal",
"id": 405411,
"url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
"html": "...
...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }
key | description |
---|---|
title | page title |
id | unique page id |
url | url of a page on Wikipedia |
html | HTML content of the article |
wikitext | wikitext content of the article |
Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
{
"img_meta": [
{
"filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
"title": "IronbottomSound.jpg",
"parsed_title": "ironbottom sound",
"url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
"is_icon": False,
"on_commons": True,
"description": "A U.S. destroyer steams up what later became known as ...",
"caption": "Ironbottom Sound. The majority of the warship surface ...",
"headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
"features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
},
...
]
}
key | description |
---|---|
filename | unique image id, md5 hashcode of original image title |
title | image title retrieved from Commons, if applicable |
parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
url | url of an image on Wikipedia |
is_icon | True if image is an icon, e.g. category icon. We assume that image is an icon if you cannot load a preview on Wikipedia after clicking on it |
on_commons | True if image is available from Wikimedia Commons dataset |
description | description of an image parsed from Wikimedia Commons page, if available |
caption | caption of an image parsed from Wikipedia article, if available |
headings | list of all nested headings of location where article is placed in Wikipedia article. The first element is top-most heading |
features | output of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape (2048,). Features taken from original images downloaded in jpeg format with fixed width of 600px. Practically, it is a list of floats with len = 2048 |
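A minimal sketch (assuming one folder per page title and valid JSON files, as described above) for reading a single page in Python:
import json
import os

page_dir = "Naval Battle of Guadalcanal"  # assumed: folder named after the page title

with open(os.path.join(page_dir, "text.json"), encoding="utf-8") as f:
    page = json.load(f)
with open(os.path.join(page_dir, "meta.json"), encoding="utf-8") as f:
    meta = json.load(f)

print(page["title"], page["url"])
for img in meta["img_meta"]:
    # features is a 2048-dimensional vector; icons are flagged with is_icon
    print(img["filename"], img.get("is_icon"), len(img.get("features", [])))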
Data was collected by fetching featured articles text&image content with pywikibot library and then parsing out a lot of additional metadata from HTML pages from Wikipedia and Commons.
This dataset contains a list of sales and movement data by item and department, appended monthly. Update frequency: monthly.
This resource is a metadata compilation for Michigan geothermal related data in data exchange models submitted to the AASG National Geothermal Data System project to fulfill Year 3 data deliverables by the Michigan Geological Survey, Western Michigan University. Descriptions, links, and contact information for the ESRI Map Services created with Michigan data are also available here, including borehole temperature data, drill stem test data, lithology interval data, heat pump installations, physical samples, and well header data for the state of Michigan. The data and associated services were provided by the Michigan Geological Survey, Western Michigan University. The compilation is published as an Excel workbook containing header features including title, description, author, citation, originator, distributor, and resource URL links to scanned maps for download. The Excel workbook contains 6 worksheets, including information about the template, notes related to revisions of the template, resource provider information, the metadata, a field list (data mapping view) and vocabularies (data valid terms) used to populate the data worksheet. This metadata compilation was provided by the Michigan Geological Survey at Western Michigan University and made available for distribution through the National Geothermal Data System.
This shapefile contains points that describe the location of gas samples collected in Afghanistan and adjacent areas and the results of organic geochemical analysis.
ClaimsKG is a knowledge graph of metadata information for thousands of fact-checked claims which facilitates structured queries about their truth values, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia, and lifts all data to RDF using an RDF/S model that makes use of established vocabularies (such as schema.org).
ClaimsKG does NOT contain the text of the reviews from the fact-checking web sites; it only contains structured metadata information and links to the reviews.
More information, such as statistics, query examples and a user friendly interface to explore the knowledge graph, is available at: https://data.gesis.org/claimskg/site
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.
The XML file conforms to the published XSD schema created for descriptions of relational databases for specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.
The ZIP file contains an excerpt of the files hosted at https://plbd.org/ at the moment of submission of the PLBD database to the Scientific Data journal, and is provided to conform to the journal's policies. The current data and schemas should be fetched from the published URIs:
https://plbd.org/
https://plbd.org/doc/db/schemas
https://plbd.org/doc/xml/schemas
Software that is used to generate SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing the databases generated from the XML description on the Web, are available at public Subversion repositories:
svn://www.crystallography.net/solsa-database-scripts
svn://saulius-grazulis.lt/restfuldb
Unpacking the ZIP file will create the "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with the GNU Make package installed, the validity of the XML file can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:
unzip main-database-description.zip
cd db/release/v0.10.0/tables/
sh -x dependencies/Linuxmint-20.1/install.sh
make distclean
make
If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. As of the moment of writing, Debian-10 and Linuxmint-20.1 OSes are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts need to run the package installation command with system administrator privileges, but they use only the standard system package manager, so they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.
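The same validation can be sketched in Python with lxml; the file locations are assumptions based on the tree layout below:
from lxml import etree

# Assumed locations: the XSD files live under schema/, next to the XML description.
schema = etree.XMLSchema(etree.parse("schema/relational-database_v2.0.0-rc.18.xsd"))
doc = etree.parse("main-database-description.xml")

print("valid" if schema.validate(doc) else schema.error_log)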
The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked and should indicate the validity of the provided XML file against the referenced schemas.
db/
└── release
└── v0.10.0
└── tables
├── Makeconfig-validate-xml
├── Makefile
├── Makelocal-validate-xml
├── dependencies
├── main-database-description.xml
├── outputs
└── schema
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Apollo 16 Coarse Fines (4-10 mm): Sample Classification, Description, and Inventory by U. Marvin