38 datasets found
  1. The Scholix Metadata JSON Schema

    • gimi9.com
    Cite
    The Scholix Metadata JSON Schema [Dataset]. https://gimi9.com/dataset/eu_oai-zenodo-org-6351557
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This product consists of the XML schema and the JSON schema aligned with the Scholix Metadata Guidelines Version 4. It contains the .json and .xsd files together with examples of compatible metadata records. Changes from the previous update are backward compatible and include the following:

    • The schema admits, for the field type (typology of source/target objects), terms of the following vocabulary: publications, datasets, software, other research types (version 3.0 included only literature and dataset).
    • The schema includes a new optional field subtype, which holds the specific sub-type of the objects, according to the OpenAIRE classification of publications, datasets, software, and other products (for more).
    • The schema admits multiple entries for the field Identifier in both the source and target objects; this is to specify the list of PIDs resulting from the deduplication on OpenAIRE (i.e. the same publication may have been collected from Crossref and from EuropePMC, thus including both PIDs).
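
    As a rough illustration of how the bundled JSON schema might be used, the following Python sketch validates one of the example records against it with the jsonschema package; both file names are illustrative placeholders, not names taken from the product.

    import json
    from jsonschema import validate, ValidationError

    # Illustrative file names; substitute the actual schema and example record shipped with this product.
    with open("scholix-metadata-v4.schema.json", encoding="utf-8") as fh:
        schema = json.load(fh)
    with open("example-link-record.json", encoding="utf-8") as fh:
        record = json.load(fh)

    try:
        validate(instance=record, schema=schema)  # raises ValidationError if the record does not conform
        print("record conforms to the Scholix metadata schema")
    except ValidationError as err:
        print("validation failed:", err.message)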

  2. The Red Queen in the Repository: metadata quality in an ever-changing...

    • zenodo.org
    • researchdata.se
    bin, csv, zip
    Updated Jul 25, 2024
    Cite
    Joakim Philipson; Joakim Philipson (2024). The Red Queen in the Repository: metadata quality in an ever-changing environment (preprint of paper, presentation slides and dataset collection with validation schemas to IDCC2019 conference paper) [Dataset]. http://doi.org/10.5281/zenodo.2276777
    Explore at:
    Available download formats: zip, bin, csv
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joakim Philipson; Joakim Philipson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This fileset contains a preprint version of the conference paper (.pdf), presentation slides (.pptx), and the dataset(s) and validation schema(s) for the IDCC 2019 (Melbourne) conference paper "The Red Queen in the Repository: metadata quality in an ever-changing environment". Datasets and schemas are in .xml, .xsd, Excel (.xlsx) and .csv formats (two .csv files representing two different sheets in the .xlsx file). The validationSchemas.zip holds the additional validation schemas (.xsd) that were not found in the schemaLocations of the metadata XML files to be validated. The schemas must all be placed in the same folder and are to be used for validating the Dataverse dcterms records (with metadataDCT.xsd) and the Zenodo oai_datacite feeds (schema.datacite.org_oai_oai-1.0_oai.xsd), respectively. In the latter case, a simpler way is to replace the incorrect URL "http://schema.datacite.org/oai/oai-1.0/ oai_datacite.xsd" in the schemaLocation of these XML files with the correct schemaLocation="http://schema.datacite.org/oai/oai-1.0/ http://schema.datacite.org/oai/oai-1.0/oai.xsd", as has already been done in the sample files here. The sample file folders testDVNcoll.zip (Dataverse), testFigColl.zip (Figshare) and testZenColl.zip (Zenodo) contain all the metadata files tested and validated that are registered in the spreadsheet with objectIDs.

    In the case of Zenodo, one original file feed, zen2018oai_datacite3orig-https%20_zenodo.org_oai2d%20verb=ListRecords%26metadataPrefix=oai_datacite%26from=2018-11-29%26until=2018-11-30.xml, is also supplied to show what was necessary to change in order to perform validation as indicated in the paper.

    For Dataverse, a corrected version of a file, dvn2014ddi-27595Corr_https%20_dataverse.harvard.edu_api_datasets_export%20exporter=ddi%26persistentId=doi%253A10.7910_DVN_27595Corr.xml, is also supplied in order to show the changes it would take to make the file validate without error.
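
    As a minimal sketch of the validation step described above (Python with lxml; the record file name is an illustrative placeholder), a metadata XML file can be checked against one of the supplied .xsd schemas like this:

    import lxml.etree as ET

    schema = ET.XMLSchema(ET.parse("metadataDCT.xsd"))   # validation schema from validationSchemas.zip
    record = ET.parse("example_dataverse_record.xml")    # illustrative name for a metadata record to check
    if schema.validate(record):
        print("record is valid")
    else:
        for error in schema.error_log:
            print(error.line, error.message)             # report each validation error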

  3. Zenodo Open Metadata snapshot - Training dataset for records and communities...

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 15, 2022
    + more versions
    Cite
    Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_787062
    Explore at:
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zenodo team
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.

    The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.

    Records dataset

    Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

    which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

    In addition, some terms have been altered:

    The term files contains a list of dictionaries containing filetype, size, and filename only.

    The term license contains a short Zenodo ID of the license (e.g. "cc-by").

    Communities dataset

    Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: id, title, description, curation_policy, page

    which correspond to the fields with the same name available in Zenodo's community creation form.

    Notes for all datasets

    For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.

    Top-level terms whose values were missing in the metadata may contain a null value.

    A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
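
    A minimal sketch for iterating over one of the gzipped JSON-lines dumps in Python (the file name below is illustrative; use the actual export date):

    import gzip
    import json

    with gzip.open("zenodo_open_metadata_2022-12-15.jsonl.gz", "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)      # one Zenodo record per line
            if record.get("spam"):         # skip entries flagged as spam by Zenodo staff
                continue
            print(record.get("doi"), record.get("title"))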

  4. Extracted Schemas from the Life Sciences Linked Open Data Cloud

    • figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Maulik Kamdar (2023). Extracted Schemas from the Life Sciences Linked Open Data Cloud [Dataset]. http://doi.org/10.6084/m9.figshare.12402425.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Maulik Kamdar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is related to the manuscript "An empirical meta-analysis of the life sciences linked open data on the web" published in Nature Scientific Data. If you use the dataset, please cite the manuscript as follows: Kamdar, M.R., Musen, M.A. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 8, 24 (2021). https://doi.org/10.1038/s41597-021-00797-y

    We have extracted schemas from more than 80 publicly available biomedical linked data graphs in the Life Sciences Linked Open Data (LSLOD) cloud into an LSLOD schema graph and conducted an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. The dataset published here contains the following files:

    • The set of Linked Data Graphs from the LSLOD cloud from which schemas are extracted.
    • Refined sets of extracted classes, object properties, data properties, and datatypes shared across the Linked Data Graphs in the LSLOD cloud. Where a schema element is reused from a Linked Open Vocabulary or an ontology, this is explicitly indicated.
    • The LSLOD Schema Graph, which contains all the above extracted schema elements interlinked with each other based on the underlying content. Sample instances and sample assertions are also provided, along with broad-level characteristics of the modeled content. The LSLOD Schema Graph is saved as a JSON Pickle file; it can be read in Python as shown in the snippet below.

    Check the Referenced Link for more details on this research, raw data files, and code references.
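
    A runnable version of the loading step (the import and the final print are the only additions to the command given in the original description):

    import pickle

    # Load the LSLOD Schema Graph from the JSON Pickle file
    with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
        x = pickle.load(infile, encoding='iso-8859-1')

    print(type(x))  # inspect the loaded JSON object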

  5. Example vocabulary

    • liveschema.eu
    csv, rdf, ttl
    Updated Dec 17, 2020
    Cite
    Linked Open Vocabulary (2020). Example vocabulary [Dataset]. http://liveschema.eu/dataset/lov_ex
    Explore at:
    Available download formats: csv, ttl, rdf
    Dataset updated
    Dec 17, 2020
    Dataset provided by
    Linked Open Vocabulary
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vocabulary for including sample code in a schema. Can work with XSLT (http://purl.org/net/ns/ns-schema.xsl) to present a schema as an XHTML list with examples.

  6. GitTables 1M

    • zenodo.org
    zip
    Updated May 10, 2022
    Cite
    Madelon Hulsebos; Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Çağatay Demiralp; Paul Groth (2022). GitTables 1M [Dataset]. http://doi.org/10.5281/zenodo.6517052
    Explore at:
    Available download formats: zip
    Dataset updated
    May 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Madelon Hulsebos; Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Çağatay Demiralp; Paul Groth
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Summary

    GitTables 1M (https://gittables.github.io) is a corpus of, currently, 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license allowing distribution. We aim to grow this to at least 10M tables.

    Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.

    We believe GitTables can facilitate many use-cases, among which:

    • Data integration, search and validation.

    • Data visualization and analysis recommendation.

    • Schema analysis and completion for e.g. database or knowledge base design.

    If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).

    Dataset contents

    The data is provided in subsets of tables stored in parquet files, each subset corresponds to a term that was used to query GitHub with. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper https://arxiv.org/abs/2106.07258v4.
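
    A minimal sketch (Python with pyarrow; the file path is an illustrative placeholder) for opening one table from a downloaded subset and inspecting the metadata attached to the parquet file:

    import pyarrow.parquet as pq

    table = pq.read_table("topic_subset/some_table.parquet")  # illustrative path inside an unzipped subset
    print(table.num_rows, "rows,", table.num_columns, "columns")

    # Table-level metadata: the column annotations, source URL and repository license are attached
    # here as key/value byte strings (see the GitTables documentation for the exact keys).
    for key, value in (table.schema.metadata or {}).items():
        print(key.decode(), value[:80])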

    In summary, this dataset can be characterized as follows:

    • # tables: 1M
    • average # columns: 12
    • average # rows: 142
    • # annotated tables (at least 1 column annotation): 723K+ (DBpedia), 738K+ (Schema.org)
    • # unique semantic types: 835 (DBpedia), 677 (Schema.org)

    How to download

    The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).

    Future releases

    Future releases will include the following:

    • Increased number of tables (expected at least 10M)

    Associated datasets

    - GitTables benchmark - column type detection: https://zenodo.org/record/5706316

    - GitTables 1M - CSV files: https://zenodo.org/record/6515973

  7. ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023)

    • datacatalogue.cessda.eu
    • search.gesis.org
    • +1more
    Updated Oct 17, 2023
    + more versions
    Cite
    Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan (2023). ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023) [Dataset]. http://doi.org/10.7802/2620
    Explore at:
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    GESIS - Leibniz-Institut für Sozialwissenschaften
    LIRMM / University of Montpellier
    LGI2P / IMT Mines Ales / University of Montpellier
    GESIS - Leibniz-Institut für Sozialwissenschaften & Heinrich-Heine-University Düsseldorf
    Institute of Computer Science, FORTH-ICS
    Authors
    Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan
    Measurement technique
    Web scraping
    Description

    ClaimsKG is a knowledge graph of metadata information for fact-checked claims scraped from popular fact-checking sites. In addition to providing a single dataset of claims and associated metadata, truth ratings are harmonized and additional information is provided for each claim, e.g., about mentioned entities. Please see ( https://data.gesis.org/claimskg/ ) for further details about the data model, query examples and statistics.

    The dataset facilitates structured queries about claims, their truth values, involved entities, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia/Wikipedia, and lifts all data to RDF using established vocabularies (such as schema.org).
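
    As an illustration of such a structured query (Python with SPARQLWrapper), the sketch below counts claim reviews per normalized truth rating. The endpoint URL and the exact class/property names are assumptions based on the schema.org vocabulary mentioned above; consult https://data.gesis.org/claimskg/ for the authoritative data model and endpoint.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Endpoint and vocabulary below are assumptions; see the ClaimsKG site for the real ones.
    sparql = SPARQLWrapper("https://data.gesis.org/claimskg/sparql")
    sparql.setQuery("""
        PREFIX schema: <http://schema.org/>
        SELECT ?rating (COUNT(?review) AS ?n) WHERE {
          ?review a schema:ClaimReview ;
                  schema:reviewRating ?ratingNode .
          ?ratingNode schema:alternateName ?rating .
        }
        GROUP BY ?rating
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["rating"]["value"], row["n"]["value"])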

    The latest release of ClaimsKG covers 74066 claims and 72127 claim reviews. This is the fourth release of the dataset; data was scraped up to Jan 31, 2023, and the release contains claims published between 1996 and 2023 from 13 fact-checking websites: Fullfact, Politifact, TruthOrFiction, Checkyourfact, Vishvanews, AFP (French), AFP, Polygraph, EU factcheck, Factograph, Fatabyyano, Snopes and Africacheck. The claim-review (fact-checking) period for claims ranges from 1996 to 2023. As in the previous release, the Entity Fishing Python client (https://github.com/hirmeos/entity-fishing-client-python) has been used for entity linking and disambiguation. Improvements have been made in the web scraping and data preprocessing pipeline to extract more entities from both claims and claim reviews. Currently, ClaimsKG contains 3408386 entities detected and referenced with DBpedia.

    This latest release of ClaimsKG supersedes the previous versions: it contains all the claims from the previous versions together with additional new claims, as well as improved entity annotation resulting in a higher number of entities.

  8. The Yelp Collaborative Knowledge Graph

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Jun 17, 2023
    Cite
    Mads Corfixen; Magnus Olesen; Thomas Heede; Christian Filip Pinderup Nielsen; Mads Corfixen; Magnus Olesen; Thomas Heede; Christian Filip Pinderup Nielsen (2023). The Yelp Collaborative Knowledge Graph [Dataset]. http://doi.org/10.5281/zenodo.8049832
    Explore at:
    Available download formats: application/gzip, csv, bin
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mads Corfixen; Magnus Olesen; Thomas Heede; Christian Filip Pinderup Nielsen; Mads Corfixen; Magnus Olesen; Thomas Heede; Christian Filip Pinderup Nielsen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the Yelp Collaborative Knowledge Graph (YCKG) - a transformation of the Yelp Open Dataset into RDF format using Y2KG.

    Paper Abstract

    The Yelp Open Dataset (YOD) contains data about businesses, reviews, and users from the Yelp website and is available for research purposes. This dataset has been widely used to develop and test Recommender Systems (RS), especially those using Knowledge Graphs (KGs), e.g., integrating taxonomies, product categories, business locations, and social network information. Unfortunately, researchers applied naive or wrong mappings while converting YOD in KGs, consequently obtaining unrealistic results. Among the various issues, the conversion processes usually do not follow state-of-the-art methodologies, fail to properly link to other KGs and reuse existing vocabularies. In this work, we overcome these issues by introducing Y2KG, a utility to convert the Yelp dataset into a KG. Y2KG consists of two components. The first is a dataset including (1) a vocabulary that extends Schema.org with properties to describe the concepts in YOD and (2) mappings between the Yelp entities and Wikidata. The second component is a set of scripts to transform YOD in RDF and obtain the Yelp Collaborative Knowledge Graph (YCKG). The design of Y2KG was driven by 16 core competency questions. YCKG includes 150k businesses and 16.9M reviews from 1.9M distinct real users, resulting in over 244 million triples (with 144 distinct predicates) for about 72 million resources, with an average in-degree and out-degree of 3.3 and 12.2, respectively.

    Links

    Latest GitHub release: https://github.com/MadsCorfixen/The-Yelp-Collaborative-Knowledge-Graph/releases/latest

    PURL domain: https://purl.archive.org/domain/yckg

    Files

    • Graph Data Triple Files
      • One sample file for each of the Yelp domains (Businesses, Users, Reviews, Tips and Checkins), each containing 20 entities.
      • yelp_schema_mappings.nt.gz containing the mappings from Yelp categories to Schema things.
      • schema_hierarchy.nt.gz containing the full hierarchy of the mapped Schema things.
      • yelp_wiki_mappings.nt.gz containing the mappings from Yelp categories to Wikidata entities.
      • wikidata_location_mappings.nt.gz containing the mappings from Yelp locations to Wikidata entities.
    • Graph Metadata Triple Files
      • yelp_categories.ttl contains metadata for all Yelp categories.
      • yelp_entities.ttl contains metadata regarding the dataset
      • yelp_vocabulary.ttl contains metadata on the created Yelp vocabulary and properties.
    • Utility Files
      • yelp_category_schema_mappings.csv. This file contains the 310 mappings from Yelp categories to Schema types. These mappings have been manually verified to be correct.
      • yelp_predicate_schema_mappings.csv. This file contains the 14 mappings from Yelp attributes to Schema properties. These mappings are manually found.
      • ground_truth_yelp_category_schema_mappings.csv. This file contains the ground truth, based on 200 manually verified mappings from Yelp categories to Schema things. The ground truth mappings were used to calculate precision and recall for the semantic mappings.
      • manually_split_categories.csv. This file contains all Yelp categories containing either a & or /, and their manually split versions. The split versions have been used in the semantic mappings to Schema things.
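
    A minimal sketch (Python with rdflib) for loading one of the gzipped N-Triples files listed above and counting its triples; the choice of file is arbitrary and everything else is illustrative:

    import gzip
    from rdflib import Graph

    g = Graph()
    with gzip.open("yelp_schema_mappings.nt.gz", "rb") as fh:
        g.parse(fh, format="nt")           # parse the decompressed N-Triples stream
    print(len(g), "category-to-Schema mapping triples loaded")
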
  9. XML Metadata Template for the U.S. Fish and Wildlife Service

    • data.wu.ac.at
    • datadiscoverystudio.org
    • +1more
    xml
    Updated Jul 25, 2012
    Cite
    Department of the Interior (2012). XML Metadata Template for the U.S. Fish and Wildlife Service [Dataset]. https://data.wu.ac.at/schema/data_gov/YzZhYzFkMDctZDhmOS00YjRjLTg4MDAtMjUyYzNiMjhkZWEy
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 25, 2012
    Dataset provided by
    United States Department of the Interior (http://www.doi.gov/)
    Description

    This dataset is provided as an example of XML metadata that can be used to create records in ServCat for GIS datasets.

  10. Data from: A biodiversity dataset graph: DataONE

    • zenodo.org
    application/gzip, bin +2
    Updated Jun 2, 2023
    Cite
    Jorrit H. Poelen; Jorrit H. Poelen (2023). A biodiversity dataset graph: DataONE [Dataset]. http://doi.org/10.5281/zenodo.1486279
    Explore at:
    Available download formats: bin, application/gzip, bz2, tsv
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jorrit H. Poelen; Jorrit H. Poelen
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]).

    DataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a gzip-compressed tar archive (data.tar.gz) of 837.3 MB.

    The archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve.

    To facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files. These files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f . This file can also be found in the data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data . For more information about concepts and format, please see [2].

    To extract all EML files from the included Preston archive, first extract the hashes associated with EML files using:

    cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep "hash://" | sort | uniq > eml-hashes.txt

    extract data.tar.gz using:

    ~/preston-archive$ tar xzf data.tar.gz

    then use Preston to extract each hash using something like:

    ~/preston-archive$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa

    Alternatively, without using Preston, you can extract the data using the naming convention:

    data/[x]/[y]/[z]/[hash]/data

    where x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.

    For example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].
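
    The naming convention above can be captured in a small helper; a minimal sketch in Python (the function name is illustrative, not part of Preston):

    def hash_to_path(content_hash):
        """Map a hash://sha256/... identifier to its location in the extracted archive."""
        sha = content_hash.rsplit("/", 1)[-1]   # keep only the hex digest
        return "data/{}/{}/{}/{}/data".format(sha[0:2], sha[2:4], sha[4:6], sha)

    print(hash_to_path("hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa"))
    # -> data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data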

    The intended use of this archive is to facilitate meta-analysis of the DataONE dataset network.

    [1] DataONE, https://www.dataone.org
    [2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 . DataONE was crawled via Preston with "preston update -u https://dataone.org".
    [3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep -v "hash://" | sort | uniq | wc -l
    [4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep "hash://" | sort | uniq | wc -l
    [5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' ' ' | grep -v "hash://" | sort | uniq | wc -l

    This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.

  11. Wikipedia Multimodal Dataset of Good Articles

    • kaggle.com
    Updated Dec 26, 2019
    Cite
    Oleh Onyshchak (2019). Wikipedia Multimodal Dataset of Good Articles [Dataset]. http://doi.org/10.34740/kaggle/dsv/861624
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 26, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oleh Onyshchak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    The dataset contains the text of each article together with all of its images, along with metadata such as image titles and descriptions. From Wikipedia, we selected good articles, which are only a small subset of all available ones, because they are manually reviewed and protected from edits.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    

    where:

    • pageN - is the title of N-th Wikipedia page and contains all information about the page
    • text.json - text of the page saved as JSON. Please refer to the details of JSON schema below.
    • meta.json - a collection of all images of the page. Please refer to the details of the JSON schema below.
    • imageN - is the N-th image of an article, saved in jpg format where width of each image is set to 600px. Name of the image is md5 hashcode of original image title.

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "text": "The Naval Battle of Guadalcanal, sometimes referred to as... ",
    }
    

    where:

    • title - page title
    • id - unique page id
    • url - url of a page on Wikipedia
    • text - text content of the article escaped from Wikipedia formatting

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "d681a3776d93663fc2788e7e469b27d7.jpg",
       "title": "Metallica Damaged Justice Tour.jpg",
       "description": "Metallica en concert",
       "url": "https://en.wikipedia.org/wiki/File%3AMetallica_Damaged_Justice_Tour.jpg",
       "features": [123.23, 10.21, ..., 24.17],
       },
      ]
    }
    

    where:

    • filename- unique image id, md5 hashcode of original image title
    • title- image title retrieved from Commons, if applicable
    • url - url of an image on Wikipedia
    • features - output of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. Features taken from original images downloaded in jpeg format with fixed width of 600px. Practically, it is a list of floats with len = 2048

    Note on Images

    • Some images aren't embedded on Wikipedia page from Commons, thus we can only download them in original type&size. If you want to use those as well, those images should be properly processed later. Each such image can be identified by suffix .ORIGINAL in a filename and absence of key features. Raw images are available in complete version of dataset on Google Drive
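
    A minimal sketch (Python; the dataset root path is an illustrative placeholder, the per-page layout follows the structure above) that walks the dataset and keeps only images with precomputed features:

    import json
    from pathlib import Path

    root = Path("wikipedia_good_articles")            # illustrative dataset root
    for page_dir in sorted(root.iterdir()):
        meta_path = page_dir / "img" / "meta.json"
        if not meta_path.exists():
            continue
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        usable = [m for m in meta["img_meta"] if "features" in m]   # skips .ORIGINAL images
        print(page_dir.name, len(usable), "images with features")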

    Collection method

    Data was collected by fetching featured articles' text and image content with the pywikibot library.

  12. Crosswalks from the NFDI4DS hackathon "machine-actionable Data Management...

    • zenodo.org
    bin, pdf, tsv
    Updated Apr 11, 2025
    Cite
    Dhwani Solanki; Dhwani Solanki; Suhasini Venkatesh; Suhasini Venkatesh; Safial Islam Ayon; Safial Islam Ayon; Martin Armbruster; Martin Armbruster; Katja Diederichs; Katja Diederichs; Sara El-Gebali; Sara El-Gebali; Giacomo Lanza; Giacomo Lanza; Antonia Leidel; Antonia Leidel; Jimena Linares Gómez; Jimena Linares Gómez; Olaf Michaelis; Olaf Michaelis; Rajendran Rajapreethi; Rajendran Rajapreethi; Marco Reidelbach; Marco Reidelbach; Gabriel Schneider; Gabriel Schneider; Sabine Schönau; Sabine Schönau; Christoph Steinbeck; Christoph Steinbeck; David Wallace; David Wallace; Jürgen Windeck; Jürgen Windeck; Xiao-Ran Zhou; Xiao-Ran Zhou; Leyla Jael Castro; Leyla Jael Castro (2025). Crosswalks from the NFDI4DS hackathon "machine-actionable Data Management Plan for NFDI" [Dataset]. http://doi.org/10.5281/zenodo.15129830
    Explore at:
    Available download formats: tsv, bin, pdf
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Zenodo
    Authors
    Dhwani Solanki; Dhwani Solanki; Suhasini Venkatesh; Suhasini Venkatesh; Safial Islam Ayon; Safial Islam Ayon; Martin Armbruster; Martin Armbruster; Katja Diederichs; Katja Diederichs; Sara El-Gebali; Sara El-Gebali; Giacomo Lanza; Giacomo Lanza; Antonia Leidel; Antonia Leidel; Jimena Linares Gómez; Jimena Linares Gómez; Olaf Michaelis; Olaf Michaelis; Rajendran Rajapreethi; Rajendran Rajapreethi; Marco Reidelbach; Marco Reidelbach; Gabriel Schneider; Gabriel Schneider; Sabine Schönau; Sabine Schönau; Christoph Steinbeck; Christoph Steinbeck; David Wallace; David Wallace; Jürgen Windeck; Jürgen Windeck; Xiao-Ran Zhou; Xiao-Ran Zhou; Leyla Jael Castro; Leyla Jael Castro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    From 28 to 30 October 2024, ZB MED Information Centre for Life Sciences organized a hackathon in Cologne within the scope of NFDI4DataScience with the purpose of identifying core elements for a machine-actionable Data Management Plan (maDMP) across the German National Research Data Infrastructure (NFDI). We used the maDMP application profile from the Research Data Alliance DMP Commons Working Group as a starting/reference point. An additional element, ManagementPlan, was also included; ManagementPlan comes originally from DataCite v4.4, and its representation as an extension proposed by the ZB MED/NFDI4DataScience machine-actionable Software Management Plan metadata schema was used.

    This dataset comprises crosswalks corresponding to the following elements from the RDA maDMP: DMP, Project, Dataset, Distribution. The crosswalks were created following a template prepared by the organizers. Participants freely selected a metadata schema or platform to be mapped against the attributes in the RDA maDMP application profile. Crosswalks were created for DMP, Project, Dataset, and Distribution, while comments were provided for DMP and Dataset.

    The list of files in this dataset is as follows:

    • Crosswalk template: two spreadsheets (AllTemplates.ods and AllTemplates.xlsx) containing 13 tabs corresponding to the instructions (InstructionsAllTemplates.pdf and InstructionsAllTemplates.tsv), crosswalk templates for Management Plan (CrosswalkTemplate-MngPlan.tsv), DMP (CrosswalkTemplate-DMP.tsv), Project (CrosswalkTemplate-Project.tsv), Dataset (CrosswalkTemplate-Dataset.tsv), Distribution (CrosswalkTemplate-Distribution.tsv) and Host (CrosswalkTemplate-Host.tsv), and comment templates for the same elements: Management Plan (CommentsTemplate-MngPlan.tsv), DMP (CommentsTemplate-DMP.tsv), Project (CommentsTemplate-Project.tsv), Dataset (CommentsTemplate-Dataset.tsv), Distribution (CommentsTemplate-Distribution.tsv) and Host (CommentsTemplate-Host.tsv). The crosswalk templates are meant to be used to compare a source against the reference point, while the comment templates are meant to be used to provide comments on the reference point. Each tab/crosswalk is also provided as an individual TSV file. The crosswalk and comment templates contain the following columns:
      • Crosswalk for the reference source: Property, Range, Description, Cardinality (One, Many), Requirement (minimum, recommended, optional), Example
      • Crosswalk for the mapped source: Property, Range, Description, Cardinality (One, Many), Requirement (minimum, recommended, optional), Recommended Vocabulary, Comment, Review
      • Comments for the reference source: Property, Range, Description, Cardinality (One, Many), Requirement (minimum, recommended, optional), Example
      • Comments for the mapped source: Comments to Property, Comments to Range, Comments to Description, Comments to Cardinality (One, Many), Comments to Requirement (minimum, recommended, optional), Recommended Vocabulary
    • Crosswalk summary: two spreadsheets (Summary.ods and Summary.xlsx) containing six tabs corresponding to crosswalk summaries for DMP (Summary-DMP.tsv), Project (Summary-Project.tsv), Dataset (Summary-Dataset.tsv), and Distribution (Summary-Distribution.tsv), and comment summaries for DMP (Summary-Comments-DMP.tsv) and Dataset (Summary-Comments-Dataset.tsv). Each tab/crosswalk is also provided as an individual TSV file. The crosswalks and comments always contain the reference point (i.e., as defined by the RDA maDMP application profile) and the following mapped resources, separated by an empty (blue) column:
      • Crosswalks for DMP: RDMO, DataPLAN, NFDIxCS, Horizon Europe, DFG Checklist, DataCite, and GFBio_DMPT
      • Comments to DMP: NFDIxCS
      • Crosswalks for Project: RDMO, Metadata4Ing, NFDIxCS, Schema.org, and DFG Checklist
      • Crosswalks for Dataset: RDMO, DataPLAN, Metadata4Ing, NFDIxCS, Horizon Europe, DFG Checklist, and DataCite
      • Comments to Dataset: NFDIxCS
      • Crosswalks for Distribution: RDMO and Metadata4Ing
    • Crosswalks to RDMO: two spreadsheets (RDMO.ods and RDMO.xlsx) containing four tabs corresponding to the crosswalks for DMP (RDMO-DMP.tsv), Project (RDMO-Project.tsv), Dataset (RDMO-Dataset.tsv) and Distribution (RDMO-Distribution.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to DataPLAN: two spreadsheets (DataPLAN.ods and DataPLAN.xlsx) containing two tabs corresponding to the crosswalks for DMP (DataPLAN-DMP.tsv) and Dataset (DataPLAN-Dataset.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to Metadata4Ing: two spreadsheets (Metadata4Ing.ods and Metadata4Ing.xlsx) containing three tabs corresponding to the crosswalks for Project (Metadata4Ing-Project.tsv), Dataset (Metadata4Ing-Dataset.tsv) and Distribution (Metadata4Ing-Distribution.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to NFDIxCS: two spreadsheets (NFDIxCS.ods and NFDIxCS.xlsx) containing five tabs corresponding to the crosswalks for DMP (NFDIxCS-DMP.tsv), Project (NFDIxCS-Project.tsv), and Dataset (NFDIxCS-Dataset.tsv), and additional comments for DMP (NFDIxCS-Comments-to-DMP.tsv) and Dataset (NFDIxCS-Comments-to-Dataset.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to Schema.org: two spreadsheets (SchemaOrg.ods and SchemaOrg.xlsx) containing one tab corresponding to the crosswalk for Project (SchemaOrg-Project.tsv). The tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to Horizon Europe as recorded in its RDMO template: two spreadsheets (HorizonEurope.ods and HorizonEurope.xlsx) containing two tabs corresponding to the crosswalks for DMP (HorizonEurope-DMP.tsv) and Dataset (HorizonEurope-Dataset.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to DFG Checklist as recorded in its template in RDMO: two spreadsheets (DFGChecklist.ods and DFGChecklist.xlsx) containing three tabs corresponding to the crosswalks for DMP (DFGChecklist-DMP.tsv), Project (DFGChecklist-Project.tsv), and Dataset (DFGChecklist-Dataset.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to DataCite v4.5: two spreadsheets (DataCite.ods and DataCite.xlsx) containing two tabs corresponding to the crosswalks for DMP (DataCite-DMP.tsv) and Dataset (DataCite-Dataset.tsv). Each tab/crosswalk is also provided as an individual TSV file.
    • Crosswalks to GFBio_DMPT: two spreadsheets (GFBio_DMPT.ods and GFBio_DMPT.xlsx) containing one tab corresponding to the crosswalk for DMP (GFBio_DMPT-DMP.tsv). The tab/crosswalk is also provided as an individual TSV file.

    More information about the activities carried out during the hackathon and the analysis of the crosswalks will be available soon.

    Acknowledgements

    The activities and discussion reported here were carried out during a hackathon organized by the Semantic Technologies team at ZB MED Information Centre from 28 to 30 October 2024 in Cologne, Germany, and sponsored by the NFDI4DataScience consortium. NFDI4DataScience is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft – DFG) under grant No. 460234259.

    The DMP4NFDI team acknowledges the support of DFG - German Research Foundation - through the coordination fund (project number 521466146).

    David Wallace and Jürgen Windeck would like to thank the Federal Government and the Heads of Government of the Länder, as well as the Joint Science Conference (GWK), for their funding and support within the framework of the NFDI4ING consortium. Funded by the German Research Foundation (DFG) - project number 442146713.

    Marco Reidelbach is supported by MaRDI, funded by the Deutsche Forschungsgemeinschaft (DFG), project number 460135501, NFDI 29/1 “MaRDI – Mathematische Forschungsdateninitiative”.

  13. Data from: Innuendo Whole Genome And Core Genome Mlst Schemas And Datasets...

    • explore.openaire.eu
    • euskadi.osasuna.ezagutzarenataria.eus
    • +2more
    Updated Jul 30, 2018
    Cite
    Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço (2018). Innuendo Whole Genome And Core Genome Mlst Schemas And Datasets For Escherichia Coli [Dataset]. http://doi.org/10.5281/zenodo.1323690
    Explore at:
    Dataset updated
    Jul 30, 2018
    Authors
    Mirko Rossi; Mickael Santos Da Silva; Bruno Filipe Ribeiro-Gonçalves; Diogo Nuno Silva; Miguel Paulo Machado; Mónica Oleastro; Vítor Borges; Joana Isidro; Luis Viera; Jani Halkilahti; Anniina Jaakkonen; Federica Palma; Saara Salmenlinna; Marjaana Hakkinen; Javier Garaizar; Joseba Bikandi; Friederike Hilbert; João André Carriço
    Description

    Dataset

    As reference dataset, 2,218 public draft or complete genome assemblies and available metadata of Escherichia coli have been downloaded from EnteroBase in April 2017. Genomes have been selected on the basis of the ribosomal ST (rST) classification available in EnteroBase: from the same rST, genomes have been randomly selected and downloaded. The number of samples for each rST in the final dataset is proportional to those available in EnteroBase in April 2017. The dataset also includes 119 Shiga toxin-producing E. coli genomes assembled with INNUca v3.1 belonging to the INNUENDO Sequence Dataset (PRJEB27020).

    File 'Metadata/Ecoli_metadata.txt' contains metadata information for each strain, including source classification, taxa of the hosts, country and year of isolation, serotype, pathotype, classical pubMLST 7-gene ST classification, assembly source/method and EnteroBase barcode. The directory 'Genomes' contains the 119 INNUca v3.1 assemblies of the strains listed in 'Metadata/Ecoli_metadata.txt'. EnteroBase assemblies can be downloaded from http://enterobase.warwick.ac.uk/species/ecoli/search_strains using 'barcode'.

    Schema creation and validation

    The wgMLST schema from EnteroBase has been downloaded and curated using chewBBACA AutoAlleleCDSCuration to remove all alleles that are not coding sequences (CDS). The quality of the remaining loci has been assessed using chewBBACA Schema Evaluation, and loci with single alleles, those with high length variability (i.e. if more than 1 allele is outside the mode +/- 0.05 size) and those present in less than 0.5% of the Escherichia genomes in EnteroBase at the date of the analysis (April 2017) have been removed. The wgMLST schema has been further curated, excluding all loci detected as "Repeated Loci" and loci annotated as "non-informative paralogous hit (NIPH/NIPHEM)" or "Allele Larger/Smaller than length mode (ALM/ASM)" by the chewBBACA Allele Calling engine in more than 1% of a dataset composed of 2,337 Escherichia coli genomes.

    File 'Schema/Ecoli_wgMLST_7601_schema.tar.gz' contains the wgMLST schema formatted for chewBBACA and includes a total of 7,601 loci. File 'Schema/Ecoli_cgMLST_2360_listGenes.txt' contains the list of genes from the wgMLST schema which defines the cgMLST schema. The cgMLST schema consists of 2,360 loci and has been defined as the loci present in at least 99% of the 2,337 Escherichia coli genomes; genomes have no more than 2% of missing loci.

    File 'Allele_Profles/Ecoli_wgMLST_alleleProfiles.tsv' contains the wgMLST allelic profile of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci follow the annotation of the chewBBACA Allele Calling software. File 'Allele_Profles/Ecoli_cgMLST_alleleProfiles.tsv' contains the cgMLST allelic profile of the 2,337 Escherichia coli genomes of the dataset. Please note that missing loci are indicated with a zero.

    Additional citations

    The schemas are prepared to be used with chewBBACA. When using the schema in this repository please cite also: Silva M, Machado M, Silva D, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço J. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. 15/03/2018. M Gen 4(3): doi:10.1099/mgen.0.000166 http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000166

    The Escherichia coli schema is a derivation of the EnteroBase E. coli wgMLST schema. When using the schema in this repository please cite also: Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M (2018) A genomic overview of the population structure of Salmonella. PLoS Genet 14(4): e1007261. https://doi.org/10.1371/journal.pgen.1007261

    The isolates' genome raw sequence data produced within the activity of the INNUENDO project were submitted to the European Nucleotide Archive (ENA) database and are publicly available under the project accession number PRJEB27020. When using the schemas, the assemblies or the allele profiles, please include the project number in your publication.

    The research from the INNUENDO project has received funding from the European Food Safety Authority (EFSA), grant agreement GP/EFSA/AFSCO/2015/01/CT2 (New approaches in identifying and characterizing microbial and chemical hazards), and from the Government of the Basque Country. The conclusions, findings, and opinions expressed in this repository reflect only the view of the INNUENDO consortium members and not the official position of EFSA nor of the Government of the Basque Country. EFSA and the Government of the Basque Country are not responsible for any use that may be made of the information included in this repository. The INNUENDO consortium thanks the Austrian Agency for Health and Food Safety Limited for participating in the project by providing strains. The consortium thanks all the researchers and the authorities worldwide who are contributing by submitting the raw sequences of the bacterial strains to public repositories. The project wa...

  14. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    Updated Apr 4, 2020
    Cite
    Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oleh Onyshchak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    • This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
    • A superset based on good articles is also hosted on Kaggle; it has six times more entries, although of somewhat lower quality.

    The dataset contains the text of each article together with all of its images, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are only a small subset of all available ones, because they are manually reviewed and protected from edits. Thus it is the best quality that human editors on Wikipedia can offer.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    
    • pageN - the title of the N-th Wikipedia page; the directory contains all information about the page
    • text.json - text of the page saved as JSON. Please refer to the details of the JSON schema below.
    • meta.json - a collection of all images of the page. Please refer to the details of the JSON schema below.
    • imageN - the N-th image of an article, saved in jpg format where the width of each image is set to 600px. The name of the image is the md5 hashcode of the original image title.

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "html": "... 
    

    ...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }

    • title - page title
    • id - unique page id
    • url - url of a page on Wikipedia
    • html - HTML content of the article
    • wikitext - wikitext content of the article

    Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
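
    Since the dataset ships raw wikitext, one possible way to get plain text out of it is the mwparserfromhell library; a minimal sketch (the page path is an illustrative placeholder):

    import json
    import mwparserfromhell

    with open("page1/text.json", encoding="utf-8") as fh:   # illustrative path
        page = json.load(fh)

    plain = mwparserfromhell.parse(page["wikitext"]).strip_code()  # drop templates and markup
    print(page["title"], plain[:200])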

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
       "title": "IronbottomSound.jpg",
       "parsed_title": "ironbottom sound",
       "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
       "is_icon": False,
       "on_commons": True,
       "description": "A U.S. destroyer steams up what later became known as ...",
       "caption": "Ironbottom Sound. The majority of the warship surface ...",
       "headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
       "features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
       },
       ...
      ]
    }
    
    • filename - unique image id, md5 hashcode of the original image title
    • title - image title retrieved from Commons, if applicable
    • parsed_title - image title split into words, i.e. "helloWorld.jpg" -> "hello world"
    • url - url of an image on Wikipedia
    • is_icon - True if the image is an icon, e.g. a category icon. We assume that an image is an icon if you cannot load a preview on Wikipedia after clicking on it
    • on_commons - True if the image is available from the Wikimedia Commons dataset
    • description - description of an image parsed from its Wikimedia Commons page, if available
    • caption - caption of an image parsed from the Wikipedia article, if available
    • headings - list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading
    • features - output of the 5-th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape of (2048,). Features are taken from original images downloaded in jpeg format with a fixed width of 600px. Practically, it is a list of floats with len = 2048

    Collection method

    Data was collected by fetching featured articles' text and image content with the pywikibot library and then parsing out additional metadata from HTML pages on Wikipedia and Commons.

  15. Warehouse and Retail Sales

    • catalog.data.gov
    • data.montgomerycountymd.gov
    • +3more
    Updated May 10, 2025
    Cite
    data.montgomerycountymd.gov (2025). Warehouse and Retail Sales [Dataset]. https://catalog.data.gov/dataset/warehouse-and-retail-sales
    Explore at:
    Dataset updated
    May 10, 2025
    Dataset provided by
    data.montgomerycountymd.gov
    Description

    This dataset contains a list of sales and movement data by item and department, appended monthly. Update Frequency: Monthly

  16. Michigan 3 Data Exchange Content, NGDS YR 3 Deliverables - Metadata...

    • data.wu.ac.at
    • datadiscoverystudio.org
    zip
    Updated Dec 5, 2017
    Cite
    (2017). Michigan 3 Data Exchange Content, NGDS YR 3 Deliverables - Metadata Compilation [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/NmZhOTA1OTItY2ZiZS00ZDBkLWExZGEtZDk5Yzk5NDczN2I2
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 5, 2017
    Area covered
    Michigan, 962e6a363ee0f33a928a0b372acb3c3e3ee3d75f
    Description

    This resource is a metadata compilation for Michigan geothermal related data in data exchange models submitted to the AASG National Geothermal Data System project to fulfill Year 3 data deliverables by the Michigan Geological Survey, Western Michigan University. Descriptions, links, and contact information for the ESRI Map Services created with Michigan data are also available here, including borehole temperature data, drill stem test data, lithology interval data, heat pump installations, physical samples, and well header data for the state of Michigan. The data and associated services were provided by the Michigan Geological Survey, Western Michigan University. The compilation is published as an Excel workbook containing header features including title, description, author, citation, originator, distributor, and resource URL links to scanned maps for download. The Excel workbook contains 6 worksheets, including information about the template, notes related to revisions of the template, resource provider information, the metadata, a field list (data mapping view) and vocabularies (data valid terms) used to populate the data worksheet. This metadata compilation was provided by the Michigan Geological Survey at Western Michigan University and made available for distribution through the National Geothermal Data System.

  17. Gas samples of Afghanistan and adjacent areas (gasafg.shp)

    • data.wu.ac.at
    • datadiscoverystudio.org
    • +3more
    Updated Jun 8, 2018
    Cite
    Department of the Interior (2018). Gas samples of Afghanistan and adjacent areas (gasafg.shp) [Dataset]. https://data.wu.ac.at/schema/data_gov/ZjIwZmUwOTYtMGIwYy00OTUzLTk2YTItYzllOTI5MzRmOTNm
    Explore at:
    Available download formats: zip, internet map service
    Dataset updated
    Jun 8, 2018
    Dataset provided by
    Department of the Interior
    Area covered
    83026b0b4bdbdb42fc209230cd190d317c5ff448
    Description

    This shapefile contains points that describe the location of gas samples collected in Afghanistan and adjacent areas and the results of organic geochemical analysis.

  18. Data from: ClaimsKG - A Knowledge Graph of Fact-Checked Claims

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Oct 1, 2019
    Cite
    Andon Tchechmedjiev; Pavlos Fafalios; Konstantin Todorov; Stefan Dietze; Boland; Zapilko (2019). ClaimsKG - A Knowledge Graph of Fact-Checked Claims [Dataset]. http://doi.org/10.5281/zenodo.2628744
    Explore at:
    Dataset updated
    Oct 1, 2019
    Authors
    Andon Tchechmedjiev; Pavlos Fafalios; Konstantin Todorov; Stefan Dietze; Boland; Zapilko
    Description

    ClaimsKG is a knowledge graph of metadata information for thousands of fact-checked claims which facilitates structured queries about their truth values, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia, and lifts all data to RDF using an RDF/S model that makes use of established vocabularies (such as schema.org).

    ClaimsKG does NOT contain the text of the reviews from the fact-checking web sites; it only contains structured metadata information and links to the reviews.

    More information, such as statistics, query examples and a user friendly interface to explore the knowledge graph, is available at: https://data.gesis.org/claimskg/site

    There is a newer version of the dataset available!

  19. PLBD (Protein Ligand Binding Database) table description XML file

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 26, 2022
    Cite
    Konovalovas, Aleksandras (2022). PLBD (Protein Ligand Binding Database) table description XML file [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7482007
    Explore at:
    Dataset updated
    Dec 26, 2022
    Dataset provided by
    Čapkauskaitė, Edita
    Gražulis, Saulius
    Vaitkus, Antanas
    Baranauskienė, Lina
    Urniežius, Ernestas
    Matulis, Daumantas
    Petrauskas, Vytautas
    Zakšauskas, Audrius
    Smirnovienė, Joana
    Dudutienė, Virginija
    Konovalovas, Aleksandras
    Grybauskas, Algirdas
    Merkys, Andrius
    Zubrienė, Asta
    Mickevičiūtė, Aurelija
    Paketurytė, Vaida
    Gedgaudas, Marius
    Lingė, Darius
    Kazlauskas, Egidijus
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PLBD (Protein Ligand Binding Database) table description XML file

    General

    The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.

    The XML file conforms to the published XSD schema created for descriptions of relational databases for specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.

    The ZIP file contains an excerpt of the files hosted at https://plbd.org/ at the moment of submission of the PLBD database to the Scientific Data journal, and is provided to conform to the journal's policies. The current data and schemas should be fetched from the published URIs:

    https://plbd.org/
    https://plbd.org/doc/db/schemas
    https://plbd.org/doc/xml/schemas
    

    The software used to generate SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing the databases generated from the XML description on the Web, is available in public Subversion repositories:

    svn://www.crystallography.net/solsa-database-scripts
    svn://saulius-grazulis.lt/restfuldb
    

    Usage

    The unpacked ZIP file will create the "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with a GNU Make package installed, the XML file validity can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:

    unzip main-database-description.zip
    cd db/release/v0.10.0/tables/
    sh -x dependencies/Linuxmint-20.1/install.sh
    make distclean
    make
    

    If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. At the time of writing, Debian-10 and Linuxmint-20.1 OSes are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts require running the package installation commands with system administrator privileges, but they use only the standard system package manager, so they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.

    The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked and should indicate validity of the provided XML file against the referenced schemas.

    Layout of the archived file tree

    db/
    └── release
      └── v0.10.0
        └── tables
          ├── Makeconfig-validate-xml
          ├── Makefile
          ├── Makelocal-validate-xml
          ├── dependencies
          ├── main-database-description.xml
          ├── outputs
          └── schema
    
  20. Data from: Apollo 16 Coarse Fines (4-10 mm): Sample Classification,...

    • data.wu.ac.at
    • datadiscoverystudio.org
    • +3more
    pdf
    Updated Aug 9, 2018
    Cite
    National Aeronautics and Space Administration (2018). Apollo 16 Coarse Fines (4-10 mm): Sample Classification, Description, and Inventory [Dataset]. https://data.wu.ac.at/schema/data_gov/NWM4MDgwMzgtMGVmOS00NjJkLTg5MDUtMDY4NDEzYjBmZTUw
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 9, 2018
    Dataset provided by
    NASA (http://nasa.gov/)
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Apollo 16 Coarse Fines (4-10 mm): Sample Classification, Description, and Inventory by U. Marvin
