100+ datasets found
  1. Pairwise Multi-Class Document Classification for Semantic Relations between...

    • data.niaid.nih.gov
    Updated Aug 1, 2020
    + more versions
    Cite
    Terry Ruas (2020). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models & Code) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3713182
    Explore at:
    Dataset updated
    Aug 1, 2020
    Dataset provided by
    Moritz Schubotz
    Bela Gipp
    Malte Ostendorff
    Georg Rehm
    Terry Ruas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to identify the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    Datasets (articles, class labels, cross-validation splits)

    Pretrained models (Transformers, GloVe, Doc2vec)

    Model output (prediction) for the best performing models

    Dataset

    The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.

    The actual dataset, as used in stratified k-fold cross-validation with k=4, is provided in train_testdata_4folds.tar.gz.

    ├── 1
    │   ├── test.csv
    │   └── train.csv
    ├── 2
    │   ├── test.csv
    │   └── train.csv
    ├── 3
    │   ├── test.csv
    │   └── train.csv
    └── 4
        ├── test.csv
        └── train.csv

    4 directories, 8 files
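    The corpus and fold files above can be read with the Python standard library alone. A minimal sketch, assuming only the jsonl.bz2 line-per-article format and the <fold>/{train,test}.csv layout shown (the CSV column schema is not documented here, so rows are returned as raw lists; the function names are illustrative, not from the dataset's code):

```python
import bz2
import csv
import json

def read_jsonl_bz2(path):
    """Yield one article dict per line from a bz2-compressed JSON-lines
    dump such as enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def load_fold(root, fold):
    """Return raw train/test rows for one fold, given the
    <root>/<fold>/{train,test}.csv layout shown above."""
    splits = {}
    for split in ("train", "test"):
        with open(f"{root}/{fold}/{split}.csv", newline="", encoding="utf-8") as f:
            splits[split] = list(csv.reader(f))
    return splits
```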

    Pretrained models

    PyTorch: vanilla and Siamese BERT + XLNet

    A pretrained model for each fold is available in the corresponding model archives:

    Vanilla

    model_wiki.bert_base_joint_seq512.tar.gz
    model_wiki.xlnet_base_joint_seq512.tar.gz

    Siamese

    model_wiki.bert_base_siamese_seq512_4d.tar.gz
    model_wiki.xlnet_base_siamese_seq512_4d.tar.gz

  2. Fyp Semantic Dataset

    • universe.roboflow.com
    zip
    Updated Mar 4, 2025
    Cite
    fyp (2025). Fyp Semantic Dataset [Dataset]. https://universe.roboflow.com/fyp-efein/fyp-semantic/model/7
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2025
    Dataset authored and provided by
    fyp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Greenery VFcH Masks
    Description

    Fyp Semantic

    ## Overview
    
    Fyp Semantic is a dataset for semantic segmentation tasks - it contains Greenery VFcH annotations for 1,000 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. Dataset Corrosion Seg Semantic Dataset

    • universe.roboflow.com
    zip
    Updated Oct 8, 2024
    Cite
    computervision (2024). Dataset Corrosion Seg Semantic Dataset [Dataset]. https://universe.roboflow.com/computervision-laxn2/dataset-corrosion-seg-semantic/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    computervision
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Corrosion Masks
    Description

    Dataset Corrosion Seg Semantic

    ## Overview
    
    Dataset Corrosion Seg Semantic is a dataset for semantic segmentation tasks - it contains Corrosion annotations for 978 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  4. sick

    • huggingface.co
    Updated Sep 1, 2023
    Cite
    Roberto Zamparelli (2023). sick [Dataset]. https://huggingface.co/datasets/RobZamp/sick
    Explore at:
    Dataset updated
    Sep 1, 2023
    Authors
    Roberto Zamparelli
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Shared and internationally recognized benchmarks are fundamental for the development of any computational system. We aim to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large English benchmark tailored for them. SICK consists of about 10,000 English sentence pairs that include many examples of the lexical, syntactic, and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of CDSMs. By means of crowdsourcing techniques, each pair was annotated for two crucial semantic tasks: relatedness in meaning (with a 5-point rating scale as gold score) and entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral). The SICK data set was used in SemEval-2014 Task 1, and it is freely available for research purposes.

  5. Biotea dataset (vr. July 2012)

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Leyla Jael Garcia Castro; Olga Giraldo; Casey McLaughlin; Alexander Garcia; Leyla Jael Garcia Castro; Olga Giraldo; Casey McLaughlin; Alexander Garcia (2020). Biotea dataset (vr. July 2012) [Dataset]. http://doi.org/10.5281/zenodo.376814
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo
    Authors
    Leyla Jael Garcia Castro; Olga Giraldo; Casey McLaughlin; Alexander Garcia; Leyla Jael Garcia Castro; Olga Giraldo; Casey McLaughlin; Alexander Garcia
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Background

    Information reported by scientific literature still remains locked up in discrete documents that are not always interconnected or machine-readable. The Semantic Web together with approaches such as the Resource Description Framework (RDF) and the Linked Open Data (LOD) initiative offer a connectivity tissue that can be used to support the generation of self-describing, machine-readable documents.

    Results

    Biotea is an approach to generate RDF from scholarly documents. Our RDF model makes extensive use of existing ontologies and semantic enrichment services. Our dataset comprises 270,834 articles from PubMed Central in RDF/XML, distributed in 404 zipped files. The RDFization process takes care of metadata, e.g., title, authors, and journal, as well as semantic annotations on biological entities along the full text. Biological entities are extracted by using the NCBO Annotator and Whatizit.

    We use the Bibliographic Ontology (BIBO), Dublin Core Metadata Initiative Terms (DCMI-terms), and the Provenance Ontology (PROV-O) to model the bibliographic metadata. Links to related pages such as PubMed HTML articles are provided via rdfs:seeAlso while links to other semantic representation such as Bio2RDF PubMed articles are provided via owl:sameAs.

    The NCBO Annotator is used to extract entities covering ChEBI for chemicals; Pathway, and Functional Genomics Data Society (MGED) for genes and proteins; Master Drug Data Base (MDDB), NDDF, and NDFRT for drugs; SNOMED, SYMP, MedDRA, MeSH, MedlinePlus Health Topics (MedlinePlus), Online Mendelian Inheritance in Man (OMIM), FMA, ICD10, and Ontology for Biomedical Investigations (OBI) for diseases and medical terms; PO for plants; and MeSH, SNOMED, and NCIt for general terms.

    Whatizit is used for GO, UniProt proteins, UniProt Taxonomy, and diseases mapped to the UMLS; UniProt taxa are also mapped to NCBI Taxon vocabulary.

    Conclusions

    Biotea delivers models and tools for metadata enrichment and semantic processing of biomedical documents. Our dataset makes it easier to access the first batch of RDFized articles following the Biotea model. We plan to update our dataset on a regular basis in order to incorporate the latest articles added to the PubMed Central collection; the next delivery is planned for the first half of 2017. Subsequent datasets will support a mapping to the Semanticscience Integrated Ontology (SIO) in order to comply with the guidelines set by Bio2RDF.

    Notes

    Biotea approach in full is available at http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-S1-S5 (Garcia Castro, L.J., C. McLaughlin, and A. Garcia, Biotea: RDFizing PubMed Central in Support for the Paper as an Interface to the Web of Data. Biomedical semantics, 2013. 4 Suppl 1: p. S5).

    Biotea algorithms are publicly available at https://github.com/biotea

  6. Data from: Augmentation of Semantic Processes for Deep Learning Applications...

    • tandf.figshare.com
    txt
    Updated Jun 2, 2025
    Cite
    Maximilian Hoffmann; Lukas Malburg; Ralph Bergmann (2025). Augmentation of Semantic Processes for Deep Learning Applications [Dataset]. http://doi.org/10.6084/m9.figshare.29212617.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Maximilian Hoffmann; Lukas Malburg; Ralph Bergmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The popularity of Deep Learning (DL) methods used in business process management research and practice is constantly increasing. One important factor that hinders the adoption of DL in certain areas is the availability of sufficiently large training datasets, particularly affecting domains where process models are mainly defined manually with a high knowledge-acquisition effort. In this paper, we examine process model augmentation in combination with semi-supervised transfer learning to enlarge existing datasets and train DL models effectively. The use case of similarity learning between manufacturing process models is discussed. Based on a literature study of existing augmentation techniques, a concept is presented with different categories of augmentation, from knowledge-light approaches to knowledge-intensive ones, e.g., based on automated planning. Specifically, the impacts of augmentation approaches on the syntactic and semantic correctness of the augmented process models are considered. The concept also proposes a semi-supervised transfer learning approach to integrate augmented and non-augmented process model datasets in a two-phased training procedure. The experimental evaluation investigates augmented process model datasets regarding their quality for model training in the context of similarity learning between manufacturing process models. The results indicate large potential, with a reduction of the prediction error by up to 53%.

  7. Urban Semantic Dataset

    • universe.roboflow.com
    zip
    Updated Jan 17, 2025
    Cite
    LabIA (2025). Urban Semantic Dataset [Dataset]. https://universe.roboflow.com/labia-z0pkg/urban-semantic/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 17, 2025
    Dataset authored and provided by
    LabIA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Traffic Signs Streetlamp Bounding Boxes
    Description

    Urban Semantic

    ## Overview
    
    Urban Semantic is a dataset for object detection tasks - it contains Traffic Signs Streetlamp annotations for 7,107 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. A dataset of grape multimodal object detection and semantic segmentation

    • scidb.cn
    Updated Aug 28, 2023
    Cite
    Wenjun Chen; Yuan Rao; Fengyi Wang; Yu Zhang; Yumeng Yang; Qing Luo; Tong Zhang; Tianyu Wan; Xinyu Liu; Mengyu Zhang; Rui Zhang (2023). A dataset of grape multimodal object detection and semantic segmentation [Dataset]. http://doi.org/10.57760/sciencedb.j00001.00883
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Wenjun Chen; Yuan Rao; Fengyi Wang; Yu Zhang; Yumeng Yang; Qing Luo; Tong Zhang; Tianyu Wan; Xinyu Liu; Mengyu Zhang; Rui Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The accuracy of grape picking point localization is dependent on grape detection and semantic segmentation network performance. However, in practical application scenarios, the accuracy and segmentation precision of grape targets based on visible light images are susceptible to light variations and complex environments, often performing poorly. Moreover, grapes grow in bunches, and the existing multimodal datasets for apples and pears can hardly meet the recognition needs of bunch-shaped grapes. The construction of visible, depth, and near-infrared multimodal object detection and semantic segmentation datasets of grapes is crucial to exploring better recognition rates and stronger generalization capabilities for grape detection and semantic segmentation models. This dataset, totaling about 39.08 GB, contains high-quality multimodal video stream data of green and purple grapes, including six varieties, under different illumination and obscuration conditions. Additionally, the dataset offers 3954 labeled image samples extracted from the aforementioned multimodal video. By means of rotation, deflation, mis-slicing, panning, and Gaussian blur, the dataset can be augmented for the training implementation of mainstream deep learning models. The dataset can provide valuable basic data resources for multimodal fusion, grape semantic segmentation, and object detection, which have important practical application value for promoting research in the field of agricultural machinery and equipment intelligence.

  9. code_x_glue_cc_clone_detection_poj104

    • huggingface.co
    • opendatalab.com
    Cite
    Google, code_x_glue_cc_clone_detection_poj104 [Dataset]. https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset authored and provided by
    Google (http://google.com/)
    License

    C-UDA License: https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "code_x_glue_cc_clone_detection_poj_104"

      Dataset Summary
    

    CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104. Given a code snippet and a collection of candidates as input, the task is to return the top-K codes with the same semantics. Models are evaluated by MAP score. We use the POJ-104 dataset for this task.
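    MAP here is the standard mean average precision over each query's ranked candidates. A minimal reference computation (a sketch of the metric itself, not the official CodeXGLUE evaluator script):

```python
def mean_average_precision(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 relevance flags,
    ordered by the model's ranking. Returns MAP across all queries."""
    average_precisions = []
    for flags in ranked_relevance:
        hits = 0
        precisions = []
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                hits += 1
                precisions.append(hits / rank)  # precision at each hit
        average_precisions.append(sum(precisions) / max(hits, 1))
    return sum(average_precisions) / len(average_precisions)
```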

      Supported Tasks and Leaderboards
    

    document-retrieval: The… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104.

  10. Aerial Semantic Drone Dataset

    • kaggle.com
    Updated May 25, 2021
    + more versions
    Cite
    Lalu Erfandi Maula Yusnu (2021). Aerial Semantic Drone Dataset [Dataset]. https://www.kaggle.com/nunenuh/semantic-drone/discussion
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 25, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lalu Erfandi Maula Yusnu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Aerial Semantic Drone Dataset

    The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.

    This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information as needed for our research purposes. We created tiff files with a resolution of 1200x800 pixels and 24 channels, each channel representing a class preprocessed from the png label files. We reduced the resolution and compressed the tif files with the tifffile Python library.
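    A one-hot channel stack like this collapses back to a single class-index map with an argmax over the channel axis. A sketch assuming the channel-first (24, H, W) layout described above; loading the file itself would use tifffile's imread, shown only as a comment to keep the snippet dependency-light:

```python
import numpy as np

def onehot_to_index(mask):
    """Collapse a (24, H, W) one-hot label stack, one channel per class,
    into an (H, W) map of class indices."""
    # mask = tifffile.imread("labels/tiff/example.tif")  # how the stack would be loaded
    return np.argmax(mask, axis=0)
```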

    If you have any problem with tif dataset that we have been modified you can contact nunenuh@gmail.com and gaungalif@gmail.com.

    This dataset is a copy of the original dataset (link below); we provide some improvements to the semantic data and classes. The semantic data is available in png and tiff format at a smaller size as needed.

    Semantic Annotation

    The images are labelled densely using polygons and contain the following 24 classes:

    unlabeled, paved-area, dirt, grass, gravel, water, rocks, pool, vegetation, roof, wall, window, door, fence, fence-pole, person, dog, car, bicycle, tree, bald-tree, ar-marker, obstacle, conflicting

    Directory Structure and Files

    > images
    > labels/png
    > labels/tiff
     - class_to_idx.json
     - classes.csv
     - classes.json
     - idx_to_class.json
    

    Included Data

    • 400 training images in jpg format can be found in "aerial_semantic_drone/images"
    • Dense semantic annotations in png format can be found in "aerial_semantic_drone/labels/png"
    • Dense semantic annotations in tiff format can be found in "aerial_semantic_drone/labels/tiff"
    • Semantic class definition in csv format can be found in "aerial_semantic_drone/classes.csv"
    • Semantic class definition in json can be found in "aerial_semantic_drone/classes.json"
    • Index to class name file can be found in "aerial_semantic_drone/idx_to_class.json"
    • Class name to index file can be found in "aerial_semantic_drone/class_to_idx.json"
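    The lookup files can then be read directly. A minimal sketch, assuming class_to_idx.json maps class names to integer indices (as the filename suggests); the function name is illustrative:

```python
import json

def load_class_maps(root):
    """Read class_to_idx.json from the dataset root and derive the
    inverse index-to-name mapping."""
    with open(f"{root}/class_to_idx.json", encoding="utf-8") as f:
        class_to_idx = json.load(f)
    idx_to_class = {idx: name for name, idx in class_to_idx.items()}
    return class_to_idx, idx_to_class
```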

    Contact

    aerial@icg.tugraz.at

    Citation

    If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at

    License

    The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:

    • That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions.
    • That you include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media, link to the Semantic Drone Dataset webpage.
    • That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character.
    • That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain.
    • That all rights not expressly granted to you are reserved by us (Graz University of Technology).

  11. Data from: S1S2-Water: A global dataset for semantic segmentation of water...

    • zenodo.org
    • data.niaid.nih.gov
    json, zip
    Updated Nov 22, 2023
    Cite
    Marc Wieland; Marc Wieland; Florian Fichtner; Sandro Martinis; Sandro Groth; Christian Krullikowski; Simon Plank; Mahdi Motagh; Florian Fichtner; Sandro Martinis; Sandro Groth; Christian Krullikowski; Simon Plank; Mahdi Motagh (2023). S1S2-Water: A global dataset for semantic segmentation of water bodies from Sentinel-1 and Sentinel-2 satellite images [Dataset]. http://doi.org/10.5281/zenodo.8314175
    Explore at:
    Available download formats: zip, json
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Wieland; Marc Wieland; Florian Fichtner; Sandro Martinis; Sandro Groth; Christian Krullikowski; Simon Plank; Mahdi Motagh; Florian Fichtner; Sandro Martinis; Sandro Groth; Christian Krullikowski; Simon Plank; Mahdi Motagh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The S1S2-Water dataset is a global reference dataset for training, validation, and testing of convolutional neural networks for semantic segmentation of surface water bodies in publicly available Sentinel-1 and Sentinel-2 satellite images. The dataset consists of 65 triplets of Sentinel-1 and Sentinel-2 images with quality-checked binary water masks. Samples are drawn globally on the basis of the Sentinel-2 tile grid (100 x 100 km), taking into account predominant landcover and the availability of water bodies. Each sample is complemented with metadata and a Digital Elevation Model (DEM) raster from the Copernicus DEM.

    This work was supported by the German Federal Ministry of Education and Research (BMBF) through the project "Künstliche Intelligenz zur Analyse von Erdbeobachtungs- und Internetdaten zur Entscheidungsunterstützung im Katastrophenfall" (AIFER) under Grant 13N15525, and by the Helmholtz Artificial Intelligence Cooperation Unit through the project "AI for Near Real Time Satellite-based Flood Response" (AI4FLOOD) under Grant ZT-IPF-5-39.

  12. Cityscapes Image Pairs

    • kaggle.com
    Updated Apr 20, 2018
    + more versions
    Cite
    DanB (2018). Cityscapes Image Pairs [Dataset]. https://www.kaggle.com/datasets/dansbecker/cityscapes-image-pairs/discussion
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 20, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DanB
    Description

    Context

    Cityscapes data (dataset home page) contains labeled videos taken from vehicles driven in Germany. This version is a processed subsample created as part of the Pix2Pix paper. The dataset has still images from the original videos, and the semantic segmentation labels are shown in images alongside the original image. This is one of the best datasets around for semantic segmentation tasks.

    Content

    This dataset has 2,975 training image files and 500 validation image files. Each image file is 256x512 pixels, and each file is a composite with the original photo on the left half of the image and the labeled image (output of semantic segmentation) on the right half.
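    Each composite can be split back into its two halves with plain array slicing. A sketch assuming the images are loaded as (256, 512, 3) arrays (the helper name is illustrative):

```python
import numpy as np

def split_pair(composite):
    """Split a (H, 2W, 3) composite into (photo, label): the original
    photo is the left half, the segmentation rendering is the right half."""
    width = composite.shape[1]
    return composite[:, : width // 2], composite[:, width // 2 :]
```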

    Acknowledgements

    This dataset is the same as what is available here from the Berkeley AI Research group.

    License

    The Cityscapes data available from cityscapes-dataset.com has the following license:

    This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:

    • That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Daimler AG, MPI Informatics, TU Darmstadt) do not accept any responsibility for errors or omissions.
    • That you include a reference to the Cityscapes Dataset in any work that makes use of the dataset. For research papers, cite our preferred publication as listed on our website; for other media cite our preferred publication as listed on our website or link to the Cityscapes website.
    • That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character.
    • That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain.
    • That all rights not expressly granted to you are reserved by (Daimler AG, MPI Informatics, TU Darmstadt).

    Inspiration

    Can you identify what objects are where in these images taken from a vehicle?

  13. Dataset - Clustering Semantic Predicates in the Open Research Knowledge...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Arab Oghli, Omar (2022). Dataset - Clustering Semantic Predicates in the Open Research Knowledge Graph [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6513498
    Explore at:
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Arab Oghli, Omar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates semantically relevant to the given paper.

    The paper instances in the dataset are grouped by ORKG comparisons; therefore, the data.json file is more comprehensive than training_set.json and test_set.json.

    data.json

    The main JSON object consists of a list of comparisons. Each comparison object has an ID, a label, a list of papers, and a list of predicates; each paper object has an ID, label, DOI, research field, research problems, and abstract. Each predicate object has an ID and a label. See an example instance below.

    {
      "comparisons": [
        {
          "id": "R108331",
          "label": "Analysis of approaches based on required elements in way of modeling",
          "papers": [
            {
              "id": "R108312",
              "label": "Rapid knowledge work visualization for organizations",
              "doi": "10.1108/13673270710762747",
              "research_field": {
                "id": "R134",
                "label": "Computer and Systems Architecture"
              },
              "research_problems": [
                {
                  "id": "R108294",
                  "label": "Enterprise engineering"
                }
              ],
              "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..."
            },
            ....
          ],
          "predicates": [
            {
              "id": "P37126",
              "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation"
            },
            {
              "id": "P36081",
              "label": "approach name"
            },
            ....
          ]
        },
        ....
      ]
    }

    training_set.json and test_set.json

    The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.

    Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, but the same training paper may appear in multiple instances, each time concatenated with a different comparison.

    {
      "instances": [
        {
          "instance_id": "R108331xR108301",
          "comparison_id": "R108331",
          "paper_id": "R108301",
          "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models."
        },
        ...
      ]
    }
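The nested structure above can be traversed with a few lines of Python. This is a minimal sketch based only on the example instances shown here; the function name and the filename in the usage comment are illustrative, not part of the dataset.

```python
import json

def iter_paper_instances(dataset):
    """Yield (comparison_id, paper_id, title, abstract) tuples from the
    comparisons JSON structure shown above."""
    for comparison in dataset["comparisons"]:
        for paper in comparison["papers"]:
            yield comparison["id"], paper["id"], paper["label"], paper.get("abstract", "")

# Usage (filename assumed for illustration):
# with open("comparisons.json", encoding="utf-8") as f:
#     dataset = json.load(f)
# for comp_id, paper_id, title, abstract in iter_paper_instances(dataset):
#     text = f"{title} {abstract}"  # same title+abstract concatenation as training_set.json
```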

    Dataset Statistics:

                            Papers   Predicates   Research Fields   Research Problems
        Min/Comparison           2            2                 1                   0
        Max/Comparison         202          112                 5                  23
        Avg./Comparison      21.54        12.79              1.20                1.09
        Total                 4060         1816                46                 178

    Dataset Splits:

                          Papers   Comparisons
        Training Set        2857           214
        Test Set            1203           180
  14. Datasets and Models for Historical Newspaper Article Segmentation

    • zenodo.org
    • explore.openaire.eu
    json, txt, zip
    Updated Jan 31, 2021
    Raphaël Barman; Maud Ehrmann; Simon Clematide; Sofia Ares Oliveira (2021). Datasets and Models for Historical Newspaper Article Segmentation [Dataset]. http://doi.org/10.5281/zenodo.3706863
    Explore at:
    Available download formats: json, txt, zip
    Dataset updated
    Jan 31, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Raphaël Barman; Maud Ehrmann; Simon Clematide; Sofia Ares Oliveira
    Description

    This record contains the datasets and models used and produced for the work reported in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers" (link).

    Please cite this paper if you are using the models/datasets or find it relevant to your research:

    @article{barman_combining_2020,
      title   = {{Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers}},
      author  = {Raphaël Barman and Maud Ehrmann and Simon Clematide and Sofia Ares Oliveira and Frédéric Kaplan},
      journal = {Journal of Data Mining \& Digital Humanities},
      volume  = {HistoInformatics},
      doi     = {10.5281/zenodo.4065271},
      year    = {2021},
      url     = {https://jdmdh.episciences.org/7097},
    }


    Please note that this record contains data under different licenses.

    1. DATA

    • Annotations (JSON files): JSON files contain image annotations, with one file per newspaper containing region annotations (label and coordinates) in VIA format. The following licenses apply:
      • luxwort.json: these annotations are under a CC0 1.0 license. Please refer to the rights statement specified for each image in the file.
      • GDL.json, IMP.json and JDG.json: these annotations are under a CC BY-SA 4.0 license.

    • Image files: The archive images.zip contains the Swiss titles image files (GDL, IMP, JDG) used for the experiments described in the paper. Those images are under copyright (property of the journal Le Temps and of ArcInfo) and can be used for academic research or educational purposes only. Redistribution, publication or commercial use are not permitted. These terms of use are similar to the following right statement: http://rightsstatements.org/vocab/InC-EDU/1.0/

    2. MODELS

    Some of the best models are released under a CC BY-SA 4.0 license (they are also available as assets of the current Github release).

    • JDG_flair-FT: this model was trained on JDG using French Flair and FastText embeddings. It is able to predict the four classes presented in the paper (Serial, Weather, Death notice and Stocks).
    • Luxwort_obituary_flair-bpemb: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It is able to predict the Death notice class.
    • Luxwort_obituary_flair-FT_indomain: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It is also able to predict the Death notice class.

    Those models can be used to predict probabilities on new images using the same code as in the original dhSegment repository. One needs to adjust three parameters of the predict function: 1) embeddings_path (the path to the embeddings list), 2) embeddings_map_path (the path to the compressed embedding map), and 3) embeddings_dim (the size of the embeddings).
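The three parameters named above can be collected as keyword arguments along these lines. This is a hedged sketch: the parameter names come from this record's description, but the file paths and the commented-out call shape are assumptions; the actual predict signature is defined in the dhlab-epfl/dhSegment-text repository.

```python
# Parameter names taken from the record above; paths are hypothetical placeholders.
embedding_kwargs = {
    "embeddings_path": "embeddings/fr_flair_fasttext.list",  # path to the embeddings list (assumed path)
    "embeddings_map_path": "embeddings/embedding_map.npz",   # path to the compressed embedding map (assumed path)
    "embeddings_dim": 300,                                   # size of the embeddings (assumed value)
}

# model.predict(image_filename, **embedding_kwargs)  # call shape is an assumption
```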

    Please refer to the paper for further information or contact us.

    3. CODE:

    https://github.com/dhlab-epfl/dhSegment-text


    4. ACKNOWLEDGEMENTS
    We warmly thank the journal Le Temps (owner of La Gazette de Lausanne and the Journal de Genève) and the group ArcInfo (owner of L'Impartial) for agreeing to share the related datasets for academic purposes. We also thank the National Library of Luxembourg for its support with all steps related to the Luxemburger Wort annotation release.
    This work was realized in the context of the impresso - Media Monitoring of the Past project and supported by the Swiss National Science Foundation under grant CR-SII5_173719.

    5. CONTACT
    Maud Ehrmann (EPFL-DHLAB)
    Simon Clematide (UZH)

  15. Datasets related to algorithms performance.

    • plos.figshare.com
    xlsx
    Updated Feb 14, 2025
    Lin Jun; Zhou Chenliang (2025). Datasets related to algorithms performance. [Dataset]. http://doi.org/10.1371/journal.pone.0315143.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Lin Jun; Zhou Chenliang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The smart grid builds on the physical grid, introducing various advanced communication technologies to form a new type of power grid. It not only meets user demand and enables optimal allocation of resources, but also improves the safety, economy, and reliability of the power supply, and it has become a major trend in the future development of the electric power industry. On the other hand, the complex network architecture of the smart grid and the application of various high-tech components have greatly increased the probability of equipment failure and the difficulty of fault diagnosis; timely discovery and diagnosis of problems in the operation of smart grid equipment has therefore become a key measure for ensuring safe grid operation. At present, existing smart grid equipment fault diagnosis techniques suffer from complex application procedures and generally low fault diagnosis rates, which greatly reduces the efficiency of smart grid maintenance. Based on this, this paper adopts a multimodal semantic model combining deep learning and a knowledge graph: on top of the original YOLOv4 target detection architecture, it introduces a knowledge graph to unify the representation and storage of the input multimodal information, and innovatively combines the YOLOv4 target detection algorithm with the knowledge graph to establish a smart grid equipment fault diagnosis model. Experiments show that, compared with existing fault detection algorithms, the YOLOv4-based algorithm constructed in this paper is more accurate, faster, and easier to operate.

  16. Data from: GEOSatDB: global civil earth observation satellite semantic database

    • scidb.cn
    • zenodo.org
    Updated Oct 7, 2023
    Ming Lin; Meng Jin; Juanzi Li; Yuqi Bai (2023). GEOSatDB: global civil earth observation satellite semantic database [Dataset]. http://doi.org/10.57760/sciencedb.11805
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 7, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Ming Lin; Meng Jin; Juanzi Li; Yuqi Bai
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    GEOSatDB is a semantic representation of Earth observation satellites and sensors that can be used to easily discover available Earth observation resources for specific research objectives.

    Background: The widespread availability of coordinated and publicly accessible Earth observation (EO) data empowers decision-makers worldwide to comprehend global challenges and develop more effective policies. Space-based satellite remote sensing, which serves as the primary tool for EO, provides essential information about the Earth and its environment by measuring various geophysical variables. This contributes significantly to our understanding of the fundamental Earth system and the impact of human activities. Over the past few decades, many countries and organizations have markedly improved their regional and global EO capabilities by deploying a variety of advanced remote sensing satellites. The rapid growth of EO satellites and advances in on-board sensors have significantly enhanced remote sensing data quality by expanding spectral bands and increasing spatio-temporal resolutions. However, users face challenges in accessing available EO resources, which are often maintained independently by various nations, organizations, or companies. As a result, a substantial portion of archived EO satellite resources remains underutilized. Enhancing the discoverability of EO satellites and sensors can effectively utilize the vast amount of EO resources that continue to accumulate at a rapid pace, thereby better supporting data for global change research.

    Methodology: This study introduces GEOSatDB, a comprehensive semantic database specifically tailored for civil Earth observation satellites. The foundation of the database is an ontology model conforming to standards set by the International Organization for Standardization (ISO) and the World Wide Web Consortium (W3C). This conformity enables data integration and promotes the reuse of accumulated knowledge. Our approach advocates a novel method for integrating Earth observation satellite information from diverse sources. It notably incorporates a structured prompt strategy utilizing a large language model to derive detailed sensor information from vast volumes of unstructured text.

    Dataset Information: The GEOSatDB portal (https://www.geosatdb.cn) has been developed to provide an interactive interface that facilitates the efficient retrieval of information on Earth observation satellites and sensors. The downloadable files in RDF Turtle format are located in the data directory and contain a total of 132,681 statements:

    - GEOSatDB_ontology.ttl: ontology modeling of concepts, relations, and properties.
    - satellite.ttl: 2,453 Earth observation satellites and their associated entities.
    - sensor.ttl: 1,035 Earth observation sensors and their associated entities.
    - sensor2satellite.ttl: relations between Earth observation satellites and sensors.

    GEOSatDB undergoes quarterly updates, involving the addition of new satellites and sensors, revisions based on expert feedback, and the implementation of additional enhancements.

  17. Radar Segmentation (RadSeg) Dataset

    • researchdatafinder.qut.edu.au
    Updated May 15, 2025
    Zi Huang (2025). Radar Segmentation (RadSeg) Dataset [Dataset]. https://researchdatafinder.qut.edu.au/individual/n62585
    Explore at:
    Dataset updated
    May 15, 2025
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Zi Huang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    RadSeg is a synthetic radar dataset designed for building semantic segmentation models for radar activity recognition. Unlike existing radio classification datasets that only provide signal-wise annotations for short and isolated I/Q sequences, RadSeg provides sample-wise annotations for interleaved radar pulse activities that extend across a long time horizon. This makes RadSeg the first annotated public dataset of its kind for radar activity recognition.

    Further information about the RadSeg dataset is available in our paper:

    Z. Huang, A. Pemasiri, S. Denman, C. Fookes and T. Martin, Multi-Stage Learning for Radar Pulse Activity Segmentation, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 7340-7344, doi: 10.1109/ICASSP48485.2024.10445810

  18. MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing

    • physionet.org
    Updated Nov 15, 2024
    Benedikt Boecking; Naoto Usuyama; Shruthi Bannur; Daniel Coelho de Castro; Anton Schwaighofer; Stephanie Hyland; Harshita Sharma; Maria Teodora Wetscherek; Tristan Naumann; Aditya Nori; Javier Alvarez Valle; Hoifung Poon; Ozan Oktay (2024). MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing [Dataset]. http://doi.org/10.13026/9g2z-jg61
    Explore at:
    Dataset updated
    Nov 15, 2024
    Authors
    Benedikt Boecking; Naoto Usuyama; Shruthi Bannur; Daniel Coelho de Castro; Anton Schwaighofer; Stephanie Hyland; Harshita Sharma; Maria Teodora Wetscherek; Tristan Naumann; Aditya Nori; Javier Alvarez Valle; Hoifung Poon; Ozan Oktay
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    We release a new dataset, MS-CXR, with locally-aligned phrase grounding annotations by board-certified radiologists to facilitate the study of complex semantic modelling in biomedical vision–language processing. The MS-CXR dataset provides 1162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. This dataset complements the existing MIMIC-CXR v.2 dataset and comprises: 1. Reviewed and edited bounding boxes and phrases (1026 pairs of bounding box/sentence); and 2. Manual bounding box labels from scratch (136 pairs of bounding box/sentence).

    This large, well-balanced phrase grounding benchmark dataset contains carefully curated image regions annotated with descriptions of eight radiology findings, as verified by radiologists. Unlike existing chest X-ray benchmarks, this challenging phrase grounding task evaluates joint, local image-text reasoning while requiring real-world language understanding, e.g. to parse domain-specific location references, complex negations, and bias in reporting style. This data accompany work showing that principled textual semantic modelling can improve contrastive learning in self-supervised vision–language processing.

  19. Data from: Time-series China urban land use mapping (2016–2022): An approach for achieving spatial-consistency and semantic-transition rationality in temporal domain

    • figshare.com
    zip
    Updated Dec 27, 2024
    Xiong Shuping (2024). Time-series China urban land use mapping (2016–2022): An approach for achieving spatial-consistency and semantic-transition rationality in temporal domain [Dataset]. http://doi.org/10.6084/m9.figshare.27610683.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    figshare
    Authors
    Xiong Shuping
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you want to use this data, please cite our article: Xiong, S., Zhang, X., Lei, Y., Tan, G., Wang, H., & Du, S. (2024). Time-series China urban land use mapping (2016–2022): An approach for achieving spatial-consistency and semantic-transition rationality in temporal domain. Remote Sensing of Environment, 312, 114344.

    The global urbanization trend is geographically manifested through city expansion and the renewal of internal urban structures and functions. Time-series urban land use (ULU) maps are vital for capturing dynamic land changes in the urbanization process, giving valuable insights into urban development and its environmental consequences. Recent studies have mapped ULU in some cities with a unified model but ignored the regional differences among cities; they also generated ULU maps year by year, ignoring temporal correlations between years, and thus can be weak in large-scale, long time-series ULU monitoring. Accordingly, we introduce a temporal-spatial-semantic collaborative (TSS) mapping framework for generating accurate ULU maps that accounts for regional differences and temporal correlations. First, to support model training, a large-scale ULU sample dataset based on OpenStreetMap (OSM) and Sentinel-2 imagery is automatically constructed, providing a total of 56,412 samples of size 512 × 512, which are divided into six sub-regions in China and used for training different classification models. Then, an urban land use mapping network (ULUNet) is proposed to recognize ULU. This model utilizes a primary and an auxiliary encoder to process noisy OSM samples and can enhance the model's robustness under noisy labels. Finally, taking the temporal correlations of ULU into consideration, the recognized ULU maps are optimized: their boundaries are unified by a time-series co-segmentation, and their categories are corrected by a knowledge-data-driven method. To verify the effectiveness of the proposed method, we consider all urban areas in China (254,566 km2) and produce a time-series China urban land use dataset (CULU) at a 10-m resolution, spanning 2016 to 2022, with an overall accuracy of 82.42%. Through comparison, it can be found that CULU outperforms existing datasets such as EULUC-China and UFZ-31cities in data accuracy, spatial boundary consistency, and land use transition logicality. The results indicate that the proposed method and the generated dataset can play important roles in land use change monitoring, ecological-environment evolution analysis, and sustainable city development.

  20. Data from: Intelligent Energy Systems Ontology: Local flexibility market and power system co-simulation demonstration

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    • +2more
    Updated Nov 15, 2023
    Pinto, Tiago (2023). Intelligent Energy Systems Ontology: Local flexibility market and power system co-simulation demonstration [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5526902
    Explore at:
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Santos, Gabriel
    Pinto, Tiago
    Morais, Hugo
    Corchado, Juan M.
    Vale, Zita
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Intelligent Energy Systems Ontology (IESO) provides semantic interoperability within a society of multi-agent systems (MAS) developed in the scope of power and energy systems (PES). It leverages the knowledge from existing and publicly available semantic models developed for specific PES subdomains to accomplish a shared vocabulary among the agents of the MAS community, overcoming heterogeneity among the reused ontologies. IESO provides agents with semantic reasoning, constraints validation, and data uniformization. The use of IESO is demonstrated through the simulation of the management of a rural distribution network, considering the validation of the grid’s technical constraints. This dataset publishes files demonstrating: i) a snapshot of the initial semantic knowledge base (KB); ii) queries to the KB to get services inputs; iii) conversions between syntactic and semantic models; iv) constraints validations; v) automatic conversion of units of measure.


Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models & Code)


Additional information can be found on GitHub.

The following data is supplemental to the experiments described in our research paper. The data consists of:

Datasets (articles, class labels, cross-validation splits)

Pretrained models (Transformers, GloVe, Doc2vec)

Model output (prediction) for the best performing models

Dataset

The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.
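The corpus can be streamed without decompressing it to disk. This is a minimal sketch: the helper name is ours, and the `"title"` key in the usage comment is an assumption based on segment_wiki's usual per-line JSON output, not something stated in this record.

```python
import bz2
import json

def read_articles(path):
    """Stream records from a bz2-compressed JSON-lines file, one dict per line."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Usage:
# for article in read_articles("enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2"):
#     print(article["title"])  # field name assumed from segment_wiki's output format
```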

The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.

├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv

4 directories, 8 files
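The four-fold layout above can be walked programmatically. A minimal sketch, assuming only the directory tree shown (the CSV column names are not documented here, so only file paths are handled):

```python
import os

def fold_files(root):
    """Yield (fold_name, train_path, test_path) for each fold directory
    in the 4-fold layout shown above."""
    for fold in sorted(os.listdir(root)):
        train = os.path.join(root, fold, "train.csv")
        test = os.path.join(root, fold, "test.csv")
        if os.path.isfile(train) and os.path.isfile(test):
            yield fold, train, test

# Usage (archive extracted to "train_testdata_4folds"):
# for fold, train_path, test_path in fold_files("train_testdata_4folds"):
#     ...  # train on train_path, evaluate on test_path
```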

Pretrained models

PyTorch: vanilla and Siamese BERT + XLNet

Pretrained model for each fold is available in the corresponding model archives:

Vanilla

model_wiki.bert_base_joint_seq512.tar.gz
model_wiki.xlnet_base_joint_seq512.tar.gz

Siamese

model_wiki.bert_base_siamese_seq512_4d.tar.gz
model_wiki.xlnet_base_siamese_seq512_4d.tar.gz
