Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
Datasets (articles, class labels, cross-validation splits)
Pretrained models (Transformers, GloVe, Doc2vec)
Model output (prediction) for the best performing models
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are present in the training or test data.
The dataset is provided as used in stratified k-fold cross-validation (k=4) in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
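The corpus can be streamed without unpacking it, using only the Python standard library. A minimal sketch (the field names `title` and `section_texts` follow gensim's segment_wiki output, but verify them against the actual file); it builds and reads a miniature stand-in for the real archive:

```python
import bz2
import json
import tempfile
from pathlib import Path

# Build a miniature stand-in for the real archive: one JSON object per line,
# bz2-compressed, as in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2.
sample_articles = [
    {"title": "Example article", "section_texts": ["Plain text extracted with gensim."]},
    {"title": "Another article", "section_texts": ["More plain text."]},
]
corpus_path = Path(tempfile.mkdtemp()) / "corpus.jsonl.bz2"
with bz2.open(corpus_path, "wt", encoding="utf-8") as f:
    for article in sample_articles:
        f.write(json.dumps(article) + "\n")

# Stream the compressed corpus line by line and index the articles by title.
articles = {}
with bz2.open(corpus_path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        articles[doc["title"]] = doc
```

The same loop applies to the real archive by pointing `corpus_path` at the downloaded file.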
Pretrained models
PyTorch: vanilla and Siamese BERT + XLNet
A pretrained model for each fold is available in the corresponding model archives:
model_wiki.bert_base_joint_seq512.tar.gz
model_wiki.xlnet_base_joint_seq512.tar.gz
model_wiki.bert_base_siamese_seq512_4d.tar.gz
model_wiki.xlnet_base_siamese_seq512_4d.tar.gz
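In the Siamese configurations the two documents are encoded separately and their vectors combined before classification. The exact scheme is defined in the paper; a common combination for Siamese text encoders, and one plausible reading of the `4d` suffix above, concatenates the two vectors with their absolute difference and element-wise product:

```python
def siamese_features(u, v):
    """Combine two document embeddings u, v (equal length) into a
    4*d feature vector: [u; v; |u - v|; u * v]."""
    assert len(u) == len(v)
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod

# Toy 2-dimensional embeddings; the real models use d = 768.
features = siamese_features([1.0, -2.0], [0.5, 2.0])  # 4 * 2 = 8 features
```

The classifier head then operates on this combined vector rather than on the individual document embeddings.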
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Fyp Semantic is a dataset for semantic segmentation tasks - it contains Greenery VFcH annotations for 1,000 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Dataset Corrosion Seg Semantic is a dataset for semantic segmentation tasks - it contains Corrosion annotations for 978 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Shared and internationally recognized benchmarks are fundamental for the development of any computational system. We aim to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large English benchmark tailored for them. SICK consists of about 10,000 English sentence pairs that include many examples of the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of CDSMs. By means of crowdsourcing techniques, each pair was annotated for two crucial semantic tasks: relatedness in meaning (with a 5-point rating scale as gold score) and entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral). The SICK data set was used in SemEval-2014 Task 1, and it is freely available for research purposes.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Background
Information reported by scientific literature still remains locked up in discrete documents that are not always interconnected or machine-readable. The Semantic Web together with approaches such as the Resource Description Framework (RDF) and the Linked Open Data (LOD) initiative offer a connectivity tissue that can be used to support the generation of self-describing, machine-readable documents.
Results
Biotea is an approach to generate RDF from scholarly documents. Our RDF model makes extensive use of existing ontologies and semantic enrichment services. Our dataset comprises 270,834 articles from the PubMed Central Open Access subset in RDF/XML, distributed in 404 zipped files. The RDFization process takes care of metadata, e.g., title, authors, and journal, as well as semantic annotations on biological entities throughout the full text. Biological entities are extracted by using the NCBO Annotator and Whatizit.
We use the Bibliographic Ontology (BIBO), Dublin Core Metadata Initiative Terms (DCMI-terms), and the Provenance Ontology (PROV-O) to model the bibliographic metadata. Links to related pages such as PubMed HTML articles are provided via rdfs:seeAlso, while links to other semantic representations such as Bio2RDF PubMed articles are provided via owl:sameAs.
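The linking pattern can be illustrated with a small, purely hypothetical Turtle fragment (the article URI and identifiers below are invented for illustration, not taken from the dataset):

```turtle
@prefix bibo:    <http://purl.org/ontology/bibo/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .

# Hypothetical article resource showing the rdfs:seeAlso / owl:sameAs split.
<http://example.org/biotea/article/PMC1234567>
    a bibo:AcademicArticle ;
    dcterms:title "An example article title" ;
    rdfs:seeAlso <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1234567/> ;
    owl:sameAs <http://bio2rdf.org/pubmed:1234567> .
```

rdfs:seeAlso points at related human-readable pages, while owl:sameAs asserts identity with other semantic representations of the same article.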
The NCBO Annotator is used to extract entities covering ChEBI for chemicals; Pathway, and Functional Genomics Data Society (MGED) for genes and proteins; Master Drug Data Base (MDDB), NDDF, and NDFRT for drugs; SNOMED, SYMP, MedDRA, MeSH, MedlinePlus Health Topics (MedlinePlus), Online Mendelian Inheritance in Man (OMIM), FMA, ICD10, and Ontology for Biomedical Investigations (OBI) for diseases and medical terms; PO for plants; and MeSH, SNOMED, and NCIt for general terms.
Whatizit is used for GO, UniProt proteins, UniProt Taxonomy, and diseases mapped to the UMLS; UniProt taxa are also mapped to NCBI Taxon vocabulary.
Conclusions
Biotea delivers models and tools for metadata enrichment and semantic processing of biomedical documents. Our dataset provides easy access to the first batch of RDFized articles following the Biotea model. We plan to update the dataset on a regular basis in order to incorporate the latest articles added to the PubMed Central Open Access collection; the next release is planned for the first half of 2017. Future releases will support a mapping to the Semanticscience Integrated Ontology (SIO) in order to comply with the guidelines set by Bio2RDF.
Notes
Biotea approach in full is available at http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-S1-S5 (Garcia Castro, L.J., C. McLaughlin, and A. Garcia, Biotea: RDFizing PubMed Central in Support for the Paper as an Interface to the Web of Data. Biomedical semantics, 2013. 4 Suppl 1: p. S5).
Biotea algorithms are publicly available at https://github.com/biotea
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The popularity of Deep Learning (DL) methods used in business process management research and practice is constantly increasing. One important factor that hinders the adoption of DL in certain areas is the availability of sufficiently large training datasets, particularly affecting domains where process models are mainly defined manually with a high knowledge-acquisition effort. In this paper, we examine process model augmentation in combination with semi-supervised transfer learning to enlarge existing datasets and train DL models effectively. The use case of similarity learning between manufacturing process models is discussed. Based on a literature study of existing augmentation techniques, a concept is presented with different categories of augmentation, from knowledge-light approaches to knowledge-intensive ones, e.g., based on automated planning. Specifically, the impacts of augmentation approaches on the syntactic and semantic correctness of the augmented process models are considered. The concept also proposes a semi-supervised transfer learning approach to integrate augmented and non-augmented process model datasets in a two-phased training procedure. The experimental evaluation investigates augmented process model datasets regarding their quality for model training in the context of similarity learning between manufacturing process models. The results indicate a large potential, with a reduction of the prediction error of up to 53%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Urban Semantic is a dataset for object detection tasks - it contains Traffic Signs Streetlamp annotations for 7,107 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The accuracy of grape picking point localization is dependent on grape detection and semantic segmentation network performance. However, in practical application scenarios, the accuracy and segmentation precision of grape targets based on visible light images are susceptible to light variations and complex environments, often performing poorly. Moreover, grapes grow in bunches, and the existing multimodal datasets for apples and pears can hardly meet the recognition needs of bunch-shaped grapes. The construction of visible, depth, and near-infrared multimodal object detection and semantic segmentation datasets of grapes is crucial to exploring better recognition rates and stronger generalization capabilities for grape detection and semantic segmentation models. This dataset, totaling about 39.08 GB, contains high-quality multimodal video stream data of green and purple grapes, including six varieties, under different illumination and obscuration conditions. Additionally, the dataset offers 3954 labeled image samples extracted from the aforementioned multimodal video. By means of rotation, deflation, mis-slicing, panning, and Gaussian blur, the dataset can be augmented for the training implementation of mainstream deep learning models. The dataset can provide valuable basic data resources for multimodal fusion, grape semantic segmentation, and object detection, which have important practical application value for promoting research in the field of agricultural machinery and equipment intelligence.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_clone_detection_poj_104"
Dataset Summary
CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104. Given a piece of code and a collection of candidates as input, the task is to return the top-K codes with the same semantics. Models are evaluated by MAP score. We use the POJ-104 dataset for this task.
Supported Tasks and Leaderboards
document-retrieval: The… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_poj104.
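The MAP evaluation used for this task can be computed from ranked binary relevance judgements. A minimal sketch (the relevance lists here are illustrative, not drawn from POJ-104):

```python
def average_precision(relevance):
    """Average precision for one query, given a ranked list of binary
    relevance judgements (1 = candidate is a same-semantics clone)."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: mean of per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy queries: ranked candidates with relevance 1/0.
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
```

For the first query, precision at the relevant ranks 1 and 3 gives AP = (1/1 + 2/3) / 2 = 5/6; for the second, AP = 1/2; so MAP = 2/3.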
https://creativecommons.org/publicdomain/zero/1.0/
The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.
This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information as needed for our research purposes. We created our TIFF files at a resolution of 1200x800 pixels with 24 channels, each channel representing a class preprocessed from the PNG label files. We reduced the resolution and compressed the TIFF files with the tifffile Python library.
If you have any problem with the modified TIFF dataset, you can contact nunenuh@gmail.com and gaungalif@gmail.com.
This dataset is a copy of the original dataset (link below) with some improvements to the semantic data and classes. Semantic data is available in PNG and TIFF format with a smaller size as needed.
The images are labelled densely using polygons and contain the following 24 classes:
unlabeled, paved-area, dirt, grass, gravel, water, rocks, pool, vegetation, roof, wall, window, door, fence, fence-pole, person, dog, car, bicycle, tree, bald-tree, ar-marker, obstacle, conflicting
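Assuming each of the 24 TIFF channels is a binary mask for one class, in the order listed above, the per-pixel class index can be recovered by an argmax over channels. A dependency-free sketch (with real files one would load the array via the tifffile library instead of nested lists):

```python
CLASSES = [
    "unlabeled", "paved-area", "dirt", "grass", "gravel", "water",
    "rocks", "pool", "vegetation", "roof", "wall", "window", "door",
    "fence", "fence-pole", "person", "dog", "car", "bicycle", "tree",
    "bald-tree", "ar-marker", "obstacle", "conflicting",
]

def channels_to_class_map(mask):
    """mask: list of 24 channels, each a list of rows of 0/1 values.
    Returns one map of class indices (argmax over channels)."""
    height, width = len(mask[0]), len(mask[0][0])
    return [
        [max(range(len(mask)), key=lambda c: mask[c][y][x]) for x in range(width)]
        for y in range(height)
    ]

# Tiny 1x2 example: pixel 0 is class 3 ("grass"), pixel 1 is class 9 ("roof").
toy = [[[0, 0]] for _ in CLASSES]
toy[3][0][0] = 1
toy[9][0][1] = 1
class_map = channels_to_class_map(toy)
```

The inverse direction (class-index map to one-hot channels) follows the same indexing, which is what the 24-channel TIFF files encode.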
> images
> labels/png
> labels/tiff
- class_to_idx.json
- classes.csv
- classes.json
- idx_to_class.json
aerial@icg.tugraz.at
If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at
The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:
That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions. That you include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media link to the Semantic Drone Dataset webpage.
That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character. That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain. That all rights not expressly granted to you are reserved by us (Graz University of Technology).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The S1S2-Water dataset is a global reference dataset for training, validation and testing of convolutional neural networks for semantic segmentation of surface water bodies in publicly available Sentinel-1 and Sentinel-2 satellite images. The dataset consists of 65 triplets of Sentinel-1 and Sentinel-2 images with quality-checked binary water masks. Samples are drawn globally on the basis of the Sentinel-2 tile grid (100 x 100 km), taking into account predominant land cover and availability of water bodies. Each sample is complemented with metadata and a Digital Elevation Model (DEM) raster from the Copernicus DEM.
This work was supported by the German Federal Ministry of Education and Research (BMBF) through the project "Künstliche Intelligenz zur Analyse von Erdbeobachtungs- und Internetdaten zur Entscheidungsunterstützung im Katastrophenfall" (AIFER) under Grant 13N15525, and by the Helmholtz Artificial Intelligence Cooperation Unit through the project "AI for Near Real Time Satellite-based Flood Response" (AI4FLOOD) under Grant ZT-IPF-5-39.
Cityscapes data (dataset home page) contains labeled videos taken from vehicles driven in Germany. This version is a processed subsample created as part of the Pix2Pix paper. The dataset has still images from the original videos, and the semantic segmentation labels are shown in images alongside the original image. This is one of the best datasets around for semantic segmentation tasks.
This dataset has 2975 training image files and 500 validation image files. Each image file is 256x512 pixels, and each file is a composite with the original photo on the left half of the image and the labeled image (output of semantic segmentation) on the right half.
This dataset is the same as what is available here from the Berkeley AI Research group.
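Splitting a composite into input and target halves is a single slice per row. With the real PNGs one would use an image library (e.g. PIL's `Image.crop`); this dependency-free sketch uses nested lists standing in for pixel rows:

```python
def split_composite(image):
    """image: rows of pixels, each row 512 wide. Returns (photo, label):
    the left and right 256-pixel halves of each row."""
    photo = [row[:256] for row in image]
    label = [row[256:] for row in image]
    return photo, label

# Toy 2-row composite standing in for a 256x512 Cityscapes pair.
composite = [list(range(512)) for _ in range(2)]
photo, label = split_composite(composite)
```

For Pix2Pix-style training, `photo` is the model input and `label` is the target segmentation.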
The Cityscapes data available from cityscapes-dataset.com has the following license:
This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:
Can you identify what objects are where in these images taken from a vehicle?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates semantically relevant to the given paper.
The paper instances in the dataset are grouped by ORKG comparisons and therefore the data.json file is more comprehensive than training_set.json and test_set.json.
data.json
The main JSON object consists of a list of comparisons. Each comparison object has an ID, label, list of papers, and list of predicates. Each paper object has an ID, label, DOI, research field, research problems, and abstract. Each predicate object has an ID and a label. See an example instance below.
{
  "comparisons": [
    {
      "id": "R108331",
      "label": "Analysis of approaches based on required elements in way of modeling",
      "papers": [
        {
          "id": "R108312",
          "label": "Rapid knowledge work visualization for organizations",
          "doi": "10.1108/13673270710762747",
          "research_field": {
            "id": "R134",
            "label": "Computer and Systems Architecture"
          },
          "research_problems": [
            {
              "id": "R108294",
              "label": "Enterprise engineering"
            }
          ],
          "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..."
        },
        ....
      ],
      "predicates": [
        {
          "id": "P37126",
          "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation"
        },
        {
          "id": "P36081",
          "label": "approach name"
        },
        ....
      ]
    },
    ....
  ]
}
training_set.json and test_set.json
The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.
Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, but training papers can be duplicated in concatenation with different comparisons.
{
  "instances": [
    {
      "instance_id": "R108331xR108301",
      "comparison_id": "R108331",
      "paper_id": "R108301",
      "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models."
    },
    ...
  ]
}
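The instance construction described above can be reproduced from data.json with the standard library. A minimal sketch over a miniature of the structure (IDs from the example above; abstract text truncated):

```python
import json

# Miniature of the data.json structure described above.
data = json.loads("""
{"comparisons": [{"id": "R108331",
  "label": "Analysis of approaches based on required elements in way of modeling",
  "papers": [{"id": "R108301",
              "label": "A notation for Knowledge-Intensive Processes",
              "abstract": "Business process modeling has become essential ..."}],
  "predicates": [{"id": "P36081", "label": "approach name"}]}]}
""")

# Rebuild training-style instances: instance_id is comparison_id x paper_id,
# text is the paper's label (title) concatenated with its abstract.
instances = [
    {
        "instance_id": f"{comp['id']}x{paper['id']}",
        "text": f"{paper['label']} {paper['abstract']}",
    }
    for comp in data["comparisons"]
    for paper in comp["papers"]
]
```

Running the same comprehension over the full data.json yields instances in the format of training_set.json and test_set.json.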
Dataset Statistics:

|                 | Papers | Predicates | Research Fields | Research Problems |
|-----------------|--------|------------|-----------------|-------------------|
| Min/Comparison  | 2      | 2          | 1               | 0                 |
| Max/Comparison  | 202    | 112        | 5               | 23                |
| Avg./Comparison | 21.54  | 12.79      | 1.20            | 1.09              |
| Total           | 4060   | 1816       | 46              | 178               |
Dataset Splits:

|              | Papers | Comparisons |
|--------------|--------|-------------|
| Training Set | 2857   | 214         |
| Test Set     | 1203   | 180         |
This record contains the datasets and models used and produced for the work reported in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers" (link).
Please cite this paper if you are using the models/datasets or find it relevant to your research:
@article{barman_combining_2020,
  title = {{Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers}},
  author = {Raphaël Barman and Maud Ehrmann and Simon Clematide and Sofia Ares Oliveira and Frédéric Kaplan},
  journal = {Journal of Data Mining \& Digital Humanities},
  volume = {HistoInformatics},
  doi = {10.5281/zenodo.4065271},
  year = {2021},
  url = {https://jdmdh.episciences.org/7097},
}
Please note that this record contains data under different licenses.
1. DATA
2. MODELS
Some of the best models are released under a CC BY-SA 4.0 license (they are also available as assets of the current Github release).
The released models cover four content classes (Serial, Weather, Death notice, and Stocks), including models dedicated to the Death notice class.

Those models can be used to predict probabilities on new images using the same code as in the original dhSegment repository. One needs to adjust three parameters of the `predict` function: 1) `embeddings_path` (the path to the embeddings list), 2) `embeddings_map_path` (the path to the compressed embedding map), and 3) `embeddings_dim` (the size of the embeddings).
Please refer to the paper for further information or contact us.
3. CODE:
https://github.com/dhlab-epfl/dhSegment-text
4. ACKNOWLEDGEMENTS
We warmly thank the journal Le Temps (owner of La Gazette de Lausanne and the Journal de Genève) and the group ArcInfo (owner of L'Impartial) for agreeing to share the related datasets for academic purposes. We also thank the National Library of Luxembourg for its support with all steps related to the Luxemburger Wort annotation release.
This work was realized in the context of the impresso - Media Monitoring of the Past project and supported by the Swiss National Science Foundation under grant CR- SII5_173719.
5. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Simon Clematide (UZH)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The smart grid is based on the physical grid, introducing various advanced communication technologies to form a new type of power grid. It can not only meet user demand and realize the optimal allocation of resources, but also improve the safety, economy, and reliability of the power supply; it has become a major trend in the future development of the electric power industry. On the other hand, the complex network architecture of the smart grid and the application of various high-tech technologies have also greatly increased the probability of equipment failure and the difficulty of fault diagnosis, so timely discovery and diagnosis of problems in the operation of smart grid equipment has become a key measure to ensure the safety of power grid operation. At present, existing smart grid equipment fault diagnosis technology suffers from complex application programs and a generally low fault diagnosis rate, which greatly affects the efficiency of smart grid maintenance. Therefore, this paper adopts a multimodal semantic model combining deep learning and a knowledge graph: on the basis of the original YOLOv4 object detection architecture, it introduces a knowledge graph to unify the representation and storage of the input multimodal information, and innovatively combines the YOLOv4 object detection algorithm with the knowledge graph to establish a smart grid equipment fault diagnosis model. Experiments show that, compared with existing fault detection algorithms, the YOLOv4-based model constructed in this paper is more accurate, faster, and easier to operate.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
GEOSatDB is a semantic representation of Earth observation satellites and sensors that can be used to easily discover available Earth observation resources for specific research objectives.

Background

The widespread availability of coordinated and publicly accessible Earth observation (EO) data empowers decision-makers worldwide to comprehend global challenges and develop more effective policies. Space-based satellite remote sensing, which serves as the primary tool for EO, provides essential information about the Earth and its environment by measuring various geophysical variables. This contributes significantly to our understanding of the fundamental Earth system and the impact of human activities.

Over the past few decades, many countries and organizations have markedly improved their regional and global EO capabilities by deploying a variety of advanced remote sensing satellites. The rapid growth of EO satellites and advances in on-board sensors have significantly enhanced remote sensing data quality by expanding spectral bands and increasing spatio-temporal resolutions. However, users face challenges in accessing available EO resources, which are often maintained independently by various nations, organizations, or companies. As a result, a substantial portion of archived EO satellite resources remains underutilized. Enhancing the discoverability of EO satellites and sensors can effectively utilize the vast amount of EO resources that continue to accumulate at a rapid pace, thereby better supporting data for global change research.

Methodology

This study introduces GEOSatDB, a comprehensive semantic database specifically tailored for civil Earth observation satellites. The foundation of the database is an ontology model conforming to standards set by the International Organization for Standardization (ISO) and the World Wide Web Consortium (W3C). This conformity enables data integration and promotes the reuse of accumulated knowledge.

Our approach advocates a novel method for integrating Earth observation satellite information from diverse sources. It notably incorporates a structured prompt strategy utilizing a large language model to derive detailed sensor information from vast volumes of unstructured text.

Dataset Information

The GEOSatDB portal (https://www.geosatdb.cn) has been developed to provide an interactive interface that facilitates the efficient retrieval of information on Earth observation satellites and sensors.

The downloadable files in RDF Turtle format are located in the data directory and contain a total of 132,681 statements:

- GEOSatDB_ontology.ttl: ontology modeling of concepts, relations, and properties.
- satellite.ttl: 2,453 Earth observation satellites and their associated entities.
- sensor.ttl: 1,035 Earth observation sensors and their associated entities.
- sensor2satellite.ttl: relations between Earth observation satellites and sensors.

GEOSatDB undergoes quarterly updates, involving the addition of new satellites and sensors, revisions based on expert feedback, and the implementation of additional enhancements.
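A hypothetical Turtle fragment illustrating how the satellite, sensor, and relation files might fit together (the `gsdb:` namespace and the class/property names below are invented for illustration; see GEOSatDB_ontology.ttl for the actual vocabulary):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gsdb: <http://example.org/geosatdb/> .  # hypothetical namespace

# satellite.ttl: a satellite entity
gsdb:Sentinel-2A a gsdb:EarthObservationSatellite ;
    rdfs:label "Sentinel-2A" .

# sensor.ttl: a sensor entity
gsdb:MSI a gsdb:Sensor ;
    rdfs:label "MultiSpectral Instrument" .

# sensor2satellite.ttl: the relation linking the two
gsdb:Sentinel-2A gsdb:hasSensor gsdb:MSI .
```

Keeping entities and relations in separate files lets each part be updated or queried independently while the triples still join on the shared URIs.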
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
RadSeg is a synthetic radar dataset designed for building semantic segmentation models for radar activity recognition. Unlike existing radio classification datasets that only provide signal-wise annotations for short and isolated I/Q sequences, RadSeg provides sample-wise annotations for interleaved radar pulse activities that extend across a long time horizon. This makes RadSeg the first annotated public dataset of its kind for radar activity recognition.
Further information about the RadSeg dataset is available in our paper:
Z. Huang, A. Pemasiri, S. Denman, C. Fookes and T. Martin, Multi-Stage Learning for Radar Pulse Activity Segmentation, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 7340-7344, doi: 10.1109/ICASSP48485.2024.10445810
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
We release a new dataset, MS-CXR, with locally-aligned phrase grounding annotations by board-certified radiologists to facilitate the study of complex semantic modelling in biomedical vision–language processing. The MS-CXR dataset provides 1162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. This dataset complements the existing MIMIC-CXR v.2 dataset and comprises: 1. Reviewed and edited bounding boxes and phrases (1026 pairs of bounding box/sentence); and 2. Manual bounding box labels from scratch (136 pairs of bounding box/sentence).
This large, well-balanced phrase grounding benchmark dataset contains carefully curated image regions annotated with descriptions of eight radiology findings, as verified by radiologists. Unlike existing chest X-ray benchmarks, this challenging phrase grounding task evaluates joint, local image-text reasoning while requiring real-world language understanding, e.g. to parse domain-specific location references, complex negations, and bias in reporting style. This data accompany work showing that principled textual semantic modelling can improve contrastive learning in self-supervised vision–language processing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you want to use this data, please cite our article: Xiong, S., Zhang, X., Lei, Y., Tan, G., Wang, H., & Du, S. (2024). Time-series China urban land use mapping (2016–2022): An approach for achieving spatial-consistency and semantic-transition rationality in temporal domain. Remote Sensing of Environment, 312, 114344.

The global urbanization trend is geographically manifested through city expansion and the renewal of internal urban structures and functions. Time-series urban land use (ULU) maps are vital for capturing dynamic land changes in the urbanization process, giving valuable insights into urban development and its environmental consequences. Recent studies have mapped ULU in some cities with a unified model but ignored the regional differences among cities, and they generated ULU maps year by year but ignored temporal correlations between years; thus, they could be weak in large-scale and long time-series ULU monitoring. Accordingly, we introduce a temporal-spatial-semantic collaborative (TSS) mapping framework for generating accurate ULU maps that considers regional differences and temporal correlations. Firstly, to support model training, a large-scale ULU sample dataset based on OpenStreetMap (OSM) and Sentinel-2 imagery is automatically constructed, providing a total of 56,412 samples with a size of 512 × 512, which are divided into six sub-regions in China and used for training different classification models. Then, an urban land use mapping network (ULUNet) is proposed to recognize ULU. This model utilizes a primary and an auxiliary encoder to process noisy OSM samples and can enhance the model's robustness under noisy labels. Finally, taking the temporal correlations of ULU into consideration, the recognized ULU maps are optimized: their boundaries are unified by a time-series co-segmentation, and their categories are modified by a knowledge-data driven method.
To verify the effectiveness of the proposed method, we consider all urban areas in China (254,566 km2) and produce a time-series China urban land use dataset (CULU) at 10-m resolution, spanning 2016 to 2022, with an overall accuracy of 82.42%. A comparison shows that CULU outperforms existing datasets such as EULUC-China and UFZ-31cities in accuracy, spatial boundary consistency, and the logicality of land use transitions. The results indicate that the proposed method and the generated dataset can play important roles in land use change monitoring, ecological-environmental evolution analysis, and sustainable city development.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Intelligent Energy Systems Ontology (IESO) provides semantic interoperability within a society of multi-agent systems (MAS) developed in the scope of power and energy systems (PES). It leverages the knowledge of existing, publicly available semantic models developed for specific PES subdomains to establish a shared vocabulary among the agents of the MAS community, overcoming heterogeneity among the reused ontologies. IESO provides agents with semantic reasoning, constraint validation, and data uniformization. The use of IESO is demonstrated through the simulation of the management of a rural distribution network, considering the validation of the grid’s technical constraints. This dataset publishes files demonstrating: i) a snapshot of the initial semantic knowledge base (KB); ii) queries to the KB to obtain service inputs; iii) conversions between syntactic and semantic models; iv) constraint validations; v) automatic conversion of units of measure.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users based on the similarity between a query document and their repository. However, they often fail to identify the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best-performing system, with an F1-score of 0.93, which we examine manually to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps toward the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
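As a small illustration of the "vector concatenation scheme" configuration mentioned above, the sketch below combines two document embeddings into one pairwise feature vector. The specific 4-way combination (u, v, |u − v|, u ⊙ v) is an assumption suggested by the `_4d` suffix of the Siamese model archives, not a confirmed detail of the paper.

```python
def concat_features(u, v):
    """Combine two document embeddings into one pairwise feature vector.

    Hypothetical 4-way scheme: [u; v; |u - v|; u * v] — the elementwise
    difference and product expose per-dimension (dis)similarity to the
    downstream pair classifier.
    """
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod

u = [0.2, -0.5, 0.9]  # toy 3-dimensional document embeddings
v = [0.1, 0.4, 0.9]
features = concat_features(u, v)
print(len(features))  # 4 * 3 = 12
```

With real models the inputs would be, e.g., averaged GloVe vectors or the pooled output of each Siamese branch; the classifier then operates on the concatenated vector.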
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
Datasets (articles, class labels, cross-validation splits)
Pretrained models (Transformers, GloVe, Doc2vec)
Model outputs (predictions) for the best-performing models
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that appear in the training or test data.
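The corpus can be streamed line by line with the standard library alone; each line is one JSON-encoded article. A minimal sketch, assuming the field names ("title", "section_titles", "section_texts") follow the usual output of gensim.scripts.segment_wiki — verify them against the actual archive:

```python
import bz2
import json

def iter_articles(path):
    """Stream articles from a bz2-compressed JSON-lines file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Tiny demo with a synthetic one-line file in the assumed format:
sample = {"title": "Example", "section_titles": ["Intro"], "section_texts": ["..."]}
with bz2.open("demo.jsonl.bz2", mode="wt", encoding="utf-8") as fh:
    fh.write(json.dumps(sample) + "\n")

titles = [a["title"] for a in iter_articles("demo.jsonl.bz2")]
print(titles)  # ['Example']
```

Streaming avoids decompressing the whole archive into memory, which matters for the full 10k-article file.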
The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
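The split files above are authoritative; for readers who want to see what "stratified k-fold with k=4" means in practice, here is an illustrative pure-Python sketch (round-robin per class, a simplification — the paper's actual splitting code may differ):

```python
from collections import defaultdict

def stratified_folds(labels, k=4):
    """Assign each sample index to one of k folds so that every fold
    receives a roughly equal share of each class label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal indices of this class out to the folds in turn.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Toy example: 8 samples of class "a", 4 of class "b".
labels = ["a"] * 8 + ["b"] * 4
folds = stratified_folds(labels, k=4)
print([len(f) for f in folds])  # [3, 3, 3, 3]
```

For each fold i, test.csv corresponds to the held-out fold and train.csv to the union of the remaining three.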
Pretrained models
PyTorch: vanilla and Siamese BERT + XLNet
A pretrained model for each fold is available in the corresponding model archives:
model_wiki.bert_base_joint_seq512.tar.gz
model_wiki.xlnet_base_joint_seq512.tar.gz
model_wiki.bert_base_siamese_seq512_4d.tar.gz
model_wiki.xlnet_base_siamese_seq512_4d.tar.gz