33 datasets found
  1. Two residential districts datasets from Kielce, Poland for building semantic segmentation task

    • scidb.cn
    Updated Sep 29, 2022
    Cite
    Agnieszka Łysak (2022). Two residential districts datasets from Kielce, Poland for building semantic segmentation task [Dataset]. http://doi.org/10.57760/sciencedb.02955
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Agnieszka Łysak
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Poland, Kielce
    Description

    Today, deep neural networks are widely used in many computer vision problems, including those involving geographic information systems (GIS) data. This type of data is commonly used for urban analyses and spatial planning. We used orthophotographic images of two residential districts from Kielce, Poland for research on automatic urban sprawl analysis with a Transformer-based neural network. Orthophotomaps were obtained from the Kielce GIS portal. The maps were then manually masked into building and building-surroundings classes. Finally, each orthophotomap and its corresponding classification mask were simultaneously divided into small tiles. This approach is common in image data preprocessing for the training phase of machine learning algorithms. The data contain two original orthophotomaps from the Wietrznia and Pod Telegrafem residential districts with corresponding masks, as well as their tiled versions, ready to be provided as training data for machine learning models.

    A Transformer-based neural network was trained on the Wietrznia dataset for semantic segmentation of the tiles into building and surroundings classes. Inference was then run on the Pod Telegrafem dataset to test the model's generalization ability. The efficiency of the model was satisfactory, so it can be used for automatic semantic building segmentation. The tiling process can then be reversed and the complete classification mask recovered. This mask can be used for calculating building areas and monitoring urban sprawl, if the research is repeated for GIS data from a wider time horizon.

    Since the dataset was collected from the Kielce GIS portal, as part of the data resources of the Polish Main Office of Geodesy and Cartography, it may be used only for non-profit and non-commercial purposes, in private or scientific applications, under the law "Ustawa z dnia 4 lutego 1994 r. o prawie autorskim i prawach pokrewnych (Dz.U. z 2006 r. nr 90 poz 631 z późn. zm.)". There are no other legal or ethical considerations regarding reuse potential.

    Data information is presented below.

    • wietrznia_2019.jpg - orthophotomap of Wietrznia district - used for model's training, as an explanatory image
    • wietrznia_2019.png - classification mask of Wietrznia district - used for model's training, as a target image
    • wietrznia_2019_validation.jpg - one image from Wietrznia district - used for model's validation during the training phase
    • pod_telegrafem_2019.jpg - orthophotomap of Pod Telegrafem district - used for model's evaluation after the training phase
    • wietrznia_2019 - folder with wietrznia_2019.jpg (image) and wietrznia_2019.png (annotation) divided into 810 tiles (512 x 512 pixels each); tiles with no information were manually removed, so the training data contain only informative tiles; these tiles were presented to the model during training (images and annotations for fitting the model to the data)
    • wietrznia_2019_validation - folder with wietrznia_2019_validation.jpg divided into 16 tiles (256 x 256 pixels each); these tiles were presented to the model during training (images for validating the model's efficiency); they were not part of the training data
    • pod_telegrafem_2019 - folder with pod_telegrafem.jpg divided into 196 tiles (256 x 256 pixels each); these tiles were presented to the model during inference (images for evaluating the model's robustness)

    The dataset was created as described below.

    Firstly, the orthophotomaps were collected from Kielce Geoportal (https://gis.kielce.eu). Kielce Geoportal offers a recent map from April 2019. It is an orthophotomap with a resolution of 5 x 5 pixels, constructed from a plane flight at 700 meters above ground, taken with a camera for vertical photos. Downloading was done over WMS in the open-source QGIS software (https://www.qgis.org), as a 1:500 scale map, then converted to a 1200 dpi PNG image.

    Secondly, the map of the Wietrznia residential district was manually labelled, also in QGIS, at the same scope as the orthophotomap. The annotation was based on land cover map information, also obtained from Kielce Geoportal. There are two classes: residential building and surroundings. The second map, from the Pod Telegrafem district, was not annotated, since it was used in the testing phase and imitates the situation where no annotation exists for new data presented to the model.

    Next, the images were converted to RGB JPG images, and the annotation map was converted to an 8-bit GRAY PNG image.

    Finally, the Wietrznia data files were tiled into 512 x 512 pixel tiles using the Python PIL library. Tiles with no information or a relatively small amount of information (only white background or mostly white background) were manually removed. So, from the 29113 x 15938 pixel orthophotomap, only 810 tiles with corresponding annotations were left, ready to train the machine learning model for the semantic segmentation task. The Pod Telegrafem orthophotomap was tiled with no manual removal, so from the 7168 x 7168 pixel orthophotomap 197 tiles of 256 x 256 pixels were created. There was also an image of one residential building, used for the model's validation during the training phase; it was not part of the training data, but was part of the Wietrznia residential area. It was a 2048 x 2048 pixel orthophotomap, tiled into 16 tiles of 256 x 256 pixels each.
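
    A minimal sketch of the tiling step described above, using PIL and NumPy; the file names, output folder, and the "mostly white" threshold are illustrative assumptions, since in the published dataset uninformative tiles were removed manually:

        import os
        import numpy as np
        from PIL import Image

        TILE = 512  # tile edge length used for the Wietrznia training data
        Image.MAX_IMAGE_PIXELS = None  # the source orthophotomaps exceed PIL's default safety limit
        os.makedirs("tiles", exist_ok=True)

        image = Image.open("wietrznia_2019.jpg").convert("RGB")
        mask = Image.open("wietrznia_2019.png").convert("L")

        kept = 0
        for top in range(0, image.height - TILE + 1, TILE):
            for left in range(0, image.width - TILE + 1, TILE):
                box = (left, top, left + TILE, top + TILE)
                img_tile = image.crop(box)
                # skip tiles that are (almost) entirely white background
                if np.asarray(img_tile).mean() > 250:
                    continue
                img_tile.save(f"tiles/wietrznia_{top}_{left}.jpg")
                mask.crop(box).save(f"tiles/wietrznia_{top}_{left}.png")
                kept += 1
        print(f"kept {kept} informative tiles")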

  2. Semantic-based Process Mining and Analysis

    • figshare.com
    pdf
    Updated May 30, 2020
    Cite
    Kingsley Okoye (2020). Semantic-based Process Mining and Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.7387523.v2
    Explore at:
    pdf
    Dataset updated
    May 30, 2020
    Dataset provided by
    figshare
    Authors
    Kingsley Okoye
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Semantic-based Process Mining and Analysis

    What is it about?
    The work in this research proves that semantic-based process mining and analysis is a useful technique, especially in solving some didactic issues and answering questions regarding ontology-based methods for the automatic discovery of different patterns or behaviours within any given process domain.

    Why is it important?
    The work extracts streams of event logs from any given process execution environment and then describes formats that allow for abstract mining and improved process analysis of the captured data sets and models. Technically, the method makes use of semantic annotations, or rather process description languages, to link elements within the event logs and models (e.g. using the case study of the research learning process) with the concepts they represent in an ontology specifically designed for representing the process. The results show that a system which is formally encoded with semantic labelling (annotation), semantic representation (ontology) and semantic reasoning (reasoner) has the capacity to lift process mining and analysis from the syntactic to a more conceptual level.

    Dr Kingsley Okoye (Author), University of East London

  3. Modeling Safety and Security Compliance in a Pilot Factory: A Cobot and Milling Machine Use Case Using AutomationML and OWL

    • researchdata.tuwien.at
    bin, png
    Updated May 13, 2025
    Cite
    Mukund Padmakarrao Bhole (2025). Modeling Safety and Security Compliance in a Pilot Factory: A Cobot and Milling Machine Use Case Using AutomationML and OWL [Dataset]. http://doi.org/10.48436/x3z0z-05k44
    Explore at:
    png, bin
    Dataset updated
    May 13, 2025
    Dataset provided by
    TU Wien
    Authors
    Mukund Padmakarrao Bhole
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset demonstrates how AutomationML (an XML-based standard for exchanging engineering data in industrial automation) and OWL (the Web Ontology Language for semantic modeling) can be used to represent safety and security aspects in a smart factory setup.

    Applications of AutomationML

    1. System Integration and Monitoring: AutomationML helps connect OT systems with real-time monitoring tools like OPC UA, enabling continuous supervision of devices such as PLCs and sensors.
    2. Asset Risk Modeling: By integrating standards like IEC 62443, AML supports the modeling of security-focused assets and risk assessments.
    3. Network Security and Topology: AML can model network structures, define security zones and secure interconnections — useful for ICS environments.
    4. RoleClass Libraries and Semantics: External classification systems like eCl@ss and IEC 62443 can be used with AML to improve semantic context and classification of assets.
    5. Detailed Asset Modeling: AML is used to represent OT components such as sensors, actuators, controllers, and network devices, including their communication protocols and connections.

    Applications of OWL

    1. Ontology Visualization: Tools like Protégé allow visualization of relationships between system components like PLCs, sensors, and firewalls.
    2. Security Risk Assessment: OWL models can be queried using SPARQL or DL queries to detect vulnerabilities in industrial systems (a minimal query sketch follows this list).
    3. Compliance Reporting: OWL ontologies integrated with reasoning engines allow automated generation of reports for standards such as IEC 62443.
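
    A minimal sketch of such a SPARQL query with rdflib in Python; the file name, namespace, and the Component/hasRequirement vocabulary are illustrative assumptions, not the ontology published in this dataset:

        from rdflib import Graph

        g = Graph()
        g.parse("pilot_factory.owl", format="xml")  # hypothetical export of the OWL model

        # list components that have no associated security requirement (illustrative vocabulary)
        query = """
        PREFIX ex: <http://example.org/pilotfactory#>
        SELECT ?component WHERE {
            ?component a ex:Component .
            FILTER NOT EXISTS { ?component ex:hasRequirement ?req . }
        }
        """
        for row in g.query(query):
            print(row.component)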

    Use Case: TU Wien Pilot Factory

    We demonstrate the proposed AutomationML and OWL modelling with a use case, illustrated in the figure below, showing the deployment of an automated smart pilot factory setup. This setup includes an ABB collaborative robotic arm and critical components such as the SINUMERIK PCU and NCU controllers, which manage the EMCO MAXXTURN 45 CNC milling machine. The network is secured through MGUARD routers, enterprise security gateways, and managed switches for handling data traffic. A remote maintenance server is enabled via secure connections, and remote communication is facilitated by an OPC UA server connected to multiple hosts. The robotic arm has appropriate tools and end-effectors in the CNC machine's workspace. The completed workpiece from the CNC machine is picked up by the robotic arm and placed in a nearby tray for further processing. This integrated approach enables real-time monitoring, predictive maintenance, and efficient handling of maintenance tasks, thereby optimizing production processes in the CNC machining environment. Additionally, it helps identify potential security vulnerabilities.

    Classes Modeled in the System

    • System Under Consideration: Defines what is being analyzed.
    • Group: Logical or organizational groupings.
    • Component: Hardware and software parts of the system.
    • Requirement: Safety and security rules and goals.
    • Stakeholders: People or groups with an interest in the system.
    • Parameter: Technical settings or values for system components.
    • Unit: Measurement units for parameters.
    • Connection: Relationships or data links between system parts.

    Safety and security compliance

    The standards used in this representation are as follows. For safety, we use IEC 61508, an international standard for functional safety of electrical, electronic, and programmable electronic safety-related systems; it outlines methods for designing, deploying, and maintaining such systems, particularly those with automatic protection functions. For security, we use IEC 62443-3-3, which defines system security requirements and security capability levels for building an IACS that meets the target security level and for evaluating your practice against each requirement.

    Related Publications

    1. M. Bhole, W. Kastner and T. Sauter, "From Manual to Semi-Automated Safety and Security Requirements Engineering: Ensuring Compliance in Industry 4.0," IECON 2024 - 50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, IL, USA, 2024, pp. 1-8, doi: 10.1109/IECON55916.2024.10905636.
    2. M. Bhole, T. Sauter, S. Semper and W. Kastner, "Why to Fail Fast and Often: A Strategy for OT Safety and Security Evaluation," in IEEE Access, vol. 13, pp. 51793-51812, 2025, doi: 10.1109/ACCESS.2025.3553011.
  4. Data from: Automatic Detection of Ditches and Natural Streams from Digital Elevation Models Using Deep Learning

    • researchdata.se
    Updated Mar 15, 2024
    + more versions
    Cite
    Mariana dos Santos Toledo Busarello; William Lidberg; Anneli Ågren; Florian Westphal (2024). Automatic Detection of Ditches and Natural Streams from Digital Elevation Models Using Deep Learning [Dataset]. http://doi.org/10.5878/jrex-z325
    Explore at:
    (75), (10003963896), (119121), (90), (6577796), (77), (74), (143923371), (788389), (62817)
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Swedish University of Agricultural Sciences
    Authors
    Mariana dos Santos Toledo Busarello; William Lidberg; Anneli Ågren; Florian Westphal
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    This dataset contains the digital elevation models (DEMs) and polyline shapefiles with the location of channels from the 12 study areas used in this study. It also includes the code to generate the datasets used to train the deep learning models to detect channels, ditches, and streams, and to calculate the topographic indices. The code to train the models is included as well, along with the best-performing models at 0.5 m resolution. The channels were mapped differently depending on their type: ditches were manually digitized based on visual analysis of orthophotos and of topographic indices derived from the DEM, while streams were mapped by first detecting all natural channel heads, then tracing the downstream channels, and finally editing them manually based on orthophotos.
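
    The dataset ships its own preprocessing code; purely as an illustration of what a simple topographic index computation looks like, here is a slope calculation over a DEM grid with NumPy (the file name and the 0.5 m cell size are assumptions):

        import numpy as np

        def slope_degrees(dem: np.ndarray, cell_size: float = 0.5) -> np.ndarray:
            """Slope in degrees for a 2D elevation grid with square cells."""
            dz_dy, dz_dx = np.gradient(dem, cell_size)
            return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

        dem = np.load("study_area_dem.npy")  # hypothetical DEM stored as a NumPy array
        print("max slope:", slope_degrees(dem).max())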

  5. Automated MIAPPE Compliance Validation

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Bruno Costa; João Cardoso; Daniel Faria (2023). Automated MIAPPE Compliance Validation [Dataset]. http://doi.org/10.6084/m9.figshare.6392537.v3
    Explore at:
    pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Bruno Costa; João Cardoso; Daniel Faria
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Objective
    This poster describes the preliminary results of a study of the use of semantic web technologies to tackle both the representation and the validation of conformity of plant phenotyping experiment datasets with the MIAPPE standard.

    Motivation
    Plant phenotyping research generates large datasets that are often not reusable due to the lack of standardization in their descriptors. The Minimal Information about a Plant Phenotyping Experiment (MIAPPE) is a Minimum Information (MI) standard for plant phenotyping that has been introduced to mitigate this issue. To this effect, MIAPPE provides a closed set of metadata descriptors to which a plant phenotyping dataset must conform in order to be classified as MIAPPE-compliant. As researchers begin to use MIAPPE descriptors to annotate their datasets, it will be necessary to assert the compliance of these datasets with the MIAPPE standard in an automated manner, so that they can be inserted into MIAPPE repositories, or so that feedback can be provided to enable researchers to adjust their data annotation and ensure acceptance by such repositories.

    Proposed Approach
    1. Analyse the current state of both MIAPPE and a published plant phenotyping dataset, and attempt to represent them both under a common data model using semantic web technologies.
    2. Attempt to verify the conformance of the metadata descriptors of the plant phenotyping dataset with the MIAPPE specification.
    3. Document the results of the verification in the form of a report, detailing the extent of compliance of the plant phenotyping dataset with the MIAPPE specification.

    Current Status
    Our proposed approach is represented in the attached BPMN business process. At present our effort has focused on the Dataset Representation part of the business process. So far, we have been able to successfully pre-process and structure a plant phenotyping dataset, and we are in the process of creating the mappings to our selected domain ontology (Plant Phenotyping Experiment Ontology).

    Future Ideas
    Use similarity matching algorithms to aid in the mapping creation. Explore the possible use of machine-learning techniques both in the mapping creation and in the Dataset Validation.
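
    The poster does not include its validation code; one way to automate this kind of conformance check with semantic web tooling is SHACL via pySHACL, sketched below under the assumption that the dataset metadata is available as RDF and the MIAPPE descriptors have been expressed as SHACL shapes (both file names are hypothetical):

        from rdflib import Graph
        from pyshacl import validate

        data = Graph().parse("phenotyping_dataset.ttl", format="turtle")   # dataset metadata as RDF
        shapes = Graph().parse("miappe_shapes.ttl", format="turtle")       # MIAPPE descriptors as SHACL shapes

        conforms, _, report_text = validate(data, shacl_graph=shapes, inference="rdfs")
        print("MIAPPE-compliant:", conforms)
        print(report_text)  # human-readable report of missing or malformed descriptors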

  6. Autonomous Knowledge Extractor

    • data.europa.eu
    html, json
    Updated Mar 19, 2024
    Cite
    Business Online Services Sp. z o.o. (2024). Autonomous Knowledge Extractor [Dataset]. https://data.europa.eu/data/datasets/https-dane-gov-pl-pl-dataset-3071-autonomiczny-ekstraktor-wiedzy/embed
    Explore at:
    json(147186), html(32526), json(1236), json(185425), json(254647), html(35659), html(50411), json(1287), json(1181)
    Dataset updated
    Mar 19, 2024
    Dataset authored and provided by
    Business Online Services Sp. z o.o.
    Description

    Industrial research: Task No. 1 - Development of algorithms for extracting objects from data

    The task includes industrial work consisting of the development of algorithms for extracting objects from data. The basic assumption of the semantic web is to operate on objects that have specific attributes and relations between them. Assuming that the input data to the system usually have a weak structure (textual or structured documents with general attributes, e.g. title, creator, etc.), it is necessary to develop methods for extracting objects of basic types representing typical concepts from the real world, such as people, institutions, places, dates, etc. Tasks of this type are performed by algorithms from the field of natural language processing and entity extraction. The main technological issue of this stage was to develop an algorithm that would extract entities from documents with a weak structure (in the extreme case, plain text documents) as efficiently as possible. For this purpose, it was necessary to process documents converted into the shared internal representation and extract entities in a generalized way, regardless of the source form of the document.

    Detailed tasks (milestones):

    Development of an algorithm for pre-processing data and an internal representation of a generalized document. As part of the task, methods will be selected for pre-processing documents from various sources and in various formats into a common form on which further algorithms will operate. As input: text documents (pdf, word, etc.), scans of printed documents (handwriting is not included), web documents (HTML pages), other databases (relational tables), csv/xls files, and XML files.

    Development of an algorithm for extracting simple attributes from documents - Extraction of simple scalar attributes from processed documents, such as dates and numbers, taking into account the metadata existing in source systems and document templates for groups of documents with a similar structure.

    Development of an entity extraction algorithm from documents for basic classes of objects - entity extraction in unstructured text documents using NLP techniques, based on a language corpus developed for Polish and English with the possibility of extension to other languages, taking into account the basic types of real-world objects (places, people, institutions, events, etc.).
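
    As a rough illustration of this kind of basic-class entity extraction (not the project's own algorithm), a few lines with spaCy; the pl_core_news_sm and en_core_web_sm pipelines must be downloaded separately, and the sample sentence is made up:

        import spacy

        # pretrained pipelines exist for both languages mentioned above
        nlp_en = spacy.load("en_core_web_sm")
        nlp_pl = spacy.load("pl_core_news_sm")

        doc = nlp_en("The agreement was signed in Warsaw on 12 May 2021 by Business Online Services.")
        for ent in doc.ents:
            print(ent.text, ent.label_)  # e.g. Warsaw GPE, 12 May 2021 DATE, Business Online Services ORG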

    Industrial research: Task No. 2 - Development of algorithms for automatic ontology creation

    As part of task 2, it is planned to develop algorithms for automatic ontology creation. Reducing the impact of the human factor on data organization processes requires the development of algorithms that will significantly automate the process of classifying and organizing data imported to the system. It requires the use of advanced knowledge modeling techniques such as ontology extraction and thematic modeling. These algorithms are usually based on text statistics and the quality of their operation largely depends on the quality of the input data. This creates the risk that models created by algorithms may differ from expert models used by field experts. It is therefore necessary to take this risk into account in the architecture of the solution.

    Detailed tasks (milestones):

    Development of an algorithm for organizing objects in dictionaries and deduplication of entities in dictionaries - The purpose of the task is to develop an algorithm that organizes objects identified in previously developed algorithms in such a way as to prevent duplication of objects representing the same concepts and to enable the presentation of appropriate relationships between nodes of the semantic network.

    Development of an extraction algorithm for a domain ontological model - Requires the use of sophisticated methods of analyzing the accumulated corpus of documents in terms of identifying concepts and objects specific to the domain. The task will be carried out by a research unit experienced in the field of creating ontological models.

    Development of a semantic tagging algorithm - Requires the use of topic modeling methods. The task will be carried out by a research unit experienced in the field of creating ontological models.
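
    Semantic tagging via topic modelling, as mentioned above, can be prototyped in a few lines; this sketch uses scikit-learn's LDA on a toy corpus purely to illustrate the technique, not the project's implementation:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        documents = [
            "invoice payment bank transfer amount",
            "court ruling appeal verdict judge",
            "invoice overdue payment reminder bank",
        ]  # toy corpus; the real input is the accumulated document corpus

        vectorizer = CountVectorizer().fit(documents)
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        topic_weights = lda.fit_transform(vectorizer.transform(documents))
        print(topic_weights)  # per-document topic distribution, usable as semantic tags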


  7. Data from: Datasets: Programmable content and a pattern-matching algorithm for automatic adaptive authoring in Augmented Reality for maintenance

    • cord.cranfield.ac.uk
    • figshare.com
    zip
    Updated Jun 1, 2020
    Cite
    Iñigo Fernández del amo blanco; John ahmet Erkoyuncu; Maryam Farsi (2020). Datasets: Programmable content and a pattern-matching algorithm for automatic adaptive authoring in Augmented Reality for maintenance [Dataset]. http://doi.org/10.17862/cranfield.rd.12213380.v4
    Explore at:
    zip
    Dataset updated
    Jun 1, 2020
    Dataset provided by
    Cranfield Online Research Data (CORD)
    Authors
    Iñigo Fernández del amo blanco; John ahmet Erkoyuncu; Maryam Farsi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This repository includes datasets from the experimental case studies and analyses of the research "Programmable content and a pattern-matching algorithm for automatic adaptive authoring in Augmented Reality for maintenance".

    Abstract: "Augmented Reality (AR) can increase the efficiency and safety of maintenance operations, but the cost of augmented content creation (authoring) is hindering its industrial deployment. A relevant research gap involves the ability of authoring solutions to automatically generate content for multiple operations. Hence, this paper offers programmable content formats and a pattern-matching algorithm for automatic adaptive authoring of ontology-based maintenance data. The proposed solution is validated against common authoring tools for repair and remote diagnosis AR applications in terms of the operational efficiency gains achieved with the content they produce. Experimental results show that content from all authoring solutions attains the same time reductions (42%) in comparison with non-AR information delivery tools. Survey results suggest similar perceived usability across all authoring solutions, and better content adaptiveness and user performance tracking for this authoring proposal."

  8. Cityscapes Image Pairs

    • kaggle.com
    Updated Apr 20, 2018
    + more versions
    Cite
    DanB (2018). Cityscapes Image Pairs [Dataset]. https://www.kaggle.com/datasets/dansbecker/cityscapes-image-pairs/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 20, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DanB
    Description

    Context

    Cityscapes data (dataset home page) contains labeled videos taken from vehicles driven in Germany. This version is a processed subsample created as part of the Pix2Pix paper. The dataset has still images from the original videos, and the semantic segmentation labels are shown in images alongside the original image. This is one of the best datasets around for semantic segmentation tasks.

    Content

    This dataset has 2,975 training image files and 500 validation image files. Each image file is 256x512 pixels, and each file is a composite with the original photo on the left half of the image, alongside the labeled image (output of semantic segmentation) on the right half.
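
    Since each file packs the photo and its label side by side, splitting them back apart takes one crop per half with PIL (the file path is illustrative):

        from PIL import Image

        pair = Image.open("cityscapes_data/train/1.jpg")  # hypothetical path to one composite file
        w, h = pair.size                                  # e.g. (512, 256)
        photo = pair.crop((0, 0, w // 2, h))              # left half: original frame
        label = pair.crop((w // 2, 0, w, h))              # right half: segmentation label image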

    Acknowledgements

    This dataset is the same as what is available here from the Berkeley AI Research group.

    License

    The Cityscapes data available from cityscapes-dataset.com has the following license:

    This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:

    • That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Daimler AG, MPI Informatics, TU Darmstadt) do not accept any responsibility for errors or omissions.
    • That you include a reference to the Cityscapes Dataset in any work that makes use of the dataset. For research papers, cite our preferred publication as listed on our website; for other media cite our preferred publication as listed on our website or link to the Cityscapes website.
    • That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character.
    • That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain.
    • That all rights not expressly granted to you are reserved by (Daimler AG, MPI Informatics, TU Darmstadt).

    Inspiration

    Can you identify what objects are where in these images taken from a vehicle?

  9. OMOP2OBO Condition Occurrence Mappings

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 29, 2023
    Cite
    Baumgartner, William A (2023). OMOP2OBO Condition Occurrence Mappings [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6774363
    Explore at:
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    Kahn, Michael G
    Callahan, Tiffany J
    Vasilevsky, Nicole A
    Hunter, Lawrence D
    Martin, Blake
    Bennett, Tellen D
    Feinstein, James A
    Baumgartner, William A
    Wyrwa, Jordan M
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    OMOP2OBO Condition Occurrence Mappings V1.0

    These mappings were created by the OMOP2OBO mapping algorithm (see links below). OMOP2OBO - the first health system-wide, disease-agnostic mappings between standardized clinical terminologies and eight Open Biomedical Ontology (OBO) Foundry ontologies spanning diseases, phenotypes, anatomical entities, cell types, organisms, chemicals, vaccines, and proteins. These mappings are also the first to be explicitly created using standard terminologies in the Observational Medical Outcomes (OMOP) common data model (CDM), ensuring both semantic and clinical interoperability across a space of N conditions (and N relationships curated in these ontologies).

    The mappings in this repository were created between OMOP standard condition occurrence concepts (i.e., SNOMED CT) and the Human Phenotype Ontology (HPO) and the Mondo Disease Ontology (Mondo). The National Library of Medicine's Unified Medical Language System (UMLS) Semantic Types were first used to filter out all concepts without a biological origin (accidents, injuries, external complications, and findings without clear interpretations). The Semantic Type was then used to prioritize mapping HPO concepts to findings and symptoms, and Mondo concepts to Semantic Types indicative of disease. For these OMOP domains, owl:intersectionOf (“and”) and owl:unionOf (“or”) constructors were used to build semantically expressive mappings.

    Mapping Details

    Mappings included in this set were generated automatically by OMOP2OBO or through a bag-of-words embedding model using TF-IDF. Cosine similarity is used to compute similarity scores between all pairwise combinations of OMOP and OBO concepts and ancestor concepts. To improve the efficiency of this process, the algorithm searches only the top 𝑛 most similar results and keeps the top 75th percentile among all pairs with scores >= 0.25. Manually created mappings are also included.
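
    A rough sketch of that TF-IDF / cosine-similarity step (toy concept labels and illustrative threshold handling; not the OMOP2OBO code itself):

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        omop_labels = ["chronic kidney disease stage 3", "fracture of femur"]
        obo_labels = ["chronic kidney disease", "femur fracture", "asthma"]

        vec = TfidfVectorizer().fit(omop_labels + obo_labels)
        scores = cosine_similarity(vec.transform(omop_labels), vec.transform(obo_labels))

        # keep candidate pairs scoring >= 0.25, then take the top 75th percentile of those
        candidates = scores[scores >= 0.25]
        cutoff = np.percentile(candidates, 75) if candidates.size else None
        print(scores)
        print("75th percentile cutoff:", cutoff)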

    Mapping Categories

    Automatic One-to-One Concept: Exact label or synonym, dbXRef, or expert validated mapping @ concept-level; 1:1

    Automatic One-to-One Ancestor: Exact label or synonym, dbXRef, or expert validated mapping @ concept ancestor-level; 1:1

    Automatic One-to-Many Concept: Exact label or synonym, dbXRef, cosine similarity, or expert validated mapping @ concept-level; 1:Many

    Automatic One-to-Many Ancestor: Exact label or synonym, dbXRef, cosine similarity, or expert validated mapping @ concept-level; 1:Many

    Manual One-to-One: Hand mapping created using expert suggested resources; 1:1

    Manual One-to-Many: Hand mapping created using expert suggested resources; 1:Many

    Cosine Similarity: score suggested mapping -- manually verified

    UnMapped: No suitable mapping or not mapped type

    Mapping Statistics

    Additional statistics have been provided for the mappings and are shown in the table below. The table presents the counts of OMOP concepts by mapping category and ontology:

    Mapping Category                  HPO      Mondo
    Automatic One-to-One Concept      4767     9097
    Automatic One-to-Many Concept     150      885
    Cosine Similarity                 1375     667
    Automatic One-to-One Ancestor     13595    8911
    Automatic One-to-Many Ancestor    38080    40224
    Manual                            5131     755
    Manual One-to-Many                10326    2835
    Unmapped                          36301    46345

    Provenance and Versioning: The V1.0 deposited mappings were created by OMOP2OBO v1.0.0 on October 2022 using the OMOP Common Data Model V5.0 and OBO Foundry ontologies downloaded on September 14, 2020.

    Caveats: The deposited files only contain the mappings that were generated automatically by the algorithm. The manually generated mappings will be deposited with the official preprint manuscript. Please note that these are the original mappings that were created for the preprint. They have not been updated to current versions of the ontologies. In our experience, this should result in very few errors, but we do suggest that you check the ontology concepts used against current versions of each ontology before using them.

    Important Resources and Documentation

    GitHub: OMOP2OBO

    Project Wiki: OMOP2OBO - wiki

    Zenodo Community: OMOP2OBO

    Preprint Manuscript: 10.5281/zenodo.5716421

  10. 3D Scene Graph Dataset

    • redivis.com
    application/jsonl +7
    Updated Jun 28, 2024
    Cite
    Stanford Doerr School of Sustainability (2024). 3D Scene Graph Dataset [Dataset]. http://doi.org/10.57761/y61f-w926
    Explore at:
    stata, csv, avro, sas, application/jsonl, arrow, parquet, spss
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Doerr School of Sustainability
    Time period covered
    Jun 27, 2024 - Jun 28, 2024
    Description

    Abstract

    The 3D Scene Graph provides a variety of semantic data for models in the Gibson environment in the form of a scene graph. Semantic information is provided for buildings, rooms, and objects, and includes attributes (e.g., category, floors, dimensions, material, and texture) and relationships (e.g., spatial, parent-child, and comparative).

    Methodology

    The 3D Scene Graph provides semantic data for models in the Gibson environment that corresponds to the structure proposed in 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. The semantic information for models in the tiny Gibson split is verified via crowdsourcing and contains all 3D Scene Graph attributes. For these models we provide both the automated and the verified outputs. For the rest, the semantic information is the output of automated modules and does not include modalities that depend solely on manual input (e.g., object materials and textures). You can learn more about 3D Scene Graph and interact with the semantic data here: http://3dscenegraph.stanford.edu

    [Figures: 3dscenegraph.png and Albertville.png]

  11. Building Object and Outdoor Scene Segmentation (BOOSS) - Multi-channel (RGB + Thermal) Aerial Imagery Datasets

    • explore.openaire.eu
    Updated Aug 1, 2021
    Cite
    Yu Hou; Rebekka Volk; Lucio Soibelman (2021). Building Object and Outdoor Scene Segmentation (BOOSS) - Multi-channel (RGB + Thermal) Aerial Imagery Datasets [Dataset]. http://doi.org/10.5281/zenodo.5241286
    Explore at:
    Dataset updated
    Aug 1, 2021
    Authors
    Yu Hou; Rebekka Volk; Lucio Soibelman
    Description

    {"references": ["Y. Hou, L. Soibelman, R. Volk, and M. Chen, Factors Affecting the Performance of 3D Thermal Mapping for Energy Audits in A District by Using Infrared Thermography (IRT) Mounted on Unmanned Aircraft Systems (UAS), doi: https://doi.org/10.22260/ISARC2019/0036.", "Y. Hou, R. Volk, M. Chen, and L. Soibelman, (2021). Fusing tie points' RGB and thermal information for mapping large areas based on aerial images: A study of fusion performance under different flight configurations and experimental conditions, Automation in Construction, vol. 124, doi: 10.1016/j.autcon.2021.103554."]} The dataset of Building Object and Outdoor Scene Segmentation (BOOSS) is based on multi-channel aerial imagery data. It covers - Ground Truth - RGB - Thermal The annotations in version 1.0 include roofs, facades, cars, roof equipment, and ground equipment Please cite as: Hou, Yu, Meida Chen, Rebekka Volk, and Lucio Soibelman. "An Approach to Semantically Segmenting Building Components and Outdoor Scenes Based on Multichannel Aerial Imagery Datasets." Remote Sensing 13, no. 21 (2021): 4357.

  12. Data from: Semantic Annotation Automatic of Curriculum Lattes Using Linked Open Data

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Cite
    Walison Dias da Silva; Fernando Silva Parreiras; Luiz Cláudio Gomes Maia; Wladmir Cardoso Brandão (2023). Semantic Annotation Automatic of Curriculum Lattes Using Linked Open Data [Dataset]. http://doi.org/10.6084/m9.figshare.20006418.v1
    Explore at:
    jpeg
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO journals
    Authors
    Walison Dias da Silva; Fernando Silva Parreiras; Luiz Cláudio Gomes Maia; Wladmir Cardoso Brandão
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract The Semantic Web aims to optimize document retrieval by enriching documents with meaning, allowing both people and machines to understand the information they contain. Semantic annotation of entities is the path to bringing semantics into documents. The objective of this paper is to outline the Semantic Web concepts that allow entities in the Lattes Curriculum to be annotated automatically based on Linked Open Data (LOD), which stores the meaning of terms and expressions. The problem addressed in this research is which Semantic Web concepts can contribute to the automatic semantic annotation of entities in the Lattes Curriculum using Linked Open Data. During the literature review, the concepts, tools and technologies related to the theme were presented. The application of these concepts allowed the creation of the Semantic Web Lattes System. An empirical study was conducted with the objective of identifying the most effective entity extraction tool. The system allows importing the XML curricula from the Lattes Platform, automatically annotates the available data using open databases, and allows running semantic queries.

  13. Lane Detection for Carla Driving Simulator

    • kaggle.com
    Updated Dec 3, 2020
    Cite
    thomasfermi (2020). Lane Detection for Carla Driving Simulator [Dataset]. https://www.kaggle.com/thomasfermi/lane-detection-for-carla-driving-simulator/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 3, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    thomasfermi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset was created for the online course Algorithms for Automated Driving. In that course the students are guided to implement software that
    • detects lane boundaries from an image using deep learning
    • controls steering wheel and throttle to keep the vehicle within the detected lane at the desired speed

    Content

    The data set consists of images that were generated with the Carla driving simulator. The training images are images captured by a dashcam that is installed in the simulated vehicle. The label images are segmentation masks. A label image classifies each pixel as
    • part of the left lane boundary
    • part of the right lane boundary
    • neither of those

    The challenge connected to this dataset is to train a model that can accurately predict the segmentation masks for the validation data set. The metric that I consider relevant is the dice score, which is nicely explained in this blog post about semantic segmentation by Jeremy Jordan.
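
    For reference, the dice score mentioned above can be computed in a few lines for binary masks (a minimal NumPy sketch, not code from the course):

        import numpy as np

        def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
            """Dice coefficient for two binary masks of the same shape."""
            pred = pred.astype(bool)
            target = target.astype(bool)
            intersection = np.logical_and(pred, target).sum()
            return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

        mask = np.array([[0, 1], [1, 1]])
        print(dice_score(mask, mask))  # perfect overlap -> ~1.0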

    More details on how the dataset was generated can be found on the course website.

    Acknowledgements

    I owe thanks to the creators of the Carla driving simulator that enabled the generation of this dataset.

    Inspiration

    I would love to see notebooks from the kaggle community that perform better than my sample solution in terms of the dice score.

    For educational purposes, it would be great to see notebooks using different libraries to solve this challenge (I used the segmentation models pytorch library). I would also love to see solutions that are very easy to understand for students. Maybe someone can create a good and short solution with keras or fastai?

  14. Comic Semantic Last Dataset

    • universe.roboflow.com
    zip
    Updated Mar 31, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    George Mason University (2024). Comic Semantic Last Dataset [Dataset]. https://universe.roboflow.com/george-mason-university-tk2yn/comic-semantic-last/dataset/1
    Explore at:
    zip
    Dataset updated
    Mar 31, 2024
    Dataset authored and provided by
    George Mason University
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Comic Semantic Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Comic Book Categorization: This model can be used by comic book creators, publishers, and platforms for automatic categorization and tagging of comic book content, which can significantly improve search functionality and user recommendations.

    2. Interactive Study Tool: Educators in the field of media and visual arts could use this model as a tool to help students study and understand the nuances of comic semantics, including differentiating characters, objects, and other elements.

    3. Animated Film Production: In the animation industry, this model can be utilized to help storyboard artists, animators, and directors identify and extract certain elements from existing comics for characters design, scene settings or plot inspiration.

    4. Comic Accessibility: For visually impaired individuals, this model can extract and describe comic semantic classes, providing an enhanced experience of enjoying comics through descriptive audio.

    5. AI-Powered Comic Creator: App developers can create a tool that uses the model to help amateur comic creators to recognize and improve their drawing of comic semantic classes, guiding them to produce professional-quality content.

  15. CINTIL-PropBank

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Sep 12, 2012
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2012). CINTIL-PropBank [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-W0056/
    Explore at:
    Dataset updated
    Sep 12, 2012
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens) and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus. For the creation of this PropBank we adopted a semi-automatic analysis with double-blind annotation followed by adjudication. The resulting dataset contains three information levels: phrase constituency, grammatical functions, and phrase semantic roles. The main motivation behind the creation of this resource was to build a high-quality data set with semantic information that could support the development of automatic semantic role labelers for Portuguese.

  16. Reference Building BMS Data BRICK Models

    • kaggle.com
    Updated Sep 16, 2021
    Cite
    Clayton Miller (2021). Reference Building BMS Data BRICK Models [Dataset]. https://www.kaggle.com/claytonmiller/reference-building-bms-data-brick-models/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 16, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Clayton Miller
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data set comes from the BRICK Schema website and the description below is taken from the site. Authors of the BRICK models are outlined in the Metadata section.

    Context

    Brick is an open-source effort to standardize semantic descriptions of the physical, logical and virtual assets in buildings and the relationships between them. Brick consists of an extensible dictionary of terms and concepts in and around buildings, a set of relationships for linking and composing concepts together, and a flexible data model permitting seamless integration of Brick with existing tools and databases. Through the use of powerful Semantic Web technology, Brick can describe the broad set of idiosyncratic and custom features, assets and subsystems found across the building stock in a consistent manner.

    Adopting Brick as the canonical description of a building enables the following:

    • Brick lowers the cost of deploying analytics, energy efficiency measures and intelligent controls across buildings
    • Brick presents an integrated, cross-vendor representation of the multitude of subsystems in modern buildings: HVAC, lighting, fire, security and so on
    • Brick simplifies the development of smart analytics and control applications
    • Brick reduces the reliance upon the non-standard, unstructured labels endemic to building management systems
    • Brick is free and open-sourced under the BSD 3-Clause license. The source code for Brick, this website, and related tools developed by the Brick team are available on GitHub.

    Content

    These five models are representative examples of how Brick can be used to model real buildings. For an in-depth discussion of the creation and evaluation of these Brick models, please refer to the BuildSys 2016 and Applied Energy 2018 papers.

    Building                      Location                  BMS                       Built  Sq Ft    Points  Relationships  Classified
    Soda Hall                     Berkeley, CA              Barrington Systems        1994   110,565  1,586   1,939          98.7%
    Gates Hillman Center          Pittsburgh, PA, USA       Automated Logic Controls  2009   217,000  8,292   35,693         99%
    Rice Hall                     Charlottesville, VA, USA  -                         2011   100,000  1,300   2,158          98.5%
    Engineering Building Unit 3B  San Diego, CA, USA        Johnson Controls          2004   150,000  4,594   8,383          96%
    Green Tech House              Vejle, Denmark            Niagara                   2014   38,000   956     19,086         98.8%

    Points: the number of BMS points contained in the model
    Relationships: the number of relationships contained in the model
    Classified: the percentage of points classified with Brick
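
    To poke at one of these models, a short rdflib sketch that tallies entities per Brick class (the file name is a placeholder; Brick models are typically distributed as Turtle):

        from collections import Counter
        from rdflib import Graph, RDF

        g = Graph()
        g.parse("soda_hall.ttl", format="turtle")  # hypothetical file name for one of the five models

        # count instances by their declared class (Brick point and equipment classes)
        class_counts = Counter(o for _, _, o in g.triples((None, RDF.type, None)))
        for cls, n in class_counts.most_common(10):
            print(n, cls)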

    Acknowledgements

    The authors of the following papers can be credited for creating these example BRICK files:

    Inspiration

    These files can be used to play around with BRICK format files and test out the tools in development for use on BRICK

  17. Data from: Ontology lexicalization: Relationship between content and meaning in the context of Information Retrieval

    • scielo.figshare.com
    gif
    Updated Jun 1, 2023
    Cite
    Marcelo SCHIESSL; Marisa BRÄSCHER (2023). Ontology lexicalization: Relationship between content and meaning in the context of Information Retrieval [Dataset]. http://doi.org/10.6084/m9.figshare.5885659.v1
    Explore at:
    gif
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SciELO journals
    Authors
    Marcelo SCHIESSL; Marisa BRÄSCHER
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract The proposal presented in this study seeks to properly map natural language to ontologies and vice versa. To that end, we propose the semi-automatic creation of a machine-readable lexical database in Brazilian Portuguese containing morphological, syntactic, and semantic information, allowing structured and unstructured data to be linked and integrated into an information retrieval model to improve precision. The results obtained demonstrate that the methodology can be used in the risco financeiro (financial risk) domain in Portuguese for the construction of an ontology and the lexical-semantic database, and for the proposal of a semantic information retrieval model. To evaluate the performance of the proposed model, documents containing the main definitions of the financial risk domain were selected and indexed with and without semantic annotation. To enable the comparison between the approaches, two databases were created: the first represents the traditional search, and the second contains the index built from the texts with semantic annotations to represent the semantic search. The evaluation of the proposal was based on recall and precision. The queries submitted to the model showed that the semantic search outperforms the traditional search and validates the methodology used. Although more complex, the proposed procedure can be used in all kinds of domains.

  18. Linked Open Data at cervantesvirtual.com

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 24, 2020
    Cite
    C. Carrasco; Marco-Such; Escobar; Candela (2020). Linked Open Data at cervantesvirtual.com [Dataset]. http://doi.org/10.5281/zenodo.998617
    Explore at:
    bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    C. Carrasco; Marco-Such; Escobar; Candela
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records which were originally created in compliance with the MARC21 standard. The entries in the catalogue have been recently migrated to a new relational database whose data model adheres to the conceptual models promoted by the International Federation of Library Associations and Institutions (IFLA), in particular, to the FRBR and FRAD specifications.

    The database content has been later mapped, by means of an automated procedure, to RDF triples which employ mainly the RDA vocabulary (Resource Description and Access) to describe the entities, as well as their properties and relationships. In contrast to a direct transformation, the intermediate relational model provides tighter control over the process for example through referential integrity, and therefore enhanced validation of the output. This RDF-based semantic description of the catalogue is now accessible online.
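
    As a tiny illustration of what mapping a catalogue record to RDF triples looks like in practice, here is an rdflib sketch; it uses Dublin Core terms and a placeholder URI purely as stand-ins, whereas the actual BVMC data uses RDA element IRIs and its own entity identifiers:

        from rdflib import Graph, Literal, URIRef
        from rdflib.namespace import DCTERMS, RDF

        g = Graph()
        work = URIRef("https://example.org/work/quijote")  # placeholder, not a real BVMC identifier

        g.add((work, RDF.type, DCTERMS.BibliographicResource))
        g.add((work, DCTERMS.title, Literal("Don Quijote de la Mancha", lang="es")))
        g.add((work, DCTERMS.creator, Literal("Miguel de Cervantes")))

        print(g.serialize(format="turtle"))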

  19. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with the explicit mentions of column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
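
    A quick way to sanity-check the layout described above once the files are downloaded (standard library only; this assumes the usual Spider fields db_id, question, and query):

        import json

        with open("spider-realistic.json") as f:
            examples = json.load(f)

        print(len(examples), "examples")                            # expected: 508
        print(len({ex["db_id"] for ex in examples}), "databases")   # expected: 19
        print(examples[0]["question"])
        print(examples[0]["query"])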

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  20. Point Cloud Dataset of Reinforced Concrete Bridges Captured with Matterport...

    • figshare.com
    bin
    Updated Jul 11, 2025
    Cite
    Pang-jo Chun; Chao Lin; Tatsuro Yamane; Shiori Kubo; Yu Chen (2025). Point Cloud Dataset of Reinforced Concrete Bridges Captured with Matterport Pro3 Scanner in Japan [Dataset]. http://doi.org/10.6084/m9.figshare.28091453.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    figshare
    Authors
    Pang-jo Chun; Chao Lin; Tatsuro Yamane; Shiori Kubo; Yu Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    This dataset contains high-resolution point cloud data of nine reinforced concrete bridges located in rural areas of Japan, collected in December 2023 using the Matterport Pro3 terrestrial laser scanner. The scanner features a 360° horizontal field of view (FOV) and a 295° vertical FOV, operating with a 904 nm wavelength laser beam. It achieves a measurement accuracy of ±20 mm at a distance of 10 m and captures up to 100,000 points per second.
    Key characteristics of the dataset:
    - Data format: LAS
    - Coordinate system: local, without georeferencing
    - Resolution: coordinate scale value of 1 mm
    This dataset was created to support research on automated dimension estimation of bridge components using semantic segmentation and geometric analysis. It can be utilized by researchers and practitioners in structural engineering, computer vision, and infrastructure management for tasks such as semantic segmentation, structural analysis, and digital twin development.
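    As a starting point for working with the LAS files, the sketch below loads one scan with the laspy and numpy packages and prints basic properties; the file name is a placeholder, not an actual name from the dataset.

    # Hedged sketch: inspect one LAS file from the dataset (placeholder file name).
    import laspy
    import numpy as np

    las = laspy.read("bridge_01.las")                  # placeholder name
    print(las.header.point_count, "points")
    print("coordinate scales:", las.header.scales)     # expected to be around 0.001 (1 mm)

    # Stack the scaled coordinates into an (N, 3) array in the local
    # (non-georeferenced) coordinate system described above.
    xyz = np.vstack((las.x, las.y, las.z)).T

    # Axis-aligned extent as a rough sanity check before segmentation or analysis.
    print("extent (m):", xyz.max(axis=0) - xyz.min(axis=0))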

Two residential districts datasets from Kielce, Poland for building semantic segmentation task

The Kielce Geoportal (https://gis.kielce.eu) offers a recent .pst map from April 2019. It is an orthophotomap with a resolution of 5 x 5 pixels, constructed from an aerial flight at 700 meters above ground level, taken with a camera for vertical photos. Downloading was done over WMS in the open-source QGIS software (https://www.qgis.org) as a 1:500 scale map, which was then converted to a 1200 dpi PNG image. Then, the map of the Wietrznia residential district was manually labelled, also in QGIS, over the same extent as the orthophotomap. The annotation was based on land cover information, likewise obtained from the Kielce Geoportal. There are two classes: residential building and surroundings. The second map, of the Pod Telegrafem district, was not annotated, since it was used in the testing phase and imitates a situation where no annotation exists for new data presented to the model. Next, the images were converted to RGB JPG images, and the annotation map was converted to an 8-bit grayscale PNG image. Finally, the Wietrznia data files were tiled into 512 x 512 pixel tiles using the Python PIL library. Tiles with no information or relatively little information (an entirely or mostly white background) were manually removed, so from the 29113 x 15938 pixel orthophotomap only 810 tiles with corresponding annotations were left, ready to train the machine learning model for the semantic segmentation task. The Pod Telegrafem orthophotomap was tiled without manual removal, so 197 tiles of 256 x 256 pixel resolution were created from the 7168 x 7168 pixel orthophotomap. There was also an image of one residential building, used for the model's validation during the training phase; it was not part of the training data but was part of the Wietrznia residential area. It was a 2048 x 2048 pixel orthophotomap, tiled into 16 tiles of 256 x 265 pixels each.
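The tiling step can be reproduced with a short routine built on the Python PIL library. The sketch below is a hedged reconstruction: the file names, the whiteness threshold, and the automatic filtering of background tiles are assumptions (the original removal of empty tiles was done manually), not the authors' script.

# Hedged reconstruction of the tiling step; thresholds and file names are assumed.
from pathlib import Path
import numpy as np
from PIL import Image

Image.MAX_IMAGE_PIXELS = None          # the orthophotomap exceeds PIL's default size limit

def tile_image(path, out_dir, tile=512, max_white_fraction=0.95):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    img = Image.open(path)
    width, height = img.size
    kept = 0
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            crop = img.crop((left, top, left + tile, top + tile))
            gray = np.asarray(crop.convert("L"))
            # Skip tiles that are (almost) entirely white background.
            if (gray > 240).mean() > max_white_fraction:
                continue
            crop.save(out / f"tile_{top}_{left}.png")
            kept += 1
    return kept

print(tile_image("wietrznia_2019.jpg", "wietrznia_2019_tiles"))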
