Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PHENOTYPE KNOWLEDGE TRANSLATOR (PHEKNOWLATOR)
2021 Continuous Evaluation of Relational Learning in Biomedicine (CERLIB)
OVERVIEW
INTRODUCTION
PheKnowLator (Phenotype Knowledge Translator) is a Python 3 library that constructs semantically rich, large-scale biomedical knowledge graphs under different semantic models. PheKnowLator is also a data sharing hub, providing downloadable versions of prebuilt knowledge graphs. For this challenge, the PheKnowLator knowledge graphs have been designed to model mechanisms of human disease and were built using 12 open biomedical ontologies, 24 linked open datasets, and results from two large-scale, experimentally derived datasets. For additional information, see the associated GitHub wiki: https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0. For a visual representation of the resources used (and their relationships) in the PheKnowLator knowledge graphs, see the link below.
KNOWLEDGE GRAPH BUILDS
PheKnowLator was designed to generate knowledge graphs under different semantic models and to give users complete flexibility throughout the construction process. At its core, PheKnowLator builds on a set of Open Biomedical Ontologies (OBOs), which are extended with external data sources using different knowledge models. The software gives users the flexibility to customize the following parameters:
CHALLENGE DATA
With this information in mind, the Google Cloud Storage Bucket includes the data files listed below. Additional information for each file type can be found here: https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0#knowledge-graph-output.
Knowledge Graph Data
Edge Lists
Metadata
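The prebuilt graphs are distributed in standard RDF serializations alongside the edge lists and metadata. As a minimal sketch (the local file name below is a placeholder rather than a file documented here), an N-Triples build can be loaded and inspected with rdflib:

```python
# Minimal sketch: load an N-Triples serialization of a PheKnowLator build
# with rdflib and count triples per predicate. The file name is a
# placeholder; substitute the file downloaded from the challenge bucket.
from collections import Counter

from rdflib import Graph

kg = Graph()
kg.parse("pheknowlator_build.nt", format="nt")  # hypothetical local file name

print(f"Loaded {len(kg):,} triples")

predicate_counts = Counter(str(p) for _, p, _ in kg)
for predicate, count in predicate_counts.most_common(10):
    print(predicate, count)
```

Counting triples per predicate is a quick way to confirm which relations are present in a given build before running any downstream analysis.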
CHALLENGE RELATIONS
We will evaluate predictions on 15 Relation Ontology (RO) relations utilized in 34 distinct edge types. Additional details on these edge types can be found here: https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0#edge-data. The 15 RO relations and their associated edge types are shown in the table below.
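For a rough sense of how the challenge relations could be inspected from the edge-list files, here is a hedged pandas sketch; the tab-separated layout, the column names, and the RO_0002434 identifier are illustrative assumptions rather than the documented schema:

```python
# Hedged sketch: tally edge counts per RO relation in an edge list.
# Column names and the tab-separated layout are assumptions about the
# downloaded file, not a documented schema; adjust to the actual header.
import pandas as pd

edges = pd.read_csv("edge_list.tsv", sep="\t",
                    names=["subject", "predicate", "object"])

# Count how many edges use each RO relation (predicates are RO CURIEs/IRIs).
relation_counts = edges["predicate"].value_counts()
print(relation_counts.head(15))

# Example: keep only edges asserted with one relation of interest,
# e.g. RO_0002434 ("interacts with"); the identifier is illustrative.
interacts = edges[edges["predicate"].str.contains("RO_0002434")]
print(len(interacts), "interacts-with edges")
```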
BUILD UPDATES
Below we note important updates to each build. For additional information on each build please see the project Wiki (https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0) and for more information on the data sources that are used for each build see: https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources.
JANUARY 2021
APRIL 2021
chemical-rna edge types
MAY 2021
JUNE 2021
JULY 2021
AUGUST 2021
SEPTEMBER 2021
OCTOBER 2021
NOVEMBER 2021
Industrial research: Task No. 1 - Development of algorithms for extracting objects from data
This task covers industrial research on algorithms for extracting objects from data. The basic premise of the semantic web is to operate on objects that have specific attributes and on the relations between them. Because the input data to the system usually have a weak structure (textual or loosely structured documents with general attributes such as title, creator, etc.), methods are needed to extract objects of basic types that represent typical real-world concepts, such as people, institutions, places, and dates. Tasks of this kind are handled by natural language processing and entity extraction algorithms. The main technological challenge of this stage was therefore to develop an algorithm that extracts entities as effectively as possible from weakly structured documents, in the extreme case plain text. This required converting incoming documents into a shared internal representation and extracting entities in a generalized way, regardless of the source form of the document.
Detailed tasks (milestones):
Development of an algorithm for data pre-processing and an internal representation of a generalized document: methods will be selected for pre-processing documents from various sources and formats into a common form on which the subsequent algorithms operate. Inputs include text documents (PDF, Word, etc.), scans of printed documents (handwriting is excluded), web documents (HTML pages), other databases (relational tables), CSV/XLS files, and XML files.
Development of an algorithm for extracting simple attributes from documents: extraction of simple scalar attributes, such as dates and numbers, from the processed documents, taking into account metadata already present in the source systems and document templates for groups of documents with a similar structure.
Development of an entity extraction algorithm for basic classes of objects: extraction of entities from unstructured text documents using NLP techniques and a language corpus developed for Polish and English (extensible to other languages), covering the basic types of real-world objects (places, people, institutions, events, etc.); see the illustrative sketch after this list.
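As a purely illustrative sketch of the entity extraction milestone, the snippet below uses spaCy named entity recognition; the library, the model names, and the example sentence are my assumptions, since the task description does not prescribe a specific toolkit:

```python
# Minimal sketch of NLP-based entity extraction for basic object classes
# (people, institutions, places, dates). spaCy and the model names are
# illustrative choices; the project text does not name a library.
import spacy

# English pipeline; a Polish pipeline such as "pl_core_news_sm" could be
# loaded the same way for Polish-language documents.
nlp = spacy.load("en_core_web_sm")

text = (
    "On 12 March 2021 the National Library in Warsaw signed an agreement "
    "with Maria Kowalska of the University of Gdansk."
)

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is the entity class, e.g. PERSON, ORG, GPE, DATE
    print(ent.text, ent.label_)
```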
Industrial research: Task No. 2 - Development of algorithms for automatic ontology creation
Task 2 covers the development of algorithms for automatic ontology creation. Reducing the influence of the human factor on data organization requires algorithms that largely automate the classification and organization of data imported into the system. This calls for advanced knowledge modeling techniques such as ontology extraction and topic modeling. These algorithms are usually based on text statistics, and the quality of their output depends heavily on the quality of the input data. This creates a risk that the models produced by the algorithms may differ from the expert models used by domain specialists, and the solution architecture must account for that risk.
Detailed tasks (milestones):
Development of an algorithm for organizing objects into dictionaries and deduplicating dictionary entries: the algorithm organizes the objects identified by the previously developed algorithms so that objects representing the same concept are not duplicated and so that the appropriate relationships between nodes of the semantic network can be presented.
Development of an extraction algorithm for a domain ontological model: requires sophisticated methods of analyzing the accumulated document corpus to identify domain-specific concepts and objects. The task will be carried out by a research unit experienced in creating ontological models.
Development of a semantic tagging algorithm: requires topic modeling methods (see the illustrative sketch after this list). The task will be carried out by a research unit experienced in creating ontological models.
Development of a method for representing the semantic model in the database: the aim is to encode the information produced by the preceding algorithms so that it can be stored in a scalable manner in an appropriate database.
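The semantic tagging milestone relies on topic modeling. The snippet below is a hedged sketch using gensim LDA on a toy corpus; the library choice, the corpus, and the topic count are illustrative assumptions only:

```python
# Hedged sketch of topic-model-based semantic tagging with gensim LDA.
# The library, toy corpus, and number of topics are illustrative; the
# project description only calls for topic modeling in general.
from gensim import corpora, models

documents = [
    "invoice payment contract supplier amount",
    "contract supplier delivery invoice terms",
    "conference speaker registration venue agenda",
    "venue agenda registration catering speaker",
]
texts = [doc.split() for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=42)

# Each topic is a weighted word list that can serve as a semantic tag.
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Tag a new document with its dominant topic.
new_doc = dictionary.doc2bow("supplier invoice overdue payment".split())
print(sorted(lda[new_doc], key=lambda item: -item[1])[0])
```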
Experimental development work: Task No. 3 - System prototype
The purpose of this task was to build an application prototype that validates both the feasibility of deploying the application at a realistic scale (millions of documents) and its functional usability for the end user. A common problem for researchers in semantic modeling is that they often work with theoretical models expressed in languages that are well suited to mathematical modeling but do not scale to production use. It was therefore necessary to design an architecture that allows the developed algorithms to scale to large data sets. Another aspect of semantic solutions is usability for end users: these solutions are built on advanced concepts, which forces a complex internal structure and complicated access to data. To keep the project usable, a user interface had to be developed that exposes advanced data operations to ordinary users.
Detailed tasks (milestones):
Development of methods for obtaining data from various sources: design of an architecture and pipelines for processing data acquired from heterogeneous sources and formats, so that it can be collected in a coherent form in a central knowledge repository. This requires an ETL/ESB-style architecture based on a queuing system and distributed processing.
Development of a large-scale data processing architecture for the developed algorithms: an implementation architecture that allows the algorithms to run at large scale, for example on top of distributed processing systems such as Apache Spark.
Development of scalable data storage methods: selection of a storage environment that can effectively represent knowledge as a semantic network; a graph database engine or a store that supports the RDF format will be required.
Development of an API enabling data mining: an API that lets various kinds of algorithms for further data processing, machine learning, and artificial intelligence use the semantic knowledge accumulated in the system. A probable solution is an interface based on the SPARQL standard (see the illustrative sketch after this list).
Development of a prototype user interface for data mining: an ergonomic interface that lets domain users explore and analyze the collected data. It requires a method of generating an interface that automatically adapts to the type of data collected in the system, enabling exploration through query-by-example searches, faceted search, and traversal of relationships between entities in the semantic model.
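Since the data-mining API milestone points toward the SPARQL standard, the following hedged sketch shows a SPARQL query executed with rdflib over a tiny in-memory graph; the example namespace and data are invented for illustration, and a production deployment would target a scalable triple store instead:

```python
# Minimal sketch of a SPARQL-based data-mining interface using rdflib.
# The tiny in-memory graph and the example.org namespace are illustrative;
# a production system would query a dedicated RDF/graph store.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.maria, RDF.type, EX.Person))
g.add((EX.maria, EX.name, Literal("Maria Kowalska")))
g.add((EX.maria, EX.worksFor, EX.ug))
g.add((EX.ug, EX.name, Literal("University of Gdansk")))

query = """
SELECT ?personName ?orgName
WHERE {
    ?p a ex:Person ;
       ex:name ?personName ;
       ex:worksFor ?o .
    ?o ex:name ?orgName .
}
"""

for row in g.query(query, initNs={"ex": EX}):
    print(row.personName, "works for", row.orgName)
```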
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was developed as part of the NANCY project (https://nancy-project.eu/) to support tasks in the computer vision area. It is specifically designed for sign language recognition, focusing on representing joints and finger positions. The dataset comprises images of hands that represent the alphabet in American Sign Language (ASL), with the exception of the letters "J" and "Z," as these involve motion and the dataset is limited to static images. A significant feature of the dataset is the use of color-coding, where each finger is associated with a distinct color. This approach enhances the ability to extract features and distinguish between different fingers, offering significant advantages over traditional grayscale datasets like MNIST. The dataset consists of RGB images, which enhance the recognition process and support more effective learning, achieving high performance even with a relatively modest amount of training data. This format improves the ability to discriminate and extract features compared to grayscale images. Although the use of RGB images introduces additional complexity, such as increased data representation and storage requirements, the advantages in accuracy and feature extraction make it a valuable choice. The dataset is well-suited for applications involving gesture recognition, sign language interpretation, and other tasks requiring detailed analysis of joint and finger positions. The NANCY project has received funding from the Smart Networks and Services Joint Undertaking (SNS JU) under the European Union's Horizon Europe research and innovation programme under Grant Agreement No 101096456.
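As a hedged illustration of how the images might be consumed in a typical training pipeline, the sketch below loads them with torchvision; the class-per-letter directory layout and the 64x64 resize are assumptions, not part of the dataset documentation:

```python
# Hedged sketch: load the color-coded RGB hand images for classification
# with torchvision. The class-per-letter folder layout ("A/", "B/", ...)
# and the image size are assumptions about how the files are organized.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),   # keep the three color channels intact
    transforms.ToTensor(),         # tensor shape: (3, 64, 64), values in [0, 1]
])

dataset = datasets.ImageFolder("asl_color_coded/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# 24 static letters are expected, since J and Z involve motion.
print(f"{len(dataset)} images across {len(dataset.classes)} letters")

images, labels = next(iter(loader))
print(images.shape, labels[:8])
```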
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Industrial research: Task No. 1 - Development of algorithms for extracting objects from data
This task covers industrial research on the development of algorithms for extracting objects from data. The basic premise of the semantic web is to operate on objects that have specific attributes and on the relations between them. Since the input data to the system usually have a weak structure (textual or structured documents with general attributes such as title, creator, etc.), it is necessary to develop methods for extracting objects of basic types that represent typical real-world concepts, such as people, institutions, places, and dates. Tasks of this kind are handled by algorithms from the fields of natural language processing and entity extraction. The main technological challenge of this stage was therefore to develop an algorithm that extracts entities as effectively as possible from weakly structured documents, in the extreme case plain text documents. This required converting incoming documents into a shared internal representation and extracting entities in a generalized way, regardless of the source form of the document.
Detailed tasks (milestones):
Industrial research: Task No. 2 - Development of algorithms for automatic ontology creation
Task 2 covers the development of algorithms for automatic ontology creation. Reducing the influence of the human factor on data organization processes requires algorithms that largely automate the classification and organization of data imported into the system. This calls for advanced knowledge modeling techniques such as ontology extraction and topic modeling. These algorithms are usually based on text statistics, and the quality of their output largely depends on the quality of the input data. This creates the risk that the models produced by the algorithms may differ from the expert models used by domain specialists, and this risk must therefore be taken into account in the solution architecture.
Detailed tasks (milestones):
Experimental development work: Task No. 3 - System prototype
The purpose of this task was to build an application prototype that validates both the feasibility of deploying the application at a realistic scale (millions of documents) and its functional usability for the end user. A problem encountered by researchers working on semantic modeling is that they often work with theoretical models expressed in languages that are optimal for mathematical modeling but do not scale to production use. It was therefore necessary to design an architecture that allows the developed algorithms to scale to the processing of large data sets. Another aspect of semantic solutions is usability for end users: these solutions are based on advanced concepts, which forces a complex internal structure of the systems and complicated access to data. To ensure the usability of the project, it was necessary to develop a user interface that offers advanced data operations to the ordinary user.
Detailed tasks (milestones):