Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FooDrugs database was developed by the Computational Biology Group at IMDEA Food Institute (Madrid, Spain) in the context of the Food Nutrition Security Cloud (FNS-Cloud) project. Food Nutrition Security Cloud (FNS-Cloud) has received funding from the European Union's Horizon 2020 Research and Innovation programme (H2020-EU.3.2.2.3. – A sustainable and competitive agri-food industry) under Grant Agreement No. 863059 – www.fns-cloud.eu (see more details about FNS-Cloud below).
FooDrugs stores information extracted from transcriptomics data and text documents for food-drug interactions, and it is part of a demonstrator to be developed in the FNS-Cloud project. The database was built using MySQL, an open-source relational database management system. FooDrugs hosts information for a total of 161 transcriptomics GEO series with 585 conditions for food or bioactive compounds. Each condition is defined as one food/biocomponent per time point, per concentration, per cell line, primary culture or biopsy, per study. FooDrugs includes a bipartite network with 510 nodes and their similarity scores (tau scores; https://clue.io/connectopedia/connectivity_scores) describing possible interactions with drugs assayed in the Connectivity Map (https://www.broadinstitute.org/connectivity-map-cmap). The information is stored in eight tables:
Table “study”: This table contains basic information about the study, such as its GEO, PubMed or platform identifiers, study type, title and abstract.
Table “sample”: This table contains basic information about the different experiments in a study, such as the sample identifier, treatment, origin type, time point or concentration.
Table “misc_study”: This table contains additional information about different attributes of the study.
Table “misc_sample”: This table contains additional information about different attributes of the sample.
Table “cmap”: This table contains information about 70895 nodes, comprising drugs, foods or bioactives, and overexpressed and knocked-down genes (see section 3.4). The information includes cell line, compound and perturbation type.
Table “cmap_foodrugs”: This table contains information about the tau score (see section 3.4) that relates food with drugs or genes and the node identifier in the FooDrugs network.
Table “topTable”: This table contains information about the 150 over- and underexpressed genes from each GEO study condition used to calculate the tau score (see section 3.4). The information stored is the logarithmic fold change, average expression, t-statistic, p-value, adjusted p-value, and whether the gene is up- or downregulated.
Table “nodes”: This table stores the sample and node identifiers for the bipartite network, connecting the tables “sample”, “cmap_foodrugs” and “topTable”.
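As a rough illustration of how these tables connect, the sketch below walks from a food condition in “sample” through “nodes” and “cmap_foodrugs” to the Connectivity Map entries in “cmap”. It is hypothetical: the column names (sample_id, node_id, cmap_id, tau, treatment, compound, cell_line) are assumptions, since only the table names and their general contents are documented above, and the tau cut-off of -90 follows the clue.io convention for strong connectivity scores.

```python
# Hypothetical query against the FooDrugs MySQL schema; table names come from
# the description above, column names are assumptions.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="reader", password="...", database="foodrugs"
)
cur = conn.cursor()
cur.execute(
    """
    SELECT c.compound, c.cell_line, cf.tau
    FROM sample s
    JOIN nodes n          ON n.sample_id = s.sample_id
    JOIN cmap_foodrugs cf ON cf.node_id  = n.node_id
    JOIN cmap c           ON c.cmap_id   = cf.cmap_id
    WHERE s.treatment = %s AND cf.tau <= -90   -- strongly opposed signatures
    ORDER BY cf.tau
    """,
    ("quercetin",),  # hypothetical food/bioactive condition
)
for compound, cell_line, tau in cur.fetchall():
    print(compound, cell_line, tau)
conn.close()
```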
In addition, the FooDrugs database stores a total of 6422 food-drug interactions from 2849 text documents, obtained from three different sources: 2312 documents from PubMed, 285 from DrugBank, and 252 from drugs.com. These documents describe potential interactions between 1464 food/bioactive compounds and 3009 drugs. The information is stored in two tables:
Table “texts”: This table contains all the documents, with their identifiers, in which interactions have been identified with the strategy described in section 4.
Table “TM_interactions”: This table contains information about interaction identifiers, the food and drug entities, and the start and end positions of the interaction context within the document.
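A hedged sketch of how the stored start and end positions might be used to recover the interaction context from the document text; all column names are assumptions, as only the two table names are documented above.

```python
# Hypothetical helper: cur is an open MySQL cursor on the FooDrugs database.
def interaction_context(cur, interaction_id):
    cur.execute(
        """
        SELECT t.document, i.start_pos, i.end_pos, i.food_entity, i.drug_entity
        FROM TM_interactions i
        JOIN texts t ON t.text_id = i.text_id
        WHERE i.interaction_id = %s
        """,
        (interaction_id,),
    )
    document, start, end, food, drug = cur.fetchone()
    # The start/end positions delimit the interaction context inside the text.
    return food, drug, document[start:end]
```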
FNS-Cloud will overcome fragmentation problems by integrating existing FNS data, which is essential for high-end, pan-European FNS research addressing FNS, diet, health, and consumer behaviours as well as sustainable agriculture and the bio-economy. Current fragmented FNS resources not only result in knowledge gaps that inhibit public health and agricultural policy, and the food industry, from developing effective solutions, making production sustainable and consumption healthier, but also do not enable exploitation of FNS knowledge for the benefit of European citizens. FNS-Cloud will, through three Demonstrators (Agri-Food, Nutrition & Lifestyle, and NCDs & the Microbiome), facilitate: (1) analyses of regional and country-specific differences in diet, including nutrition, (epi)genetics, microbiota, consumer behaviours, culture and lifestyle, and their effects on health (obesity, NCDs, ethnic and traditional foods), which are essential for public health and agri-food and health policies; (2) improved understanding of agricultural differences within Europe and what these mean for creating sustainable, resilient food systems for healthy diets; and (3) clear definitions of boundaries and how these affect the composition of foods and consumer choices and, ultimately, personal and public health in the future. Long-term sustainability of the FNS-Cloud will be based on Services that have the capacity to link with new resources and enable cross-talk amongst them; access to FNS-Cloud data will be open, underpinned by the FAIR principles (findable, accessible, interoperable and re-usable). FNS-Cloud will work closely with the proposed Food, Nutrition and Health Research Infrastructure (FNHRI) as well as METROFOOD-RI and other existing ESFRI RIs (e.g. ELIXIR, ECRIN) in which several FNS-Cloud Beneficiaries are involved directly. (https://cordis.europa.eu/project/id/863059)
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of 1244.08; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.

Database Management Systems

As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection. The attacks were performed with the SQLMap tool.
The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets that share some common properties and pass through a network device.
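To make that definition concrete, the sketch below groups packets that share a common key (the classic 5-tuple of source/destination address, source/destination port and protocol) into unidirectional flows. The dict-based packet representation is a stand-in, not DOROTHEA's internal format.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets into flows keyed by the NetFlow 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:  # each pkt is a hypothetical dict of parsed fields
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return flows
```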
Datasets
The first dataset was collected to train the detection models (D1); the other was collected using attacks different from those used in training, to test the models and ensure their generalization (D2).
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset | Aim | Samples | Benign-malicious traffic ratio
---|---|---|---
D1 | Training | 400,003 | 50%
D2 | Test | 57,239 | 50%
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator with the ipt_netflow sensor installed. The sensor is a Linux kernel module, using Iptables, that processes the packets and converts them into NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1800 seconds (30 minutes).
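A minimal sketch of those two expiry rules; the real logic lives in the ipt_netflow kernel module, so this only illustrates the timeouts stated above.

```python
INACTIVE_TIMEOUT = 15    # seconds without a packet before export
ACTIVE_TIMEOUT = 1800    # seconds (30 minutes) a flow may stay open

def should_export(first_seen, last_seen, now):
    # Export when the flow has been idle for 15 s, or open for 30 min.
    return (now - last_seen >= INACTIVE_TIMEOUT
            or now - first_seen >= ACTIVE_TIMEOUT)
```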
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts; users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends them to a NetFlow data generation node (packets received from the Internet are handled similarly).
The malicious traffic (SQLI attacks) was generated using SQLMap, a penetration testing tool that automates the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMap with the parameters in the following table.
Parameters | Description
---|---
'--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments', '--schema' | Enumerate users, password hashes, privileges, roles, databases, tables and columns
--level=5 | Increase the probability of a false positive identification
--risk=3 | Increase the probability of extracting data
--random-agent | Select the User-Agent randomly
--batch | Never ask for user input; use the default behavior
--answers="follow=Y" | Predefined answers to yes
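For illustration, the snippet below launches SQLMap with the flags from the table via Python's subprocess module; the target URL is hypothetical, everything else mirrors the listed parameters.

```python
import subprocess

enumeration = [
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles", "--dbs", "--tables",
    "--columns", "--schema", "--count", "--dump", "--comments",
]
cmd = [
    "sqlmap", "-u", "http://victim.example/form.php?id=1",  # hypothetical target
    "--level=5", "--risk=3", "--random-agent", "--batch", "--answers=follow=Y",
] + enumeration
subprocess.run(cmd, check=False)
```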
Every node executed SQLIA against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, connected to either a MySQL or a SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% SQL Server).
The web service was accessible from ports 443 and 80, the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes; for victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQL Server databases.
For D2, however, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks also differed from those of D1: in D2, the address space was 152.148.48.1/24 for benign and malicious traffic-generating nodes and 140.30.20.1/24 for victim nodes.
The MySQL server was provided by MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.
As of June 2024, the most popular relational database management system (RDBMS) worldwide was Oracle, with a ranking score of 1244.08. Oracle was also the most popular DBMS overall. MySQL and Microsoft SQL Server rounded out the top three.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises two .csv files used within workstream 2 of the Wellcome Trust-funded ‘Orphan drugs: High prices, access to medicines and the transformation of biopharmaceutical innovation’ project (219875/Z/19/Z). They appear in various outputs, e.g. publications and presentations.
The deposited data were gathered using the University of Amsterdam Digital Methods Initiative’s ‘Twitter Capture and Analysis Toolset’ (DMI-TCAT) before being processed and extracted from Gephi. DMI-TCAT queries Twitter’s STREAM Application Programming Interface (API) using SQL and retrieves data on a pre-set text query, then sends the returned data for storage in a MySQL database. The tool allows that data to be output in various formats. This process aligns fully with Twitter’s terms and conditions of service. The query for the deposited dataset gathered a 1% random sample of all public tweets posted between 10-Feb-2021 and 10-Mar-2021 containing the text ‘Rare Diseases’ and/or ‘Rare Disease Day’, storing it in a local MySQL database managed by the University of Sheffield School of Sociological Studies (http://dmi-tcat.shef.ac.uk/analysis/index.php), accessible only via a valid VPN such as FortiClient and through a permitted active directory user profile. The dataset was output from the MySQL database raw as a .gexf file, suitable for social network analysis (SNA). It was then opened using Gephi (0.9.2) data visualisation software and anonymised/pseudonymised in Gephi as per the ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee on 02-Jun-201 (reference: 039187). The deposited dataset comprises two anonymised/pseudonymised social network analysis .csv files extracted from Gephi, one containing node data (Issue-networks as excluded publics – Nodes.csv) and another containing edge data (Issue-networks as excluded publics – Edges.csv). Where participants explicitly provided consent, their original username has been provided. Where they provided consent on the basis that they not be identifiable, their username has been replaced with an appropriate pseudonym. All other usernames have been anonymised with a randomly generated 16-digit key. The level of anonymity for each Twitter user is provided in column C of the deposited file ‘Issue-networks as excluded publics – Nodes.csv’.
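A minimal sketch of the three-level anonymisation described above: consented users keep their username, users who consented on condition of anonymity receive a pseudonym, and everyone else is replaced by a random 16-digit key. The row layout and the pseudonym mapping are assumptions; the deposited files define the real format.

```python
import secrets

def anonymise(rows, consented, pseudonyms):
    """rows: dicts with a 'username' field (assumed layout)."""
    keys = {}
    for row in rows:
        user = row["username"]
        if user in consented:
            pass                                # keep the original username
        elif user in pseudonyms:
            row["username"] = pseudonyms[user]  # agreed pseudonym
        else:
            if user not in keys:                # reuse one key per user
                keys[user] = "".join(
                    secrets.choice("0123456789") for _ in range(16))
            row["username"] = keys[user]
        yield row
```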
This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 26-Aug-2021 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman Institute/School of Sociological Studies. ORDA has full permission to store this dataset and to make it open access for public re-use without restriction under a CC BY license, in line with the Wellcome Trust commitment to making all research data Open Access.
The University of Sheffield is the designated data controller for this dataset.
The radiocarbon dating laboratory of IRPA/KIK was founded in the 1960s. Initially, dates were reported at more or less regular intervals in the journal Radiocarbon (Schreurs 1968). Since the advent of radiocarbon dating in the 1950s it had been common practice amongst radiocarbon laboratories to publish their dates in so-called ‘date-lists’ arranged per laboratory. This was first done in the Radiocarbon Supplement of the American Journal of Science and later in the specialised journal Radiocarbon. In the course of time the latter, with the added subtitle An International Journal of Cosmogenic Isotope Research, became a regular scientific journal, shifting focus from date-lists to articles. Furthermore, the worldwide exponential increase in radiocarbon dates made it almost impossible to publish them all in the same journal, even more so because of the broad range of applications that use radiocarbon analysis, ranging from archaeology and art history to geology and oceanography, and recently also biomedical studies.

The IRPA/KIK database
From 1995 onwards IRPA/KIK’s Radiocarbon laboratory started to publish its dates in small publications, continuing the numbering of the preceding lists in Radiocarbon. The first booklet in this series was “Royal Institute for Cultural Heritage Radiocarbon dates XV” (Van Strydonck et al. 1995), followed by three more volumes (XVI, XVII, XVIII). The next list (XIX, 2005) was no longer printed but instead handed out as a PDF file on CD-rom.
The ever-increasing number of dates and the difficulty of handling all the data, however, made us look for a more permanent and easier solution. To improve data management and consultation, it was decided to gather all our dates in a web-based database. List XIX was in fact already a Microsoft Access database that was converted into a reader-friendly style and could also be printed as a PDF file. However, a Microsoft Access database is not the most practical solution for making information publicly available. Hence the structure of the database was recreated in MySQL and the existing content was transferred into the corresponding fields. To display the records, a web-based front-end was programmed in PHP/Apache. It features a full-text search function that allows partial word-matching. In addition, the records can be consulted in PDF format.
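A hedged sketch of a partial word-matching search like the one the front-end offers, expressed as a parameterised MySQL query issued from Python; the actual front-end is written in PHP, and the table and column names here are assumptions.

```python
def search_dates(cur, term):
    """Full-text-style search with partial word matching via LIKE."""
    pattern = f"%{term}%"
    cur.execute(
        """
        SELECT lab_code, site_name, bp_date
        FROM radiocarbon_dates
        WHERE site_name LIKE %s OR description LIKE %s
        """,
        (pattern, pattern),
    )
    return cur.fetchall()
```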
Old records from the printed date-lists as well as new records are now added using the same Microsoft Access back-end, which is now connected directly to the MySQL database. The main problem with introducing the old data was that not all the current criteria were available in the past (e.g. stable isotope measurements). Furthermore, since all the sample information is given by the submitter, its quality largely depends on that person's willingness to contribute as well as on the accuracy and correctness of the information they provide. Sometimes problems arise from the fact that a certain investigation (such as an excavation) is carried out over a relatively long period (sometimes more than ten years) and is directed by different people or even institutions. This can lead to differences in the labelling of the samples, but also in the interpretation of structures and artifacts and in the orthography of the site's name. Finally, the submitter might change address, while the names of institutions or even regions and countries might change as well (e.g. Zaire – Congo).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousand defect reports over a period of 18 years (1999-2017), capturing the inter-relationships among duplicate defects.
File Descriptions
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the Multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The MNB model has 97% accuracy; by comparison, the GNB classifier and the RF both reach 100% accuracy.
Methods
Prior to data collection, the researcher was guided by all ethical training certification on data collection and the right to confidentiality and privacy, under Institutional Review Board (IRB) approval. Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory confirmation result from diagnosis. The data were divided into two tables: the first, called data1, contains data for phase 1 of the classification, while the second, data2, contains data for phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals for 2017 to 2021. These are the data used for modeling and analysis, bearing in mind the geographical location and socio-economic factors available for the patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other will be used for testing. The first training portion, taken from one table in the database, is called training set 1; the second, taken from another table in the database, is called training set 2.
The dataset was split into two parts: 70% for training and the remaining 30% for testing. Using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were then tested on the remaining 30% of the data, and the results were compared with other machine learning models using the standard metrics.
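A minimal sketch of this split-and-train procedure using scikit-learn's MultinomialNB, assuming the 15 attributes are encoded as non-negative integers; the data below is a synthetic stand-in, not the hospital dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 15))  # synthetic stand-in: 15 attributes
y = rng.integers(0, 2, size=500)        # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
model = MultinomialNB().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```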
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification techniques in two phases: Classification phase 1 and Classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as a test data set.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part with classifier 2, as follows:
Classifier phase 1: classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if not, the patient is classified as negative (N).
Classifier phase 2: classifies only the records labelled positive by classifier 1, further dividing them into the complicated and uncomplicated class labels. This classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, supply their values.
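A hedged sketch of that two-phase cascade, assuming both phases are trained MultinomialNB models and that phase 1 emits "P"/"N" labels; the label strings are illustrative.

```python
import numpy as np

def two_phase_predict(clf1, clf2, X):
    # Phase 1: positive (P) / negative (N) for malaria.
    phase1 = clf1.predict(X)
    results = phase1.astype(object).copy()
    positives = np.where(phase1 == "P")[0]
    if positives.size:
        # Phase 2: grade positives as complicated / uncomplicated.
        grades = clf2.predict(X[positives])
        results[positives] = ["P/" + g for g in grades]
    return results
```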
THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 17, 2013. A database that provides access to data from several gene expression profile analysis results of smokers and non-smokers. In the experiment, researchers first obtained brushings from intra-pulmonary airways (the right upper lobe carina) and scrapings from the buccal mucosa, from normal smoking and non-smoking volunteers. RNA was isolated from these samples and gene expression profiles from intra-pulmonary airway epithelial cells were analyzed using Affymetrix U133A human gene expression arrays. All microarray data from these experiments have been stored, preprocessed and analyzed in a relational MySQL database that is accessible through this website.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge Graph Construction Workshop 2023: challenge
Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics, e.g. CPU or memory usage, are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU, memory usage, or a combination of these metrics.
Task description
The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution time to create the fastest pipeline, but also considers computing resources to achieve the most efficient pipeline.
We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates metrics such as execution time, CPU and memory usage, necessary for this challenge, as CSV files. Moreover, information about the hardware used during the execution of the pipeline is available as well, to allow fair comparison of different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool has already been tested with existing systems, relational databases (e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and OpenLink Virtuoso), which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.
The set of new specifications for the RDF Mapping Language (RML), established by the W3C Community Group on Knowledge Graph Construction, provides a set of test-cases for each module:
These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track has Beta status because these new specifications have not yet seen any implementation, so it may contain bugs and issues. If you find problems with the mappings, output, etc., please report them to the corresponding repository of each module.
Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.
Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.
Part 1: Knowledge Graph Construction Parameters
These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.
Data
- Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
- Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
- Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of input files: scaling the number of datasets (1, 5, 10, 15).
Mappings
- Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
- Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
- Number and type of joins: scaling the number of joins and the type of joins (1-1, N-1, 1-N, N-M).
Part 2: GTFS-Madrid-Bench
The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.
Scaling
- GTFS-1 SQL
- GTFS-10 SQL
- GTFS-100 SQL
- GTFS-1000 SQL
Heterogeneity
- GTFS-100 XML + JSON
- GTFS-100 CSV + XML
- GTFS-100 CSV + JSON
- GTFS-100 SQL + XML + JSON + CSV
Example pipeline
The ground truth dataset and baseline results are generated in different steps for each parameter:
1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.
3. The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.
4. The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.
The pipeline is executed 5 times, and the median execution time of each step is calculated and reported. Each step with its median execution time is then reported in the baseline results with all its measured metrics.
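A minimal sketch of that aggregation, assuming one metrics CSV per run with "step" and "execution_time" columns; the challenge tool defines the real file layout.

```python
import csv
import statistics
from collections import defaultdict

def median_per_step(csv_paths):
    """Median execution time per pipeline step across the 5 runs."""
    times = defaultdict(list)
    for path in csv_paths:                 # one metrics CSV per run
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                times[row["step"]].append(float(row["execution_time"]))
    return {step: statistics.median(ts) for step, ts in times.items()}
```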
Query timeout is set to 1 hour and the knowledge graph construction timeout to 24 hours.
The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool; you can adapt the execution plans of this example pipeline to your own needs.
Each parameter has its own directory in the ground truth dataset with the
following files:
metadata.json
Datasets
Knowledge Graph Construction Parameters
The dataset consists of:
Format
All input datasets are provided as CSV; depending on the parameter being evaluated, the number of rows and columns may differ. The first row is always the CSV header.
GTFS-Madrid-Bench
The dataset consists of:
Format
CSV datasets always have a header as their first row.
JSON and XML datasets have their own schema.
Evaluation criteria
Submissions must evaluate the following metrics:
Expected output
Duplicate values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500020 triples |
50 percent | 1000020 triples |
75 percent | 500020 triples |
100 percent | 20 triples |
Empty values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500000 triples |
50 percent | 1000000 triples |
75 percent | 500000 triples |
100 percent | 0 triples |
Mappings
Scale | Number of Triples |
---|---|
1TM + 15POM | 1500000 triples |
3TM + 5POM | 1500000 triples |
5TM + 3POM | 1500000 triples |
15TM + 1POM | 1500000 triples |
Properties
Scale | Number of Triples |
---|---|
1M rows 1 column | 1000000 triples |
1M rows 10 |
Approximately 82 percent of the surveyed software companies in Russia mentioned PostgreSQL, making it the most popular database management system (DBMS) in the period between February and May 2022. MS SQL and MySQL followed, having been mentioned by 47 percent and 41 percent of respondents, respectively.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The flatback mud crab (Eurypanopeus depressus) transcriptome was assembled from 15 individuals sequenced on 2 Illumina HiSeq2000 lanes (PE100) using Trinity 2.0.3 and annotated with Trinotate 2.0.1 on a custom MySQL database from The Broad Institute. We exposed 3 flatback mud crabs to one of 4 treatments (total = 12 individuals): non-aerated control, aerated control, oil only (Marlin platform Dorado, 1 g/l), and oil-dispersant (Marlin platform Dorado, 1 g/l; COREXIT 9500, 0.1 g/l) for 72 hours to assess the up- and downregulation of genes in muscle tissues. To account for stress caused by laboratory treatments, muscle tissue from three reference individuals that were sacrificed without exposure to any lab treatment was analyzed. This dataset reports upregulated and downregulated gene expression. NCBI accession numbers are provided for each sample.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is an active collection of access data for information items in the University of Tasmania's EPrints repository. Each night a scheduled task runs, picking up in the Apache access logs where it left off the previous night. Each download of an open-access full-text item generates a record in the MySQL database, together with a timestamp and an approximate location of the computer system performing the download. This is achieved by looking up the IP address in the GeoIP database, with one significant difference: downloads originating from a University of Tasmania IP address are separately identified and removed from the Australia category. This prevents vanity searches from achieving high significance. Countries are coded using the ISO 3166 two-letter code.
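A hedged sketch of the per-download classification described above, using the geoip2 package as a stand-in for the GeoIP lookup; the repository's own code is written in PHP and Perl, and the University of Tasmania address range shown is hypothetical.

```python
import ipaddress
import geoip2.database
import geoip2.errors

UTAS_NET = ipaddress.ip_network("131.217.0.0/16")  # hypothetical UTas range

def classify_download(ip, reader):
    if ipaddress.ip_address(ip) in UTAS_NET:
        return "UTAS"          # kept out of the Australia category
    try:
        return reader.country(ip).country.iso_code or "??"  # ISO 3166 alpha-2
    except geoip2.errors.AddressNotFoundError:
        return "??"

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")
print(classify_download("8.8.8.8", reader))
```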
The dataset has been used to analyse the usage made of the repository and to tune it to achieve maximal visibility for the University of Tasmania. Researchers with items in the repository have used it to identify the types of use being made of their work, and to find potential collaborators. The citation of a work in a journal or conference article, for example, causes a typical step in usage, and the citing article can be searched in Google or Google Scholar to identify the authors. This enhances the dissemination experience and its value.
The software was written at the University of Tasmania by Professor Arthur Sale (in PHP), based on earlier work by the University of Melbourne (with permission). Mr Christian McGee wrote some critical sections of the code in Perl and set up the cron scheduling.
The dataset is generated by a computer program written by Professor Arthur Sale. The software was a test bed for ideas and subsequently resulted in an official software set included in the EPrints distribution. This set expanded on the concepts significantly.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The River Ouse Project was started by Dr Margaret Pilkington and colleagues in the Centre for Continuing Education, University of Sussex. Margaret is now retired with emeritus status and continues to run the project with a team of volunteers, in association with the University of Sussex. The team does botanical surveys of streamside grassland and steep wooded valleys (gills) in the upper reaches of the Sussex Ouse, a short flashy river arising on the southern slopes of the High Weald AONB (Area of Outstanding Natural Beauty). Survey sites are chosen on the basis of species richness, potential for restoration and contribution to flood control, and surveyed using the sampling methods outlined in Rodwell, J.S. (1992), British Plant Communities, Volume 3: Grasslands and Montane Communities. Survey data are transferred from the paper records taken in the field to Excel spreadsheets, and from there, after validation and cleaning, into two MySQL (MariaDB) databases, meadows and gills.
The file is an extract from the meadows database. It contains binary site-occupancy data for most of the plants encountered in meadow sites (stands, assemblies) sampled using five 2m x 2m quadrats. Details of the database are available here: River Ouse Project databases.
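A minimal sketch of reading such an extract, assuming one row per species and one 0/1 column per site; the filename and exact layout are assumptions.

```python
import pandas as pd

occ = pd.read_csv("meadows_extract.csv", index_col=0)  # hypothetical filename
species_richness = occ.sum(axis=0)  # species recorded per site (column sums)
site_frequency = occ.sum(axis=1)    # number of sites occupied per species
print(species_richness.sort_values(ascending=False).head())
```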
For further details and access to the full database contact the author.