Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of the various algorithms for Entrez Gene identifier assignment, as measured on the BioCreative III dataset. The canonical and family assignment algorithms both refer to the combined procedure, which uses the taxonomic assignments by GenNorm to enable species-specific ID disambiguation (Figure 2, Combination 1–2).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file provides additional details on the pathway curation use-case, which describes a subsection of the human p53 signaling pathway. In this supplemental file, the data on the full p53 pathway are also provided. (XLS)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"This data accompanies the following publication:
Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task
Journal: Journal of the American Medical Informatics Association (JAMIA)
The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).
Please use the latest version of these files to avoid inconsistencies: (currently v2) "
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The anemia dataset used in this study was obtained from the Faculty of Medicine, Tokat Gaziosmanpaşa University, Turkey. The data contain the complete blood count test results of 15,300 patients over the 5-year interval between 2013 and 2018. Data from pregnant women, children, and patients with cancer were excluded from the study. Noise in the dataset was eliminated, and parameters considered insignificant for the diagnosis of anemia were excluded with the help of experts. Some records had missing parameter values or values outside the reference ranges of the parameters, which specialist doctors marked as noise in our study; these records were removed from the dataset. The Pearson correlation method was used to check whether the parameters are related to one another. The relationships between the parameters were generally weak, with correlation coefficients below 0.4 [59], so none of the parameters were excluded from the dataset on this basis. Twenty-four features (Table 1) and 5 classes (Table 2) were used in the study. Because the value ranges of the parameters differed greatly, a linear transformation was applied to the data using min-max normalization [30]. The dataset consists of data from 15,300 patients, of whom 10,379 were female and 4921 were male. It contains 1019 (7%) patients with HGB-anemia, 4182 (27%) patients with iron deficiency, 199 (1%) patients with B12 deficiency, 153 (1%) patients with folate deficiency, and 9747 (64%) patients who had no anemia (Table 2). The transferrin saturation in the dataset is given by the "SDTSD" feature, computed using Eq. (1), which was developed with the help of a specialist physician. Saturation is the ratio of serum iron to total serum iron.
In the equation, SD represents serum iron and TSD represents total serum iron.
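The two transformations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and sample values are hypothetical, and because Eq. (1) is not reproduced here, the ratio below follows the verbal definition (serum iron divided by total serum iron) without any scaling factor the original equation may include.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sdtsd(serum_iron, total_serum_iron):
    """Ratio of serum iron (SD) to total serum iron (TSD).

    Assumption: a plain ratio, following the text's verbal definition.
    """
    return serum_iron / total_serum_iron


# Illustrative values only, not taken from the dataset.
hemoglobin = [9.5, 12.0, 14.5]
print(min_max_normalize(hemoglobin))  # [0.0, 0.5, 1.0]
print(sdtsd(50.0, 200.0))             # 0.25
```

Normalizing each feature separately keeps parameters with large numeric ranges from dominating those with small ranges.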
Motivation
Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet, this dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.
Approach
For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.
After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” of it. In this process, we masked pixels afflicted by clouds and cloud shadows using Landsat's quality information.
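A minimal sketch of this averaging step is shown below. It assumes each sample for a given date is already available as a list of band values plus a flag saying whether the pixel is clear; the data layout and the function name are hypothetical and are not taken from the repository code.

```python
def mean_profile(samples):
    """Average the band values of all clear samples for one acquisition date.

    `samples` is a list of (band_values, is_clear) tuples; samples flagged
    as cloud or cloud shadow (is_clear == False) are ignored.
    Returns None when no clear sample is available.
    """
    clear = [bands for bands, is_clear in samples if is_clear]
    if not clear:
        return None
    n_bands = len(clear[0])
    return [sum(bands[i] for bands in clear) / len(clear) for i in range(n_bands)]


# Illustrative values only: two clear samples and one cloudy sample.
inside = [([100, 200], True), ([300, 400], True), ([900, 900], False)]
print(mean_profile(inside))  # [200.0, 300.0]
```

The same function would be applied once to the "inside" samples and once to the "outside" samples of each date.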
Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site.
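The comparison of the two mean profiles can be illustrated as follows. The text does not state which normalization was used, so this sketch assumes a min-max rescaling of each profile across its bands; the function names are hypothetical.

```python
import math


def normalize_profile(profile):
    """Rescale a spectral profile to [0, 1] across its bands (assumed scheme)."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return [0.0] * len(profile)
    return [(v - lo) / (hi - lo) for v in profile]


def profile_rmse(inside, outside):
    """RMSE between the normalized 'inside' and 'outside' mean profiles."""
    a = normalize_profile(inside)
    b = normalize_profile(outside)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Same shape at different brightness: no difference after normalization.
print(profile_rmse([100, 200, 300], [200, 400, 600]))  # 0.0
# Different shape: the RMSE becomes positive.
print(profile_rmse([100, 300, 200], [100, 200, 300]))
```

The first call shows why normalization matters: two profiles that differ only in overall magnitude yield an RMSE of zero, so only changes in spectral shape contribute to the time-series.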
We then used these data to infer the first mining year. To achieve this, we first derived the cumulative sum of the RMSE time-series, with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, the removal of trees would drive an increase in the RMSE, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining operation was accompanied by the removal of vegetation along its outskirts, where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.
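As a small illustration of the accumulation step, the sketch below builds the cumulative sum of an RMSE series with the standard library. The values are invented for the example and the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_rmse(rmse_series):
    """Running total of an RMSE time-series, ordered by acquisition date."""
    return list(accumulate(rmse_series))


# Illustrative series: low RMSE before the mine appears, higher afterwards.
rmse = [0.25, 0.25, 0.25, 1.0, 1.0]
print(cumulative_rmse(rmse))  # [0.25, 0.5, 0.75, 1.75, 2.75]
```

In the accumulated series the change shows up as a change of slope: the curve steepens where the RMSE rises and flattens where it falls, which is easier to locate than a shift in the noisy raw values.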
To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex shape of a time-series which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. When below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
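The idea of curve rotation can be sketched in plain Python as follows. This is not the kneebow code and may differ from it in detail: it assumes that both axes are rescaled to [0, 1], that the curve is rotated so the line joining its first and last points becomes horizontal, and that the elbow and knee are the lowest and highest points of the rotated curve. The function names are hypothetical, and the AUC-based choice between the two indices and the 100-step threshold are left out.

```python
import math


def rotate_curve(y):
    """Return the rotated y-coordinates of a series sampled at equal steps."""
    n = len(y)
    xs = [i / (n - 1) for i in range(n)]
    lo, hi = min(y), max(y)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat series
    ys = [(v - lo) / span for v in y]
    # Angle of the line joining the first and last points.
    theta = math.atan2(ys[-1] - ys[0], xs[-1] - xs[0])
    # Rotate every point by -theta so that this line becomes horizontal.
    return [-x * math.sin(theta) + v * math.cos(theta) for x, v in zip(xs, ys)]


def elbow_and_knee(y):
    """Indices of the lowest (elbow) and highest (knee) rotated points."""
    rotated = rotate_curve(y)
    indices = range(len(rotated))
    elbow = min(indices, key=rotated.__getitem__)
    knee = max(indices, key=rotated.__getitem__)
    return elbow, knee


# Illustrative cumulative series: the slope increases after index 2 ...
print(elbow_and_knee([1, 2, 3, 7, 11, 15])[0])   # elbow index: 2
# ... and here the slope decreases after index 2.
print(elbow_and_knee([4, 8, 12, 13, 14, 15])[1])  # knee index: 2
```

The detected index would then be mapped back to its acquisition date to obtain the year of the shift.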
Content
This repository contains the infrastructure used to infer the start of a mining operation, which is organized as follows:
00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.
01_analysis - Contains several outputs of our analysis:
  xy.tar.gz - Sample locations for each mining site.
  sr.tar.gz - Spectral profiles for each sample location.
  mine_start.csv - First year when we detected the start of mining.
02_code - Includes all code used in our analysis.
requirements.txt - Python module requirements that can be fed to pip to replicate our study.
config.yml - Configuration file, including information on the Landsat products used.
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality manually annotated corpus for genes, covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per article, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each article to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene finding algorithm based on deep learning which improved on both the precision and recall of existing tools. ...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The rapid evolution of mass spectrometry (MS)-based lipidomics has enabled the simultaneous measurement of numerous lipid classes. With lipidomics datasets becoming increasingly available, lipidomic-focused software tools are required to facilitate data analysis as well as mining of public datasets, integrating lipidomics-unique molecular information such as lipid class, chain length, and unsaturation. To address this need, we developed lipidr, an open-source R/Bioconductor package for data mining and analysis of lipidomics datasets. lipidr implements a comprehensive lipidomic-focused analysis workflow for targeted and untargeted lipidomics. lipidr imports numerical matrices, Skyline exports, and Metabolomics Workbench files directly into R, automatically inferring lipid class and chain information from lipid names. Through integration with the Metabolomics Workbench API, users can search, download, and reanalyze public lipidomics datasets seamlessly. lipidr allows thorough data inspection, normalization, and uni- and multivariate analyses, displaying results as interactive visualizations. To enable interpretation of lipid class, chain length, and total unsaturation data, we also developed and implemented a novel lipid set enrichment analysis. A companion online guide with two live example datasets is presented at https://www.lipidr.org/. We expect that the ease of use and innovative features of lipidr will allow the lipidomics research community to gain novel detailed insights from lipidomics data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of pGenN & GenNorm on use case data set.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: Data (V1, V2) are mean brain volumes after normalization to the supratentorial volume with a scale factor of 1000 (no unit). **Calculated with two-sample t test to obtain original p-value (shown with P