9 datasets found

Performance of the various normalization algorithms.
plos.figshare.com
xls
Updated Jun 4, 2023
Cite
Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter (2023). Performance of the various normalization algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0055814.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0055814.t004
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS ONE
Authors
Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of the various algorithms for Entrez Gene identifier assignment, as measured on the BioCreative III dataset. The canonical and family assignment algorithms both refer to the combined procedure, which uses the taxonomic assignments by GenNorm to enable species-specific ID disambiguation (Figure 2, Combination 1–2).
Data S1 - Large-Scale Event Extraction from Literature with Multi-Level Gene...
plos.figshare.com
figshare.com
xls
Updated May 30, 2023
Cite
Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter (2023). Data S1 - Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization [Dataset]. http://doi.org/10.1371/journal.pone.0055814.s001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0055814.s001
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This file provides additional details on the pathway curation use-case, which describes a subsection of the human p53 signaling pathway. In this supplemental file, the data on the full p53 pathway are also provided. (XLS)
SMM4H (Social Media Mining for Health Shared Task)
opendatalab.com
zip
Updated Oct 1, 2018
+ more versions
Cite
University of Pennsylvania (2018). SMM4H (Social Media Mining for Health Shared Task) [Dataset]. https://opendatalab.com/OpenDataLab/SMM4H
Explore at:
Available download formats: zip (1663992 bytes)
Dataset updated
Oct 1, 2018
Dataset provided by
University of Pennsylvania
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data accompanies the following publication:

Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task

Journal: Journal of the American Medical Informatics Association (JAMIA)

The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).

Please use the latest version of these files to avoid inconsistencies (currently v2).
Hybrid models based on genetic algorithm and deep learning algorithms for...
data.mendeley.com
Updated Oct 18, 2022
Cite
Serhat KILIÇARSLAN (2022). Hybrid models based on genetic algorithm and deep learning algorithms for nutritional Anemia disease classification. Biomedical Signal Processing and Control, 63, 102231. https://doi.org/10.1016/j.bspc.2020.102231 [Dataset]. http://doi.org/10.17632/dt89jydgnv.1
Explore at:
Unique identifier
https://doi.org/10.17632/dt89jydgnv.1
Dataset updated
Oct 18, 2022
Authors
Serhat KILIÇARSLAN
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The anemia dataset used in this study was obtained from the Faculty of Medicine, Tokat Gaziosmanpaşa University, Turkey. The data contain the complete blood count test results of 15,300 patients over the 5-year interval between 2013 and 2018. Data from pregnant women, children, and patients with cancer were excluded from the study. Noise in the dataset was eliminated, and parameters considered insignificant for the diagnosis of anemia were excluded with the help of experts. Some records had missing parameter values or values outside the reference ranges of the parameters, which specialist doctors marked as noise in our study; these records were removed from the dataset. The Pearson correlation method was used to check whether there is any relationship between the parameters. The relationships between the parameters in the dataset are generally weak, below p < 0.4 [59]. For this reason, none of the parameters were excluded from the dataset. Twenty-four features (Table 1) and 5 classes (Table 2) in the dataset were used in the study. Because the differences between the parameter values in the dataset were very large, a linear transformation was applied to the data with min-max normalization [30]. The dataset consists of data from 15,300 patients, of whom 10,379 were female and 4,921 were male. It includes 1,019 (7%) patients with HGB-anemia, 4,182 (27%) patients with iron deficiency, 199 (1%) patients with B12 deficiency, 153 (1%) patients with folate deficiency, and 9,747 (64%) patients who had no anemia (Table 2). The transferrin saturation in the dataset was obtained through the "SDTSD" feature, using Eq. (1), which was developed with the help of a specialist physician. Saturation is the ratio of serum iron to total serum iron.
In Eq. (1), SD represents serum iron and TSD represents total serum iron.
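The two transformations named in this description, min-max normalization and the saturation ratio, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Eq. (1) itself is not reproduced in this listing, so the ratio below follows the sentence "Saturation is the ratio of serum iron to total serum iron", and the function names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def saturation(serum_iron, total_serum_iron):
    """Ratio SD / TSD, assumed from the prose definition of the SDTSD feature."""
    return serum_iron / total_serum_iron


# Hypothetical readings, for illustration only (not taken from the dataset).
hgb = [8.0, 10.0, 12.0, 16.0]
normalized_hgb = min_max_normalize(hgb)
sdtsd = saturation(50.0, 200.0)
```

Min-max scaling keeps the ordering of values within a feature while putting all features on a common scale, which is the reason the description gives for applying it.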
Onset of mining operations
data.niaid.nih.gov
zenodo.org
Updated Mar 17, 2024
Cite
Remelgado, Ruben (2024). Onset of mining operations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214548
Explore at:
Dataset updated
Mar 17, 2024
Dataset provided by
Remelgado, Ruben
Meyer, Carsten
Description
Motivation

Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet, this dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.

Approach

For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area, and once for those “outside” of it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.

Using the time series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies, or simply to differences in acquisition dates. This resulted in an RMSE time series for each mining site.
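The normalize-then-compare step can be sketched as follows. This is a minimal illustration, not the authors' code: the description does not say which normalization was used, so min-max scaling of each profile is an assumption here, and the function names and band values are hypothetical.

```python
import math


def normalize_profile(profile):
    """Min-max scale one multi-spectral profile (assumed normalization)."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return [0.0] * len(profile)
    return [(band - lo) / (hi - lo) for band in profile]


def profile_rmse(inside, outside):
    """RMSE between the normalized 'inside' and 'outside' mean profiles."""
    if len(inside) != len(outside):
        raise ValueError("profiles must have the same number of bands")
    a = normalize_profile(inside)
    b = normalize_profile(outside)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical mean reflectances for four bands on one acquisition date.
inside = [0.10, 0.20, 0.30, 0.40]
outside = [0.05, 0.10, 0.15, 0.20]  # same shape, half the brightness
same_shape_rmse = profile_rmse(inside, outside)
opposite_shape_rmse = profile_rmse([0.0, 1.0], [1.0, 0.0])
```

Because each profile is rescaled before comparison, two profiles that differ only in overall brightness give an RMSE close to zero, while profiles with different shapes do not.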

We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.
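A short sketch of the cumulative sum, using hypothetical RMSE values rather than data from the repository, shows how a jump in RMSE appears as a change in the slope of the accumulated curve:

```python
from itertools import accumulate

# Hypothetical RMSE series: low values before a mine appears, higher after.
rmse_series = [0.1, 0.1, 0.1, 0.4, 0.5, 0.5]

# Running total of the series; each entry adds the current RMSE to the sum so far.
cumulative = list(accumulate(rmse_series))

# Step sizes of the cumulative curve equal the original RMSE values,
# so the curve steepens where the RMSE rises.
early_slope = cumulative[1] - cumulative[0]
late_slope = cumulative[4] - cumulative[3]
```

Small fluctuations in individual RMSE values change the running total only slightly, whereas a sustained rise changes its slope from that point on.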

To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex shape of the time series, which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time series with at least 100 time steps. Below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
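The idea of locating a shift by curve rotation can be illustrated with a small stand-alone sketch. It is a stand-in written for this listing, not kneebow's implementation and not the authors' code. The function names, the `min_steps` parameter, and the sample series are hypothetical. The sketch deliberately avoids the labels "knee" and "elbow": it scales the curve to the unit square, rotates it so the line from the first to the last point lies flat, and picks the point farthest below or above that line, using the normalized AUC to decide which side to use.

```python
import math


def _scale(values):
    """Min-max scale a sequence to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rotate_curve(y):
    """Rotate the scaled curve so its first-to-last chord is horizontal."""
    xs = _scale(list(range(len(y))))
    ys = _scale(y)
    angle = math.atan2(ys[-1] - ys[0], xs[-1] - xs[0])
    # Rotation by -angle; only the rotated y coordinate is needed.
    return [-x * math.sin(angle) + v * math.cos(angle) for x, v in zip(xs, ys)]


def normalized_auc(y):
    """Trapezoid-rule area under the curve after scaling to the unit square."""
    ys = _scale(y)
    n = len(ys)
    return sum((ys[i] + ys[i + 1]) / 2 for i in range(n - 1)) / (n - 1)


def change_point(y, min_steps=100):
    """Index of the strongest bend, or None when the series is too short."""
    if len(y) < max(min_steps, 3):
        return None
    rotated = rotate_curve(y)
    indices = range(len(rotated))
    if normalized_auc(y) < 0.5:
        # Curve lies mostly below the chord: take the lowest rotated point.
        return min(indices, key=rotated.__getitem__)
    # Curve lies mostly above the chord: take the highest rotated point.
    return max(indices, key=rotated.__getitem__)


# Hypothetical cumulative RMSE curves; min_steps is lowered for the demo.
steepening = [0.1, 0.2, 0.3, 0.4, 1.0, 1.6, 2.2]
flattening = [0.6, 1.2, 1.8, 2.4, 2.5, 2.6, 2.7]
steepening_idx = change_point(steepening, min_steps=5)
flattening_idx = change_point(flattening, min_steps=5)
too_short = change_point(steepening)  # default threshold of 100 steps
```

Returning `None` for short series mirrors the 100-step threshold in the description, where no shift is estimated and the mine is assumed to have been present since 1990.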

Content

This repository contains the infrastructure used to infer the start of a mining operation, which is organized as follows:

00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

01_analysis - Contains several outputs of our analysis:

xy.tar.gz - Sample locations for each mining site.

sr.tar.gz - Spectral profiles for each sample location.

mine_start.csv - First year when we detected the start of mining.

02_code - Includes all code used in our analysis.

requirements.txt - Python module requirements that can be fed to pip to replicate our study.

config.yml - Configuration file, including information on the Landsat products used.
Data from: NLM-Gene, a richly annotated gold standard dataset for gene...
search.dataone.org
datadryad.org
Updated Apr 20, 2025
Cite
Rezarta Islamaj; Zhiyong Lu (2025). NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition [Dataset]. http://doi.org/10.5061/dryad.dv41ns1wt
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.dv41ns1wt
Dataset updated
Apr 20, 2025
Dataset provided by
Dryad Digital Repository
Authors
Rezarta Islamaj; Zhiyong Lu
Time period covered
Jan 1, 2020
Description
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality manually annotated corpus for genes, covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per article, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each article to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene-finding algorithm based on deep learning, which improved on both precision and recall over existing tools. ...
lipidr: A Software Tool for Data Mining and Analysis of Lipidomics Datasets
acs.figshare.com
xlsx
Updated May 30, 2023
Cite
Ahmed Mohamed; Jeffrey Molendijk; Michelle M. Hill (2023). lipidr: A Software Tool for Data Mining and Analysis of Lipidomics Datasets [Dataset]. http://doi.org/10.1021/acs.jproteome.0c00082.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jproteome.0c00082.s002
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Ahmed Mohamed; Jeffrey Molendijk; Michelle M. Hill
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The rapid evolution of mass spectrometry (MS)-based lipidomics has enabled the simultaneous measurement of numerous lipid classes. With lipidomics datasets becoming increasingly available, lipidomic-focused software tools are required to facilitate data analysis as well as mining of public datasets, integrating lipidomics-unique molecular information such as lipid class, chain length, and unsaturation. To address this need, we developed lipidr, an open-source R/Bioconductor package for data mining and analysis of lipidomics datasets. lipidr implements a comprehensive lipidomic-focused analysis workflow for targeted and untargeted lipidomics. lipidr imports numerical matrices, Skyline exports, and Metabolomics Workbench files directly into R, automatically inferring lipid class and chain information from lipid names. Through integration with the Metabolomics Workbench API, users can search, download, and reanalyze public lipidomics datasets seamlessly. lipidr allows thorough data inspection, normalization, and uni- and multivariate analyses, displaying results as interactive visualizations. To enable interpretation of lipid class, chain length, and total unsaturation data, we also developed and implemented a novel lipid set enrichment analysis. A companion online guide with two live example datasets is presented at https://www.lipidr.org/. We expect that the ease of use and innovative features of lipidr will allow the lipidomics research community to gain novel detailed insights from lipidomics data.
Performance of pGenN & GenNorm on use case data set.
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Ruoyao Ding; Cecilia N. Arighi; Jung-Youn Lee; Cathy H. Wu; K. Vijay-Shanker (2023). Performance of pGenN & GenNorm on use case data set. [Dataset]. http://doi.org/10.1371/journal.pone.0135305.t009
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0135305.t009
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Ruoyao Ding; Cecilia N. Arighi; Jung-Youn Lee; Cathy H. Wu; K. Vijay-Shanker
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of pGenN & GenNorm on use case data set.
Significant regional volume differences (P
plos.figshare.com
xls
Updated Jun 7, 2023
Cite
Yongxia Zhou; Fang Yu; Timothy Duong (2023). Significant regional volume differences (P [Dataset]. http://doi.org/10.1371/journal.pone.0090405.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0090405.t001
Dataset updated
Jun 7, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yongxia Zhou; Fang Yu; Timothy Duong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Note: Data (V1, V2) are mean brain volumes after normalization to the supratentorial volume with a scale factor of 1000, no unit. **Calculated with two-sample t test to obtain original p-value (shown with P