MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was scraped from the MedQuAD repository and converted to a CSV file. It contains only the questions and answers related to cancer.
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
We included additional annotations in the XML files that can be used for diverse IR and NLP tasks, such as the question type, the question focus, its synonyms, its UMLS Concept Unique Identifier (CUI), and its Semantic Type. We added the category of the question focus (Disease, Drug or Other) in the four MedlinePlus collections. All other collections are about diseases.
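As a minimal sketch of how these XML annotations might be read, the snippet below walks one document with Python's standard library. The file name and the tag and attribute names (Focus, QAPair, Question/@qtype, Answer) are assumptions inferred from the description above, not the confirmed schema; verify them against the actual MedQuAD XML files.

```python
import xml.etree.ElementTree as ET

# Minimal sketch, not the official loader: tag/attribute names are assumptions.
tree = ET.parse("example_medquad_document.xml")  # hypothetical file name
root = tree.getroot()

focus = root.findtext("Focus")
for qapair in root.iter("QAPair"):
    question = qapair.find("Question")
    answer = qapair.findtext("Answer")
    if question is not None:
        print(f"[{question.get('qtype')}] focus={focus}: {question.text}")
        print(answer)
```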
The paper cited below describes the collection, the construction method as well as its use and evaluation within a medical question answering system.
N.B. We removed the answers from 3 subsets to respect the MedlinePlus copyright (https://medlineplus.gov/copyright.html):
(1) A.D.A.M. Medical Encyclopedia, (2) MedlinePlus Drug information, and (3) MedlinePlus Herbal medicine and supplement information.
We kept all other information, including the URLs, in case you want to crawl the answers. Please contact me if you have any questions.
We used the test questions of the TREC-2017 LiveQA medical task: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017/tree/master/TestDataset.
As described in our BMC paper, we manually judged the answers retrieved by the IR and QA systems from the MedQuAD collection, using the same judgment scores as the LiveQA Track: 1-Incorrect, 2-Related, 3-Incomplete, and 4-Excellent. Format of the qrels file: Question_ID judgment Answer_ID
The QA test collection contains 2,479 judged answers that can be used to evaluate the performance of IR & QA systems on the LiveQA-Med test questions: https://github.com/abachaa/MedQuAD/blob/master/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip
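A minimal sketch of loading this qrels file in Python, assuming whitespace-separated fields in the order given above (the file name is hypothetical; use the actual file from the zip linked above):

```python
from collections import defaultdict

# question_id -> {answer_id: judgment}; judgments: 1=Incorrect, 2=Related,
# 3=Incomplete, 4=Excellent. The file name is hypothetical.
judgments = defaultdict(dict)
with open("qrels-liveqa-med.txt") as f:
    for line in f:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        question_id, score, answer_id = fields
        judgments[question_id][answer_id] = int(score)

# Example: fraction of judged answers rated Excellent, per question
for qid, answers in judgments.items():
    excellent = sum(1 for s in answers.values() if s == 4)
    print(qid, f"{excellent}/{len(answers)} excellent")
```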
If you use the MedQuAD dataset and/or the collection of 2,479 judged answers, please cite the following paper: "A Question-Entailment Approach to Question Answering". Asma Ben Abacha and Dina Demner-Fushman. BMC Bioinformatics, 2019.
@ARTICLE{BenAbacha-BMC-2019,
author = {Asma {Ben Abacha} and Dina Demner{-}Fushman},
title = {A Question-Entailment Approach to Question Answering},
journal = {{BMC} Bioinform.},
volume = {20},
number = {1},
pages = {511:1--511:23},
year = {2019},
url = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4}
}
License: The MedQuAD dataset is published under a Creative Commons Attribution 4.0 International License (CC BY 4.0). https://creativecommons.org/licenses/by/4.0/
Feature data for the 2,019 training and test cells. The structure of the data table is as follows. The first column contains unique cell identifiers. The first row contains headers, including:
"Case" – indicates whether the cell was randomly assigned to the training or test set
"Class" – the true class of the cell by human review: poorly segmented (PS) or well-segmented (WS)
The remaining 116 columns contain the feature data for each cell. A total of ~190 features were generated by the Cellomics Morphology Explorer software, but only the subset of 116 features with complete data for every cell was included in our analysis. (CSV, 2 MB)
SOURCE: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-340
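A minimal sketch of splitting this table back into its training and test sets, assuming a plain CSV with the headers described above and "Train"/"Test" values in the Case column (both assumptions; the file name is hypothetical):

```python
import csv

train_rows, test_rows = [], []
with open("cell_features.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        # "Case" assigns each cell to the training or test set
        (train_rows if row["Case"] == "Train" else test_rows).append(row)

print(len(train_rows), "training cells,", len(test_rows), "test cells")
# "Class" holds the human-reviewed label: PS (poorly segmented) or WS (well-segmented)
print("training classes:", {row["Class"] for row in train_rows})
```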
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the TBGA dataset. TBGA is a large-scale, semi-automatically annotated dataset for Gene-Disease Association (GDA) extraction. The dataset consists of three text files, corresponding to train, validation, and test sets, plus an additional JSON file containing the mapping between relation names and IDs. Each record in train, validation, or test files corresponds to a single GDA extracted from a sentence. Records are represented as JSON objects with the following structure:
text: sentence from which the GDA was extracted.
relation: relation name associated with the given GDA.
h: JSON object representing the gene entity, composed of:
id: NCBI Entrez ID associated with the gene entity.
name: NCBI official gene symbol associated with the gene entity.
pos: list consisting of starting position and length of the gene mention within text.
t: JSON object representing the disease entity, composed of:
id: UMLS CUI associated with the disease entity.
name: UMLS preferred term associated with the disease entity.
pos: list consisting of starting position and length of the disease mention within text.
TBGA contains over 200,000 instances and 100,000 bags. The zip file consists of one folder, named TBGA, containing the files corresponding to the dataset.
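A minimal sketch of reading these records in Python, assuming one JSON object per line in each text file (an assumption about the layout; the file name inside the TBGA folder is hypothetical):

```python
import json

with open("TBGA/train.txt") as f:  # hypothetical file name inside the TBGA folder
    for line in f:
        record = json.loads(line)  # assumes one JSON object per line
        gene, disease = record["h"], record["t"]
        start, length = gene["pos"]
        # Recover the gene mention from its position within the sentence
        mention = record["text"][start:start + length]
        print(f"{record['relation']}: {gene['name']} ({mention}) -> {disease['name']}")
```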
If you use or extend our work, please cite the following: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04646-6#citeas
The TBGA paper can be found at: https://rdcu.be/cKkY2
The TBGA code is available at: https://github.com/GDAMining/gda-extraction
GNU GPL 3.0: https://www.gnu.org/licenses/gpl-3.0.html
KVFinder's cavity dataset in CSV. This dataset describes the protein cavities output by KVFinder, a protein cavity detection method. The method is described in the article available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-197
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the manuscript "Gene regulatory network inference using mixed-norms regularized multivariate model with covariance selection" by Alain J. Mbebi & Zoran Nikoloski.
Organisation
The folder Codes contains the following R scripts with the K-fold cross-validation option to learn the hyperparameters:
Mixed_L1L21_GRN.R which computes L1L21-solution
Mixed_L1L21G_GRN.R which computes L1L21G-solution
Mixed_L2L21_GRN.R which computes L2L21-solution
Mixed_L2L21G_GRN.R which computes L2L21G-solution
L1L21_Dream5_Scerevisiae_example_run.R is an example run using the L1L21-solution with S. cerevisiae data (Network 4 in the DREAM5 challenge). All files needed to successfully run L1L21_Dream5_Scerevisiae_example_run.R are located in the folder Codes.
The folder Figures contains all figures in the manuscript.
The folder Inferred-networks contains all network objects for each dataset and each inference method in the comparative analysis.
Dependencies and required packages
The following packages are required for the contending approaches in the comparative analysis: "devtools", "foreach", "plyr", "glmnet" and "randomForest".
GENIE3
The GENIE3 package can be installed from: http://bioconductor.org/packages/release/bioc/html/GENIE3.html
TIGRESS
The TIGRESS repository can be obtained from: https://github.com/jpvert/tigress
ENNET
The ENNET repository can be obtained from: https://github.com/slawekj/ennet
PLSNET
The Matlab source code of PLSNET can be obtained from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1398-6#Sec17
PORTIA
The PORTIA repository can be obtained from: https://github.com/AntoinePassemiers/PORTIA
D3GRN
The Matlab source code of D3GRN can be obtained from: https://github.com/chenxofhit/D3GRN
Fused-LASSO
The fused-LASSO repository can be obtained from: https://github.com/omranian/inference-of-GRN-using-Fused-LASSO
ANOVerence
Because of technical issues (e.g., the code's accessibility at http://www2.bio.ifi.lmu.de/~kueffner/anova.tar.gz), we were not able to reproduce the ANOVerence results and used the inferred network from the DREAM5 challenge instead.
Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across...
Data from: Phylogenomic branch length estimation using quartets
This repository contains the datasets used in the following paper:
Y. Tabatabaee, C. Zhang, T. Warnow, S. Mirarab, Phylogenomic branch length estimation using quartets, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i185–i193, https://doi.org/10.1093/bioinformatics/btad221
For experiments in this study, we studied a collection of simulated and biological datasets with incomplete lineage sorting (ILS). We generated a new quartet dataset and regenerated species trees with substitution-unit branch lengths for previously published datasets from Zhang et al. (2018) and Mai et al. (2017). We also analyzed the mammalian biological dataset from Song et al. (2012) (https://www.pnas.org/doi/full/10.1073...).
GNU AGPL 3.0: http://www.gnu.org/licenses/agpl-3.0.html
Biomedical texts contain a lot of information that can be used for developments in the medical field. Traditionally, domain experts extracted such information manually; automating this extraction task can help speed up progress in the field. To name a few use cases, biomedical events can show the effects of drugs on a person and can be used to identify certain medical conditions. Hence, automating the extraction of events from biomedical texts is very beneficial.
The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES.
It consists of the original biomedical text, labelled trigger words, the location of each trigger word in the text, and the event type associated with the trigger word. There are three splits: train (8k+ sentences), devel (about 3k sentences), and test (about 3k sentences). Each split has four columns, "Sentence", "TriggerWord", "TriggerWordLoc" and "EventType", capturing the original biomedical text, the trigger words in the sentence, the location of the trigger words in the sentence, and the event type associated with the trigger words, respectively.
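A minimal sketch of loading one split and tallying event types, assuming standard CSV files named after the splits (the file name is hypothetical):

```python
import csv
from collections import Counter

event_types = Counter()
with open("train.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        # Columns per the description: Sentence, TriggerWord, TriggerWordLoc, EventType
        event_types[row["EventType"]] += 1

for event_type, count in event_types.most_common():
    print(f"{event_type}: {count}")
```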
The original source dataset is from the BioNLP Shared Task 2011; a complete unprocessed version appears to be present in the genia-event-2011 dataset as well.
For TEES licensing information, please refer to the TEES project. For GENIA dataset licensing information, please refer to the file "GE11-LICENSE" provided beside the data files (.csv) in this Kaggle dataset.
Photo Credits: Louis Reed on Unsplash
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a copy of the TBGA dataset described above, taken from https://zenodo.org/records/5911097. To cite all versions, use the DOI 10.5281/zenodo.5911096; this DOI represents all versions and will always resolve to the latest one.
A collection of 1200 texts (292,173 tokens) about clinical trial studies and clinical trial announcements in Spanish:
- 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.
Texts were annotated with the following entity types:
- Semantic groups from the Unified Medical Language System: ANAT (anatomy), CHEM (pharmacological and chemical substances), DEVI (medical devices), DISO (pathologic conditions), LIVB (living beings, including the human being), PHYS (physiological processes), PROC (lab tests, diagnostic or therapeutic procedures)
- Medical drug information: Contraindicated (a contraindicated drug or treatment), Dose (dose or strength), Form (dosage form), Route (administration route or mode)
- Temporal expressions: Age, Date, Duration, Frequency, Time
- Miscellaneous medical entities: Concept (abstract concepts, statistical tests or measurement scales), Food (foods or drinks), Observation (medical observations or clinical findings), Quantifier_or_Qualifier (quantifier or qualifier adjective), Result_or_Value (result or value of a measurement, laboratory analysis or procedure)
- Negation/Speculation: Neg_cue (negation cue), Negated (negated event), Spec_cue (speculation cue), Speculated (speculated or uncertain event)
- Attributes: Temporality (History_of: past event; Future: future event), Experiencer (Patient: patient or participant in a clinical trial; Family_member; Other: other person different from the patient or the family member)
86,389 entities and 16,590 attributes were annotated. 10% of the corpus was doubly annotated, and high inter-annotator agreement (IAA) values were achieved: F1-score = 0.84 for entities and F1-score = 0.88 for attributes (both in strict match).
The dataset includes the texts and annotations used for the human evaluation of the medical named entity tool:
- 100 clinical trial announcements from EudraCT not used for system development: we provide the files of the version revised by medical professionals (Reference folder).
- 100 clinical cases with a Creative Commons license: we provide the files revised by medical professionals (Reference folder). These data come from: Urgencias Bidasoa (https://urgenciasbidasoa.wordpress.com/casos-clinicos-3/), Hipocampo.org (https://www.hipocampo.org/), and cases published by the Sociedad Andaluza de Medicina Familiar y Comunitaria (SAMFyC): we are greatly thankful for permission to use these cases and acknowledge that the copyright of the contents belongs to the authors. Clinical cases were extracted from books published from 2016 to 2022 (https://www.samfyc.es/tipos-publicacion/publicaciones/). If you use these data, please acknowledge the copyright and intellectual property rights of the authors' contents.
The dataset is freely distributed for research and educational purposes under a Creative Commons Attribution Non-Commercial (CC BY-NC) license. If you use the CT-EBM-SP vs. 2 dataset, please cite as follows: Campillos-Llanos, L., A. Valverde-Mateos, A. Capllonch-Carrión (2024). Hybrid natural language processing tool for semantic annotation of medical texts in Spanish. BMC Bioinformatics.
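As a rough sketch of how such annotations are often consumed, the snippet below tallies entity labels, assuming the corpus is distributed as BRAT-style standoff .ann files; that format, the folder name, and the line layout are assumptions about the distribution, so adapt this to the actual corpus files.

```python
from collections import Counter
from pathlib import Path

# Tally entity labels (ANAT, CHEM, DISO, ...) across BRAT-style .ann files.
# ASSUMPTION: entity lines look like "T1\tDISO 10 25\tneumonía"; verify
# against the actual corpus files before relying on this.
label_counts = Counter()
for ann_file in Path("corpus").glob("**/*.ann"):
    for line in ann_file.read_text(encoding="utf-8").splitlines():
        if line.startswith("T"):  # "T" lines are text-bound entity annotations
            label = line.split("\t")[1].split()[0]
            label_counts[label] += 1

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```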