MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was scraped from the MedQuAD repository and converted to a CSV file. It contains only the questions and answers related to cancer.
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
We included additional annotations in the XML files that can be used for diverse IR and NLP tasks, such as the question type, the question focus, its synonyms, its UMLS Concept Unique Identifier (CUI), and its Semantic Type. We added the category of the question focus (Disease, Drug or Other) in the four MedlinePlus collections. All other collections are about diseases.
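As a minimal sketch of how these XML annotations might be read, the snippet below walks one document with Python's standard library. The file name and the tag and attribute names (Focus, QAPair, Question/@qtype, Answer) are assumptions inferred from the description above, not the confirmed schema; verify them against the actual MedQuAD XML files.

```python
import xml.etree.ElementTree as ET

# Minimal sketch, not the official loader: tag/attribute names are assumptions.
tree = ET.parse("example_medquad_document.xml")  # hypothetical file name
root = tree.getroot()

focus = root.findtext("Focus")
for qapair in root.iter("QAPair"):
    question = qapair.find("Question")
    answer = qapair.findtext("Answer")
    if question is not None:
        print(f"[{question.get('qtype')}] focus={focus}: {question.text}")
        print(answer)
```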
The paper cited below describes the collection, the construction method as well as its use and evaluation within a medical question answering system.
N.B. We removed the answers from 3 subsets to respect the MedlinePlus copyright (https://medlineplus.gov/copyright.html):
(1) A.D.A.M. Medical Encyclopedia, (2) MedlinePlus Drug information, and (3) MedlinePlus Herbal medicine and supplement information.
We kept all other information, including the URLs, in case you want to crawl the answers. Please contact me if you have any questions.
We used the test questions of the TREC-2017 LiveQA medical task: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017/tree/master/TestDataset.
As described in our BMC paper, we manually judged the answers retrieved by the IR and QA systems from the MedQuAD collection, using the same judgment scores as the LiveQA Track: 1-Incorrect, 2-Related, 3-Incomplete, and 4-Excellent. Format of the qrels file: Question_ID judgment Answer_ID
The QA test collection contains 2,479 judged answers that can be used to evaluate the performance of IR & QA systems on the LiveQA-Med test questions: https://github.com/abachaa/MedQuAD/blob/master/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip
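A minimal sketch of loading this qrels file in Python, assuming whitespace-separated fields in the order given above (the file name is hypothetical; use the actual file from the zip linked above):

```python
from collections import defaultdict

# question_id -> {answer_id: judgment}; judgments: 1=Incorrect, 2=Related,
# 3=Incomplete, 4=Excellent. The file name is hypothetical.
judgments = defaultdict(dict)
with open("qrels-liveqa-med.txt") as f:
    for line in f:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        question_id, score, answer_id = fields
        judgments[question_id][answer_id] = int(score)

# Example: fraction of judged answers rated Excellent, per question
for qid, answers in judgments.items():
    excellent = sum(1 for s in answers.values() if s == 4)
    print(qid, f"{excellent}/{len(answers)} excellent")
```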
If you use the MedQuAD dataset and/or the collection of 2,479 judged answers, please cite the following paper: "A Question-Entailment Approach to Question Answering". Asma Ben Abacha and Dina Demner-Fushman. BMC Bioinformatics, 2019.
@ARTICLE{BenAbacha-BMC-2019,
author = {Asma {Ben Abacha} and Dina Demner{-}Fushman},
title = {A Question-Entailment Approach to Question Answering},
journal = {{BMC} Bioinform.},
volume = {20},
number = {1},
pages = {511:1--511:23},
year = {2019},
url = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4}
}
License: The MedQuAD dataset is published under a Creative Commons Attribution 4.0 International License (CC BY 4.0). https://creativecommons.org/licenses/by/4.0/
Feature data for the 2,019 training and test cells. The structure of the data table is as follows. The first column contains unique cell identifiers. The first row contains headers, including:
"Case" – indicates whether the cell was randomly assigned to the training or test set
"Class" – the true class of the cell by human review: poorly segmented (PS) or well-segmented (WS)
The remaining 116 columns contain the feature data for each cell. A total of ~190 features were generated by the Cellomics Morphology Explorer software, but only the subset of 116 features with complete data for every cell was included in our analysis. (CSV, 2 MB)
SOURCE: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-340
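A minimal sketch of splitting this table back into its training and test sets, assuming a plain CSV with the headers described above and "Train"/"Test" values in the Case column (both assumptions; the file name is hypothetical):

```python
import csv

train_rows, test_rows = [], []
with open("cell_features.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        # "Case" assigns each cell to the training or test set
        (train_rows if row["Case"] == "Train" else test_rows).append(row)

print(len(train_rows), "training cells,", len(test_rows), "test cells")
# "Class" holds the human-reviewed label: PS (poorly segmented) or WS (well-segmented)
print("training classes:", {row["Class"] for row in train_rows})
```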
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the TBGA dataset. TBGA is a large-scale, semi-automatically annotated dataset for Gene-Disease Association (GDA) extraction. The dataset consists of three text files, corresponding to train, validation, and test sets, plus an additional JSON file containing the mapping between relation names and IDs. Each record in train, validation, or test files corresponds to a single GDA extracted from a sentence. Records are represented as JSON objects with the following structure:
text: sentence from which the GDA was extracted.
relation: relation name associated with the given GDA.
h: JSON object representing the gene entity, composed of:
id: NCBI Entrez ID associated with the gene entity.
name: NCBI official gene symbol associated with the gene entity.
pos: list consisting of starting position and length of the gene mention within text.
t: JSON object representing the disease entity, composed of:
id: UMLS CUI associated with the disease entity.
name: UMLS preferred term associated with the disease entity.
pos: list consisting of starting position and length of the disease mention within text.
TBGA contains over 200,000 instances and 100,000 bags. The zip file consists of one folder, named TBGA, containing the files corresponding to the dataset.
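A minimal sketch of reading these records in Python, assuming one JSON object per line in each text file (an assumption about the layout; the file name inside the TBGA folder is hypothetical):

```python
import json

with open("TBGA/train.txt") as f:  # hypothetical file name inside the TBGA folder
    for line in f:
        record = json.loads(line)  # assumes one JSON object per line
        gene, disease = record["h"], record["t"]
        start, length = gene["pos"]
        # Recover the gene mention from its position within the sentence
        mention = record["text"][start:start + length]
        print(f"{record['relation']}: {gene['name']} ({mention}) -> {disease['name']}")
```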
If you use or extend our work, please cite the following: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04646-6#citeas
The TBGA paper can be found at: https://rdcu.be/cKkY2
The TBGA code is available at: https://github.com/GDAMining/gda-extraction
GNU GPL 3.0: https://www.gnu.org/licenses/gpl-3.0.html
KVFinder's cavity dataset in CSV. This dataset describes the protein cavities output by KVFinder, a protein cavity detection method. The method is described in the article available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-197
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the manuscript "Gene regulatory network inference using mixed-norms regularized multivariate model with covariance selection" by Alain J. Mbebi & Zoran Nikoloski.
Organisation
The folder Codes contains the following R scripts with the K-fold cross-validation option to learn the hyperparameters:
Mixed_L1L21_GRN.R which computes L1L21-solution
Mixed_L1L21G_GRN.R which computes L1L21G-solution
Mixed_L2L21_GRN.R which computes L2L21-solution
Mixed_L2L21G_GRN.R which computes L2L21G-solution
L1L21_Dream5_Scerevisiae_example_run.R is an example run using the L1L21-solution with S. cerevisiae data (Network 4 in the DREAM5 challenge). All files needed to successfully run L1L21_Dream5_Scerevisiae_example_run.R are located in the folder Codes.
The folder Figures contains all figures in the manuscript.
The folder Inferred-networks contains all network objects for each dataset and each inference method in the comparative analysis.
Dependencies and required packages
The following packages are required for the contending approaches in the comparative analysis: "devtools", "foreach", "plyr", "glmnet" and "randomForest".
GENIE3
The GENIE3 package can be installed from: http://bioconductor.org/packages/release/bioc/html/GENIE3.html
TIGRESS
The TIGRESS repository can be obtained from: https://github.com/jpvert/tigress
ENNET
The ENNET repository can be obtained from: https://github.com/slawekj/ennet
PLSNET
The Matlab source code of PLSNET can be obtained from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1398-6#Sec17
PORTIA
The PORTIA repository can be obtained from: https://github.com/AntoinePassemiers/PORTIA
D3GRN
The Matlab source code of D3GRN can be obtained from: https://github.com/chenxofhit/D3GRN
Fused-LASSO
The fused-LASSO repository can be obtained from: https://github.com/omranian/inference-of-GRN-using-Fused-LASSO
ANOVerence
Because of technical issues (e.g., the code's accessibility at http://www2.bio.ifi.lmu.de/~kueffner/anova.tar.gz), we were not able to reproduce the ANOVerence results and used the inferred network from the DREAM5 challenge instead.
Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across...
Data from: Phylogenomic branch length estimation using quartets
This repository contains the datasets used in the following paper:
Y. Tabatabaee, C. Zhang, T. Warnow, S. Mirarab, Phylogenomic branch length estimation using quartets, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i185–i193, https://doi.org/10.1093/bioinformatics/btad221
For experiments in this study, we studied a collection of simulated and biological datasets with incomplete lineage sorting (ILS). We generated a new quartet dataset and regenerated species trees with substitution-unit branch lengths for previously published datasets from Zhang et al. (2018) and Mai et al. (2017). We also analyzed the mammalian biological dataset from Song et al. (2012) (https://www.pnas.org/doi/full/10.1073...).
GNU AGPL 3.0: http://www.gnu.org/licenses/agpl-3.0.html
Biomedical texts contain a lot of information that can be used for developments in the medical field. Traditionally, domain experts extracted such information manually; automating this extraction task can help speed up progress in the field. To name a few use cases, biomedical events can show the effects of drugs on a person and can be used to identify certain medical conditions. Hence, automating the extraction of events from biomedical texts is very beneficial.
The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES.
It consists of the original biomedical text, labelled trigger words, the location of each trigger word in the text, and the event type associated with the trigger word. There are three splits: train (8k+ sentences), devel (about 3k sentences), and test (about 3k sentences). Each split has four columns, "Sentence", "TriggerWord", "TriggerWordLoc" and "EventType", capturing the original biomedical text, the trigger words in the sentence, the location of the trigger words in the sentence, and the event type associated with the trigger words, respectively.
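A minimal sketch of loading one split and tallying event types, assuming standard CSV files named after the splits (the file name is hypothetical):

```python
import csv
from collections import Counter

event_types = Counter()
with open("train.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        # Columns per the description: Sentence, TriggerWord, TriggerWordLoc, EventType
        event_types[row["EventType"]] += 1

for event_type, count in event_types.most_common():
    print(f"{event_type}: {count}")
```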
The original source dataset is from the BioNLP Shared Task 2011; a complete unprocessed version appears to be present in the genia-event-2011 dataset as well.
For TEES licensing information, please refer to the TEES project. For GENIA dataset licensing information, please refer to the file "GE11-LICENSE" provided beside the data files (.csv) in this Kaggle dataset.
Photo Credits: Louis Reed on Unsplash
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a copy of the TBGA dataset described above, taken from https://zenodo.org/records/5911097. To cite all versions, use the DOI 10.5281/zenodo.5911096; this DOI represents all versions and will always resolve to the latest one.
A collection of 1200 texts (292,173 tokens) about clinical trial studies and clinical trial announcements in Spanish:
- 500 abstracts from journals published under a Creative Commons license, e.g. available in PubMed or the Scientific Electronic Library Online (SciELO).
- 700 clinical trial announcements published in the European Clinical Trials Register and the Repositorio Español de Estudios Clínicos.
Texts were annotated with the following entity types:
- Semantic groups from the Unified Medical Language System: ANAT (anatomy), CHEM (pharmacological and chemical substances), DEVI (medical devices), DISO (pathologic conditions), LIVB (living beings, including the human being), PHYS (physiological processes), PROC (lab tests, diagnostic or therapeutic procedures)
- Medical drug information: Contraindicated (a contraindicated drug or treatment), Dose (dose or strength), Form (dosage form), Route (administration route or mode)
- Temporal expressions: Age, Date, Duration, Frequency, Time
- Miscellaneous medical entities: Concept (abstract concepts, statistical tests or measurement scales), Food (foods or drinks), Observation (medical observations or clinical findings), Quantifier_or_Qualifier (quantifier or qualifier adjective), Result_or_Value (result or value of a measurement, laboratory analysis or procedure)
- Negation/Speculation: Neg_cue (negation cue), Negated (negated event), Spec_cue (speculation cue), Speculated (speculated or uncertain event)
- Attributes: Temporality (History_of: past event; Future: future event), Experiencer (Patient: patient or participant in a clinical trial; Family_member; Other: other person different from the patient or the family member)
86,389 entities and 16,590 attributes were annotated. 10% of the corpus was doubly annotated, and high inter-annotator agreement (IAA) values were achieved: F1-score = 0.84 for entities and F1-score = 0.88 for attributes (both in strict match).
The dataset includes the texts and annotations used for the human evaluation of the medical named entity tool:
- 100 clinical trial announcements from EudraCT not used for system development: we provide the files of the version revised by medical professionals (Reference folder).
- 100 clinical cases with a Creative Commons license: we provide the files revised by medical professionals (Reference folder). These data come from: Urgencias Bidasoa (https://urgenciasbidasoa.wordpress.com/casos-clinicos-3/), Hipocampo.org (https://www.hipocampo.org/), and cases published by the Sociedad Andaluza de Medicina Familiar y Comunitaria (SAMFyC): we are greatly thankful for permission to use these cases and acknowledge that the copyright of the contents belongs to the authors. Clinical cases were extracted from books published from 2016 to 2022 (https://www.samfyc.es/tipos-publicacion/publicaciones/). If you use these data, please acknowledge the copyright and intellectual property rights of the authors' contents.
The dataset is freely distributed for research and educational purposes under a Creative Commons Attribution Non-Commercial (CC BY-NC) license. If you use the CT-EBM-SP vs. 2 dataset, please cite as follows: Campillos-Llanos, L., A. Valverde-Mateos, A. Capllonch-Carrión (2024). Hybrid natural language processing tool for semantic annotation of medical texts in Spanish. BMC Bioinformatics.
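As a rough sketch of how such annotations are often consumed, the snippet below tallies entity labels, assuming the corpus is distributed as BRAT-style standoff .ann files; that format, the folder name, and the line layout are assumptions about the distribution, so adapt this to the actual corpus files.

```python
from collections import Counter
from pathlib import Path

# Tally entity labels (ANAT, CHEM, DISO, ...) across BRAT-style .ann files.
# ASSUMPTION: entity lines look like "T1\tDISO 10 25\tneumonía"; verify
# against the actual corpus files before relying on this.
label_counts = Counter()
for ann_file in Path("corpus").glob("**/*.ann"):
    for line in ann_file.read_text(encoding="utf-8").splitlines():
        if line.startswith("T"):  # "T" lines are text-bound entity annotations
            label = line.split("\t")[1].split()[0]
            label_counts[label] += 1

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```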