11 datasets found
  1. Profile HMM marker sets

    • figshare.com
    application/x-gzip
    Updated Oct 14, 2022
    Cite
    Josh Espinoza (2022). Profile HMM marker sets [Dataset]. http://doi.org/10.6084/m9.figshare.19616016.v1
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Oct 14, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Josh Espinoza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • Archaea_76.hmm - (Anvi'o) Lee, https://doi.org/10.1093/bioinformatics/btz188 (https://github.com/merenlab/anvio/tree/master/anvio/data/hmm/Archaea_76)
    • Bacteria_71.hmm - (Anvi'o) Lee modified, https://doi.org/10.1093/bioinformatics/btz188 (https://github.com/merenlab/anvio/tree/master/anvio/data/hmm/Bacteria_71)
    • Protista_83.hmm - (Anvi'o) Delmont, http://merenlab.org/delmont-euk-scgs (https://github.com/merenlab/anvio/tree/master/anvio/data/hmm/Protista_83)
    • Fungi_593.hmm - (FGMP) https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2782-9
    • CPR_43.hmm - (CheckM) https://github.com/Ecogenomics/CheckM/tree/master/custom_marker_sets
    • eukaryota_odb10 - (BUSCO) https://busco-data.ezlab.org/v5/data/lineages/eukaryota_odb10.2020-09-10.tar.gz

    Citation:

    • Espinoza, J.L., Dupont, C.L. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics 23, 419 (2022). https://doi.org/10.1186/s12859-022-04973-8
    • Espinoza, Josh (2022): Profile HMM marker sets. figshare. Dataset. https://doi.org/10.6084/m9.figshare.19616016.v1
    • Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO. 2015. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3:e1319 https://doi.org/10.7717/peerj.1319
    • Cissé, O.H., Stajich, J.E. FGMP: assessing fungal genome completeness. BMC Bioinformatics 20, 184 (2019). https://doi.org/10.1186/s12859-019-2782-9
    • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114.
    • Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021 Sep 27;38(10):4647-4654. doi: 10.1093/molbev/msab199.

  2. Data from: TBGA: A Large-Scale Gene-Disease Association Dataset for...

    • data.europa.eu
    unknown
    Updated Jan 27, 2022
    Cite
    Zenodo (2022). TBGA: A Large-Scale Gene-Disease Association Dataset for Biomedical Relation Extraction [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5911097?locale=el
    Explore at:
    Available download formats: unknown (24338684)
    Dataset updated
    Jan 27, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the TBGA dataset. TBGA is a large-scale, semi-automatically annotated dataset for Gene-Disease Association (GDA) extraction. The dataset consists of three text files, corresponding to train, validation, and test sets, plus an additional JSON file containing the mapping between relation names and IDs. Each record in train, validation, or test files corresponds to a single GDA extracted from a sentence. Records are represented as JSON objects with the following structure:

    • text: sentence from which the GDA was extracted.
    • relation: relation name associated with the given GDA.
    • h: JSON object representing the gene entity, composed of:
        • id: NCBI Entrez ID associated with the gene entity.
        • name: NCBI official gene symbol associated with the gene entity.
        • pos: list consisting of starting position and length of the gene mention within text.
    • t: JSON object representing the disease entity, composed of:
        • id: UMLS CUI associated with the disease entity.
        • name: UMLS preferred term associated with the disease entity.
        • pos: list consisting of starting position and length of the disease mention within text.

    TBGA contains over 200,000 instances and 100,000 bags. The zip file consists of one folder, named TBGA, containing the files corresponding to the dataset.

    If you use or extend our work, please cite the following: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04646-6#citeas
    TBGA paper can be found at: https://rdcu.be/cKkY2
    TBGA code is available at: https://github.com/GDAMining/gda-extraction
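    The record layout described for TBGA can be illustrated with a minimal Python sketch. It assumes that each record is stored as one JSON object per line and that pos holds character offsets into text; neither detail is confirmed by the description. The sample record, its IDs, and the relation name are invented placeholders, and the helper names mention and parse_record are hypothetical.

```python
import json

# Invented placeholder record following the documented field names.
# The sentence, IDs, and relation name are NOT taken from TBGA.
text = "GENE1 variants were observed in patients with disease X."
raw_line = json.dumps({
    "text": text,
    "relation": "example_relation",
    "h": {"id": "0000", "name": "GENE1",
          "pos": [text.index("GENE1"), len("GENE1")]},
    "t": {"id": "C0000000", "name": "disease X",
          "pos": [text.index("disease X"), len("disease X")]},
})


def mention(sentence, pos):
    """Slice a mention out of the sentence from a [start, length] pair."""
    start, length = pos
    return sentence[start:start + length]


def parse_record(line):
    """Decode one record (assumed: one JSON object per line)."""
    record = json.loads(line)
    return {
        "relation": record["relation"],
        "gene_id": record["h"]["id"],
        "gene": mention(record["text"], record["h"]["pos"]),
        "disease_id": record["t"]["id"],
        "disease": mention(record["text"], record["t"]["pos"]),
    }


parsed = parse_record(raw_line)
print(parsed["gene"], "|", parsed["disease"], "|", parsed["relation"])
```

    If the files turn out to hold a single JSON array instead of one object per line, json.load on the whole file would replace the per-line decoding.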

  3. Paraphrased_Medical_QandA

    • huggingface.co
    Updated Apr 20, 2007
    Cite
    Advaith (2007). Paraphrased_Medical_QandA [Dataset]. https://huggingface.co/datasets/iceplasma/Paraphrased_Medical_QandA
    Dataset updated
    Apr 20, 2007
    Authors
    Advaith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The paraphrased data is referenced from:

    @ARTICLE{BenAbacha-BMC-2019,
      author  = {Asma {Ben Abacha} and Dina Demner{-}Fushman},
      title   = {A Question-Entailment Approach to Question Answering},
      journal = {{BMC} Bioinform.},
      volume  = {20},
      number  = {1},
      pages   = {511:1--511:23},
      year    = {2019},
      url     = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4}
    }

  4. GENIA Bio-medical event dataset

    • kaggle.com
    Updated Dec 5, 2020
    Cite
    Nishanth (2020). GENIA Bio-medical event dataset [Dataset]. https://www.kaggle.com/datasets/nishanthsalian/genia-biomedical-event-dataset
    Dataset updated
    Dec 5, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nishanth
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Context

    Bio-medical texts contain a great deal of information that can support developments in the medical field. Traditionally, domain experts extracted such information manually, so automating this task can help speed up progress in the field. To name a few use cases, bio-medical events show the effects of drugs on a person, and they can also be used to identify certain medical conditions. Automating the extraction of events from bio-medical texts is therefore very beneficial.

    Content

    The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES.

    It consists of the original bio-medical text, the labelled trigger words, the location of each trigger word in the text, and the event type associated with the trigger word. There are 3 sets of data: train (8k+ sentences), devel (about 3k sentences), and test (about 3k sentences). Each set has 4 columns, namely "Sentence", "TriggerWord", "TriggerWordLoc" and "EventType", capturing the original bio-medical text, the trigger words in the sentence, the location of the trigger words in the sentence, and the event type associated with the trigger words, respectively.
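    The four-column layout can be read with Python's standard csv module. The sketch below is an illustration under assumptions: the sample rows, the event labels, and the encoding of TriggerWordLoc as a single integer are invented, it assumes one trigger word per row, and count_event_types is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["Sentence", "TriggerWord", "TriggerWordLoc", "EventType"]

# Invented rows for illustration only; the real sentences, the location
# encoding, and the event labels may differ.
SAMPLE_CSV = """Sentence,TriggerWord,TriggerWordLoc,EventType
Example sentence one.,sentence,8,EventA
Example sentence two.,two,17,EventB
Example sentence three.,three,17,EventA
"""


def count_event_types(handle):
    """Count rows per event type after checking the expected header."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["EventType"] for row in reader)


counts = count_event_types(io.StringIO(SAMPLE_CSV))
print(counts.most_common())
```

    For the real files, the io.StringIO object would be replaced by an open file handle, for example open(path, newline="", encoding="utf-8"), where path points to one of the train, devel, or test CSV files.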

    Acknowledgements

    The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES. The original source dataset is from the BioNLP Shared Task 2011. A complete unprocessed version also seems to be present in the genia-event-2011 dataset.

    For TEES licensing information, please refer to this link. For GENIA dataset licensing information, please refer to the file "GE11-LICENSE" located beside the data files (.csv) in this Kaggle dataset.

    Photo Credits: Louis Reed on Unsplash

  5. Gaussian Finder's cavity dataset in XML

    • figshare.com
    zip
    Updated Sep 28, 2019
    Cite
    Abel Gomes (2019). Gaussian Finder's cavity dataset in XML [Dataset]. http://doi.org/10.6084/m9.figshare.9916733.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 28, 2019
    Dataset provided by
    figshare
    Authors
    Abel Gomes
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Gaussian Finder's cavity dataset in XML. This dataset describes the protein cavities output by a protein cavity detection method called Gaussian Finder. This method is described in the article available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1913-4

  6. Repositiry for the article: "Gene regulatory network inference using...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 25, 2023
    Cite
    Alain Mbebi (2023). Repositiry for the article: "Gene regulatory network inference using mixed-norms regularized multivariate model with covariance selection" by Alain Mbebi & Zoran Nikoloski [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7965948
    Dataset updated
    May 25, 2023
    Dataset provided by
    Alain Mbebi
    Zoran Nikoloski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for the manuscript "Gene regulatory network inference using mixed-norms regularized multivariate model with covariance selection" by Alain J. Mbebi & Zoran Nikoloski.

    Organisation

    The folder Codes contains the following R scripts with the K-folds cross-validation option to learn the hyperparameters:

    Mixed_L1L21_GRN.R which computes L1L21-solution

    Mixed_L1L21G_GRN.R which computes L1L21G-solution

    Mixed_L2L21_GRN.R which computes L2L21-solution

    Mixed_L2L21G_GRN.R which computes L2L21G-solution

    L1L21_Dream5_Scerevisiae_example_run.R is an example run using the L1L21-solution with S. cerevisiae data (Network 4 in the DREAM5 challenge). All files needed to successfully run "L1L21_Dream5_Scerevisiae_example_run" are located in the folder Codes.

    1. The folder Figures contains all figures in the manuscript.

    2. The folder Inferred-networks contains all network objects for each dataset and each inference methods in the comparative analysis.

    Dependencies and required packages

    The following packages are required for the contending approaches in the comparative analysis: "devtools", "foreach", "plyr", "glmnet" and "randomForest".

    GENIE3

    The GENIE3 package can be installed from: http://bioconductor.org/packages/release/bioc/html/GENIE3.html

    TIGRESS

    The TIGRESS repository can be obtained from: https://github.com/jpvert/tigress

    ENNET

    The ENNET repository can be obtained from: https://github.com/slawekj/ennet

    PLSNET

    The Matlab source code of PLSNET can be obtained from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1398-6#Sec17

    PORTIA

    The PORTIA repository can be obtained from: https://github.com/AntoinePassemiers/PORTIA

    D3GRN

    The Matlab source code of D3GRN can be obtained from: https://github.com/chenxofhit/D3GRN

    Fused-LASSO

    The fused-LASSO repository can be obtained from: https://github.com/omranian/inference-of-GRN-using-Fused-LASSO

    ANOVerence

    Because of some technical issues (e.g. the accessibility of the code at http://www2.bio.ifi.lmu.de/~kueffner/anova.tar.gz), we were not able to reproduce the ANOVerence results and used the inferred network from the DREAM5 challenge instead.

    Although the code here was tested on Fedora 29 (Workstation Edition) using R (version 4.2.2), it can run under any Linux distribution or Windows version, as long as all the required packages are compatible with the desired R version.

  7. PATH_SURVEYOR_ExampleUseCases

    • zenodo.org
    zip
    Updated Apr 7, 2024
    Cite
    Timothy Shaw; Timothy Shaw (2024). PATH_SURVEYOR_ExampleUseCases [Dataset]. http://doi.org/10.5281/zenodo.10937799
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Shaw; Timothy Shaw
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PATH SURVEYOR pipeline examples that were originally hosted on http://shawlab.science/shiny/PATH_SURVEYOR_ExampleUseCases/

    It was originally presented in our publication PMID: 37380943

    https://pubmed.ncbi.nlm.nih.gov/37380943/ and https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05393-y

    Please contact timothy.shaw@moffitt.org for any additional questions.

  8. Data set 1 - Proteome set description

    • figshare.com
    txt
    Updated Mar 20, 2021
    Cite
    Mick Van Vlierberghe; Denis BAURAIN (2021). Data set 1 - Proteome set description [Dataset]. http://doi.org/10.6084/m9.figshare.13113893.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mick Van Vlierberghe; Denis BAURAIN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

  9. Data from: Evaluation data for "GeFaST: An improved method for OTU...

    • pub.uni-bielefeld.de
    Updated Mar 22, 2021
    Cite
    Robert Müller; Markus Nebel (2021). Evaluation data for "GeFaST: An improved method for OTU assignment by generalising Swarm's fastidious clustering approach" [Dataset]. https://pub.uni-bielefeld.de/record/2918928
    Dataset updated
    Mar 22, 2021
    Authors
    Robert Müller; Markus Nebel
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Data sets and results of the comparative analyses of GeFaST performed in "GeFaST: An improved method for OTU assignment by generalising Swarm's fastidious clustering approach". The scripts for the analyses are available here.

    dereplicated.tar.bz2: Data sets used in the analyses of performance (ELDERMET [1]) and clustering quality (even & uneven [2], ELDERMET). The original data sets (see below) have been dereplicated, and sequences containing ambiguous bases (IUPAC code n or N) have been removed.

    The analysis of the clustering quality also requires the reference data set.

    eldermet_subsamples_X.tar.bz2: Each archive contains three random subsamples of ELDERMET of size X, with X being the percentage of sequences from eldermet_derep.fasta (in dereplicated.tar.bz2) in the subsample.

    uneven_subsamples_80.tar.bz2: Archive containing five random subsamples of uneven, each containing 80 % of the sequences from uneven_derep.fasta (in dereplicated.tar.bz2).

    even_subsamples_80.tar.bz2: Archive containing five random subsamples of even, each containing 80 % of the sequences from even_derep.fasta (in dereplicated.tar.bz2).

    eldermet_reduced_subsamples_80.tar.bz2: Archive containing the reduced ELDERMET data set and five random subsamples of it, each containing 80 % of the sequences from eldermet_derep.reduced.fasta, plus the corresponding taxonomic assignments.

    results.tar.bz2: Result files containing the performance and clustering-quality measurements.

    • eldermet-performance-measurements.csv: runtime and memory consumption for different thresholds
    • eldermet-subsampling-measurements.csv: runtime and memory consumption for different data set sizes
    • eldermet-sub-fixed-red-log.csv: runtime and memory consumption for different thresholds (on subsamples of reduced data set)
    • eldermet-sub-fixed-red-metrics.csv: clustering quality for different thresholds (on subsamples of reduced data set)
    • even_0.95-metrics.csv: clustering quality for different thresholds, 95 % ground truth
    • even_0.97-metrics.csv: clustering quality for different thresholds, 97 % ground truth
    • even_0.99-metrics.csv: clustering quality for different thresholds, 99 % ground truth
    • uneven_0.95-metrics.csv: clustering quality for different thresholds, 95 % ground truth
    • uneven_0.97-metrics.csv: clustering quality for different thresholds, 97 % ground truth
    • uneven_0.99-metrics.csv: clustering quality for different thresholds, 99 % ground truth
    • uneven-sub-fixed-metrics.csv: clustering quality for different thresholds (on subsamples)
    • even-sub-fixed-metrics.csv: clustering quality for different thresholds (on subsamples)
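
    The preprocessing described for dereplicated.tar.bz2 and the subsample archives (dereplication, removal of sequences with ambiguous bases, and drawing a random 80 % subsample) can be sketched in Python as follows. This is an illustration only, not the authors' scripts: the function names, the toy sequences, the case-insensitive comparison, and the fixed random seed are all assumptions.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Keep one entry per distinct sequence together with its abundance.

    Sequences containing the ambiguous base N (either case) are dropped.
    Comparing sequences case-insensitively is an assumption of this sketch.
    """
    counts, first_header = {}, {}
    for header, seq in records:
        seq = seq.upper()
        if "N" in seq:
            continue
        if seq not in counts:
            counts[seq] = 0
            first_header[seq] = header
        counts[seq] += 1
    return [(first_header[s], s, n) for s, n in counts.items()]


def subsample(items, fraction, seed):
    """Draw a random subsample holding the given fraction of the items."""
    rng = random.Random(seed)
    return rng.sample(items, round(len(items) * fraction))


# Toy input: "b" duplicates "a" (ignoring case) and "c" contains N.
toy_fasta = [">a", "ACGT", ">b", "acgt", ">c", "ACNT",
             ">d", "GGCC", ">e", "TTAA", ">f", "CCGG", ">g", "AATT"]
derep = dereplicate(read_fasta(toy_fasta))
sample = subsample(derep, 0.8, seed=0)
print(len(derep), len(sample))
```

    Fixing the seed makes a subsample reproducible; drawing several subsamples, as in the archives above, would use a different seed for each draw.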

    References:

    [1] Claesson M.J., Cusack S., O'Sullivan O., Greene-Diniz R., de Weerd H., Flannery E., Marchesi J.R., Falush D., Dinan T., Fitzgerald G., Stanton C., van Sinderen D., O'Connor M., Harnedy N., O'Connor K., Henry C., O'Mahony D., Fitzgerald A.P., Shanahan F., Twomey C., Hill C., Ross R.P., O'Toole P.W.: Composition, variability, and temporal stability of the intestinal microbiota of the elderly. Proceedings of the National Academy of Sciences 108 (Supplement 1), 4586-4591 (2011). doi: 10.1073/pnas.1000097107

    [2] Mahé F., Rognes T., Quince C., de Vargas C., Dunthorn M.: Swarm: robust and fast clustering method for amplicon-based studies. PeerJ 2, 593 (2014). doi: 10.7717/peerj.593

  10. KVFinder's cavity dataset in CSV

    • figshare.com
    zip
    Updated Sep 28, 2019
    Cite
    Abel Gomes (2019). KVFinder's cavity dataset in CSV [Dataset]. http://doi.org/10.6084/m9.figshare.9917012.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 28, 2019
    Dataset provided by
    figshare
    Authors
    Abel Gomes
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    KVFinder's cavity dataset in CSV. This dataset describes the protein cavities output by a protein cavity detection method called KVFinder. This method is described in the article available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-197

  11. medical_qa

    • huggingface.co
    Updated Apr 15, 2024
    Cite
    Massive Text Embedding Benchmark (2024). medical_qa [Dataset]. https://huggingface.co/datasets/mteb/medical_qa
    Dataset updated
    Apr 15, 2024
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    MedicalQARetrieval: an MTEB (Massive Text Embedding Benchmark) dataset

    The dataset consists of 2048 medical question and answer pairs.

    Task category: t2t

    Domains: Medical, Written

    Reference: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4

    How to evaluate on this task

    You can evaluate an embedding model on this dataset using the following code:

        import mteb

        task = mteb.get_tasks(["MedicalQARetrieval"])
        evaluator = mteb.MTEB(task)
        …

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/medical_qa.
