47 datasets found
  1. Data from: Functional annotation for 15 diverse arthropod genomes

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    application/x-gzip
    Updated Nov 22, 2025
    Cite
    Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy (2025). Functional annotation for 15 diverse arthropod genomes [Dataset]. http://doi.org/10.15482/USDA.ADC/1522860
    Dataset provided by
    Ag Data Commons
    Authors
    Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We present the annotation results of 15 arthropod proteomes using an open source, open access and containerized pipeline for genome-scale functional annotation of insect proteomes, and apply it to a diverse range of arthropod species. You can find more information about the pipeline at our readthedocs site. The files for each genome include GOanna, InterProScan and KOBAS predictions.

    Arthropod genomes selected for this study:

    • Apis mellifera (honey bee)
    • Drosophila melanogaster (fruit fly)
    • Tribolium castaneum (red flour beetle)
    • Latrodectus hesperus (Western black widow spider)
    • Limnephilus lunatus (caddisfly)
    • Oncopeltus fasciatus (Large milkweed bug)
    • Homalodisca vitripennis (Glassy-winged sharpshooter)
    • Eurytemora affinis (calanoid copepod)
    • Agrilus planipennis (emerald ash borer)
    • Copidosoma floridanum (parasitoid wasp)
    • Athalia rosae (turnip sawfly)
    • Ceratitis capitata (Mediterranean fruit fly)
    • Cimex lectularius (Cimicidae bed bug)
    • Varroa destructor (parasitic mite)
    • Diaphorina citri (Asian citrus psyllid)

    Resources in this dataset:

    • Cimex lectularius (Cimicidae bed bug) annotation. File name: CLEC.tar.gz. Functional annotation for Clec-OGSv1.2 protein set.
    • Tribolium castaneum (red flour beetle) annotation. File name: TCAS.tar.gz. Functional annotation for TCAS_OGS_v3 protein set.
    • Drosophila melanogaster (fruit fly) annotation. File name: DMEL.tar.gz. Functional annotation for DMEL_r6.38 protein set.
    • Varroa destructor (parasitic mite) annotation. File name: VDES.tar.gz. Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1).
    • Oncopeltus fasciatus (Large milkweed bug) annotation. File name: ONCFAS.tar.gz. Functional annotation for oncfas_OGSv1.2 protein set.
    • Apis mellifera (honey bee) annotation. File name: AMEL.tar.gz. Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1).
    • Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File name: HVIT.tar.gz. Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1).
    • Limnephilus lunatus (caddisfly) annotation. File name: LLUN.tar.gz. Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1).
    • Latrodectus hesperus (Western black widow spider) annotation. File name: LHES.tar.gz. Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1).
    • Eurytemora affinis (calanoid copepod) annotation. File name: EAFF.tar.gz. Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1).
    • Copidosoma floridanum (parasitoid wasp) annotation. File name: CFLO.tar.gz. Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1).
    • Ceratitis capitata (Mediterranean fruit fly) annotation. File name: CCAP.tar.gz. Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2).
    • Athalia rosae (turnip sawfly) annotation. File name: AROS.tar.gz. Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1).
    • Agrilus planipennis (emerald ash borer) annotation. File name: APLA.tar.gz. Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1).
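
    Each resource above is distributed as a .tar.gz archive. As a minimal sketch of how such an archive could be inspected before unpacking, the Python snippet below lists the regular files inside a gzip-compressed tar archive. The internal layout of the real archives is not described in this listing, so the member names used in the demo (CLEC/goanna.tsv, CLEC/interproscan.tsv) are hypothetical placeholders, and the demo archive is built in memory rather than downloaded.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return the names of regular files inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def build_demo_archive():
    """Build a tiny in-memory .tar.gz so the sketch runs without a download.

    The member names are hypothetical placeholders, not the real contents
    of CLEC.tar.gz or any other archive in the dataset.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("CLEC/goanna.tsv", "CLEC/interproscan.tsv"):
            payload = b"demo\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for member_name in list_archive_members(build_demo_archive()):
        print(member_name)
```

    For a file that has already been downloaded, the same function can be given an open binary handle, for example open("CLEC.tar.gz", "rb").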

  2. Supplementary materials for "Robustness analysis of metabolic predictions in...

    • zenodo.org
    • data-staging.niaid.nih.gov
    xml, zip
    Updated Apr 1, 2021
    Cite
    Elham Karimi; Enora Geslain; Arnaud Belcour; Clémence Frioux; Méziane Aïte; Anne Siegel; Erwan Corre; Simon M. Dittami (2021). Supplementary materials for "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines" [Dataset]. http://doi.org/10.5281/zenodo.4436003
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elham Karimi; Enora Geslain; Arnaud Belcour; Clémence Frioux; Méziane Aïte; Anne Siegel; Erwan Corre; Simon M. Dittami
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Supplementary materials for the revised version of the PeerJ preprint "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines".

  3. Genome Annotation of the Biting Midges Culicoides sonorensis and Culicoides...

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Nov 17, 2025
    Cite
    Gunjan Pandey; Debbie Eagles; Stacey Lynch; Prasad Paradkar; Tom Walsh; Rahul Rane; Leon Court; Melissa Klein; Asif Ahmed; Khandaker Asif Ahmed (2025). Genome Annotation of the Biting Midges Culicoides sonorensis and Culicoides stellifer Generated Using the EGAPx-alpha Pipeline [Dataset]. http://doi.org/10.25919/1TF9-BN03
    Dataset provided by
    CSIRO (https://www.csiro.au/)
    Authors
    Gunjan Pandey; Debbie Eagles; Stacey Lynch; Prasad Paradkar; Tom Walsh; Rahul Rane; Leon Court; Melissa Klein; Asif Ahmed; Khandaker Asif Ahmed
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Sep 10, 2022 - Oct 10, 2025
    Description

    This collection contains gene annotation datasets for Culicoides sonorensis and Culicoides stellifer (Diptera: Ceratopogonidae), two biting midge species of veterinary significance. Culicoides sonorensis is a confirmed vector of Bluetongue and Epizootic Hemorrhagic Disease viruses in North America, while C. stellifer has been implicated in Orbivirus transmission in the southeastern United States. The reference genome assemblies for these species are publicly available at NCBI under accession numbers GCA_047716325.1 (C. sonorensis) and GCA_040583785.1 (C. stellifer).

    Gene models for both assemblies were predicted using the EGAPx-alpha pipeline (egapx:0.3.2-alpha). The same C. sonorensis RNA-Seq dataset (ERR2171964 – ERR2171978) served as transcriptomic evidence for model training in both species, as no C. stellifer RNA-Seq data are currently available (last checked 17/11/2025). The dataset includes gene model coordinates (GFF), transcript nucleotide sequences (FNA), and predicted protein sequences (FAA).

    These curated annotation datasets were generated to support the forthcoming manuscript: Chromosome-scale genome of Culicoides brevitarsis highlights genetic basis of vector competency. They provide consistent annotation resources for cross-species comparative analyses among Culicoides midges.

    Lineage:

    Culicoides sonorensis:

    • Genome assembly: GCA_047716325.1 (idCulSono.KS.ABADRU.1.0.female)
    • Isolate: Kansas colony
    • WGS project: JBLLJK01
    • Submitter: Ag100Pest Initiative (USDA-ARS ABADRU)
    • Release date: 12 Feb 2025
    • Annotation pipeline: EGAPx-alpha (egapx:0.3.2-alpha)
    • Transcript evidence: C. sonorensis RNA-Seq ERR2171964 – ERR2171978

    Culicoides stellifer:

    • Genome assembly: GCA_040583785.1 (c_stellifer_primary030_purged)
    • Submitter: University of Guelph
    • Release date: 10 Jul 2024
    • Annotation pipeline: EGAPx-alpha (egapx:0.3.2-alpha)
    • Transcript evidence: Shared C. sonorensis RNA-Seq dataset (ERR2171964 – ERR2171978) used for training and gene structure validation

  4. Data from: A chromosome-scale high-contiguity genome assembly of the...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jan 18, 2023
    Cite
    Sven Winter; René Meißner; Carola Greve; Alexander Ben Hamadou; Petr Horin; Stefan Prost; Pamela A. Burger (2023). A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus) [Dataset]. http://doi.org/10.5061/dryad.xksn02vkr
    Dataset provided by
    University of Oulu
    University of Veterinary Medicine Vienna
    LOEWE Centre for Translational Biodiversity Genomics
    University of Veterinary Science Brno
    Authors
    Sven Winter; René Meißner; Carola Greve; Alexander Ben Hamadou; Petr Horin; Stefan Prost; Pamela A. Burger
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The cheetah (Acinonyx jubatus, SCHREBER 1775) is a large felid and is considered the fastest land animal. Historically, it inhabited open grassland across Africa, the Arabian Peninsula, and southwestern Asia; however, only small and fragmented populations remain today. Here, we present a de novo genome assembly of the cheetah based on PacBio continuous long reads and Hi-C proximity ligation data. The final assembly (VMU_Ajub_asm_v1.0) has a total length of 2.38 Gb, of which 99.7% is anchored into the expected 19 chromosome-scale scaffolds. The contig and scaffold N50 values of 96.8 Mb and 144.4 Mb, respectively, a BUSCO completeness of 95.4% and a k-mer completeness of 98.4% emphasize the high quality of the assembly. Furthermore, annotation of the assembly identified 23,622 genes and a repeat content of 40.4%. This new highly contiguous and chromosome-scale assembly will greatly benefit conservation and evolutionary genomic analyses and will be a valuable resource, e.g., to gain a detailed understanding of the function and diversity of immune response genes in felids.

    Methods

    The presented data is related to the eponymous publication "A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus)" soon to be published in the Journal of Heredity. Any questions regarding this dataset or the publication can be addressed to the corresponding authors, Sven Winter (sven.winter@vetmeduni.ac.at) and Pamela Burger (pamela.burger@vetmeduni.ac.at).

    Assembly: The assembly was generated from one PacBio CLR library sequenced on one SMRTCell on a Sequel IIe using Flye v. 2.9, including one iteration of long-read polishing followed by one iteration of short-read polishing with pilon v.1.23 using trimmed standard Illumina short reads generated on the Illumina Novaseq 6000 platform. Subsequently, the contigs of the polished assembly were anchored into chromosome-scale scaffolds with YaHS v.1.1 using publicly available Hi-C data for the cheetah (SRR8616936, SRR8616937) that were prepared following the Arima Hi-C mapping pipeline (https://github.com/VGP/vgp-assembly/blob/master/pipeline/salsa/arima_mapping_pipeline.sh). Finally, two iterations of gap-closing were performed with TGS-GapCloser v. 1.1.1 using a different random subset of PacBio reads (25%) for each iteration.

    Annotation:

    Repeat annotation: To improve gene prediction, we first identified and masked the repeats in the assembly. A de novo repeat library was generated with RepeatModeler v.2.0.1 and combined with a Felidae-specific repeat library from RepBase. This custom repeat library was then used with RepeatMasker v.4.1.0 to hard-mask interspersed repeats and soft-mask simple repeats.

    Gene annotation: We predicted genes in the masked assembly based on homology using the GeMoMa pipeline v. 1.7.1 and the following reference assemblies and annotation files: Homo sapiens (GCF_000001405.40), Mus musculus (GCF_000001635.27), Lynx canadensis (GCF_007474595.2), Canis lupus familiaris (GCF_014441545.1), Prionailurus bengalensis (GCF_016509475.1), Leopardus geoffroyi (GCF_018350155.1), Felis catus (GCF_018350175.1), Panthera tigris (GCF_018350195.1), and Panthera leo (GCF_018350215.1). We functionally annotated the predicted proteins using InterProScan v.5.50.84 and a BLASTP v.2.11.0 search against the Swiss-Prot database (release 2021-02). For more details on assembly quality assessment and comparative analyses to other Felidae assemblies, please read the original manuscript.

    This dataset comprises the following files:

    • VMU_Ajub_asm_v1.0.fasta (final unmasked assembly, also available at GenBank under accession GCA_027475565.1)
    • VMU_Ajub_asm_v1.0.fasta.masked (final assembly with all repeats hard-masked)
    • VMU_Ajub_asm_v1.0.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
    • VMU_Ajub_asm_v1.0.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
    • VMU_Ajub_asm_v1.0_hardmaskedTE.fasta (final assembly with all interspersed repeats hard-masked)
    • VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
    • VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
    • VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta (final assembly with all interspersed repeats hard-masked and simple repeats soft-masked)
    • VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
    • VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
    • consensi.fa.classified (de novo repeat library for the final assembly VMU_Ajub_asm_v1.0.fasta generated with RepeatModeler2)
    • Ajub_assembly_commands.txt (list of all commands used to generate the assembly and all related analyses)
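
    The description above reports contig and scaffold N50 values. For readers unfamiliar with the statistic, the following is a minimal Python sketch of the usual N50 definition: the sequence length at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembly length. It is an illustration under that standard definition, not the authors' own script, and the toy lengths in the demo are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running sum reaches at least half of the total length; the length
    of the sequence that crosses this threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    # Invented toy contig lengths, not taken from the cheetah assembly.
    toy_lengths = [80, 70, 50, 40, 30, 20, 10]
    print(n50(toy_lengths))
```

    In practice the lengths would come from the sequence records of a FASTA file such as VMU_Ajub_asm_v1.0.fasta.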

  5. Additional file 3: of A De-Novo Genome Analysis Pipeline (DeNoGAP) for...

    • springernature.figshare.com
    xlsx
    Updated May 30, 2023
    Cite
    Shalabh Thakur; David Guttman (2023). Additional file 3: of A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies [Dataset]. http://doi.org/10.6084/m9.figshare.c.3628187_D2.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Shalabh Thakur; David Guttman
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    List of genomes used for development and testing of the DeNoGAP pipeline. (XLSX 25 kb)

  6. Data from: Globe230k: A Benchmark Dense-Pixel Annotation Dataset for Global...

    • zenodo.org
    txt, zip
    Updated Jul 7, 2024
    Cite
    Qian Shi; Da He; Zhengyu Liu; Xiaoping Liu; Jingqian Xue (2024). Globe230k: A Benchmark Dense-Pixel Annotation Dataset for Global Land Cover Mapping [Dataset]. http://doi.org/10.5281/zenodo.10435661
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Qian Shi; Da He; Zhengyu Liu; Xiaoping Liu; Jingqian Xue
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We (Intelligent Mining and Analysis of Remote Sensing big data, IMARS) created a large-scale annotated dataset (Globe230k) for land use/land cover (LULC) mapping, annotated on Google Earth imagery at 1 m spatial resolution. Globe230k was annotated by numerous experts and by students majoring in surveying and mapping after the necessary training, through visual interpretation of very high-resolution images as well as in-situ field surveys, under the guidance of an organized annotation pipeline. Globe230k has three advantages:

    1) Large scale: Globe230k includes 232,819 annotated images with a size of 512×512 pixels and a spatial resolution of 1 m, totaling more than 3×10^10 annotated pixels, and it covers 10 first-level categories.

    2) Rich diversity: the annotated images are sampled from regions worldwide, with a coverage area of over 60,000 km², indicating high variability and diversity. In addition, to ensure category balance, we intentionally give rare categories, such as wetland and ice/snow, a higher chance of being sampled.

    3) Multi-modal: Globe230k contains not only RGB bands but also other important features for Earth system research, such as the normalized difference vegetation index (NDVI), digital elevation model (DEM), vertical-vertical polarization (VV) bands, and vertical-horizontal polarization (VH) bands, which can facilitate multi-modal data fusion research. Due to the large size of the multi-modal data (DEM 1.91G, NDVI 164G, VVVH 372G), these datasets are stored on Baidu Yunpan. The download link is https://pan.baidu.com/s/12AKbiqOXSf4fnm7mYkCE0g?pwd=230k and the extraction code is 230k.
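
    The scale figures quoted in the three points above can be cross-checked with simple arithmetic. The Python sketch below does so under two assumptions that are not stated in the description: every patch is exactly 512×512 pixels at exactly 1 m per pixel, and patches do not overlap.

```python
# Figures quoted in the dataset description.
IMAGES = 232_819
SIDE_PIXELS = 512
RESOLUTION_M = 1.0  # assumed to be exactly 1 m per pixel

# Total number of annotated pixels across all patches.
pixels_per_image = SIDE_PIXELS * SIDE_PIXELS
total_pixels = IMAGES * pixels_per_image

# Ground area, assuming non-overlapping patches (1 km^2 = 1,000,000 m^2).
area_km2 = total_pixels * RESOLUTION_M ** 2 / 1_000_000

if __name__ == "__main__":
    print(f"total annotated pixels: {total_pixels:,}")
    print(f"approximate coverage: {area_km2:,.0f} km^2")
```

    Under these assumptions the totals are consistent with the stated lower bounds of 3×10^10 annotated pixels and 60,000 km² of coverage.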

    The image patches and their corresponding annotated patches are stored in "image_patch.zip" and "label_patch.zip", respectively. The RGB images are in ".jpg" format, with a size of 512×512 and pixel values ranging from 0 to 255. The annotated patches are in ".png" format, also with a size of 512×512, with pixel values ranging from 1 to 10, which represent 1#cropland, 2#forest, 3#grass, 4#shrubland, 5#wetland, 6#water, 7#tundra, 8#impervious, 9#bareland, and 10#ice/snow, respectively. The corresponding DEM, NDVI and VVVH patches are all in ".tif" format, with a size of 512×512 (because the DEM, NDVI and VVVH patches have different resolutions, they are all resized to the same scale as the image patch).
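
    As an illustration of the label encoding described above, the following Python sketch maps the pixel values 1 to 10 to their category names and computes the share of each category in a label patch. The patch is represented as a plain nested list so the example has no dependencies; decoding an actual ".png" file would need an image library, which is outside this sketch. The function name class_fractions is a hypothetical helper, not part of any Globe230k tooling.

```python
from collections import Counter

# Pixel value to category name, as listed in the dataset description.
CLASS_NAMES = {
    1: "cropland",
    2: "forest",
    3: "grass",
    4: "shrubland",
    5: "wetland",
    6: "water",
    7: "tundra",
    8: "impervious",
    9: "bareland",
    10: "ice/snow",
}


def class_fractions(label_rows):
    """Return the fraction of pixels per category name in one label patch.

    label_rows is a list of rows, each row a list of integer pixel values.
    """
    counts = Counter(value for row in label_rows for value in row)
    unknown = set(counts) - set(CLASS_NAMES)
    if unknown:
        raise ValueError(f"unexpected label values: {sorted(unknown)}")
    total = sum(counts.values())
    return {CLASS_NAMES[value]: count / total for value, count in counts.items()}


if __name__ == "__main__":
    # Invented 2x2 toy patch, far smaller than a real 512x512 patch.
    print(class_fractions([[1, 1], [2, 6]]))
```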

    The 232,819 pairs are officially divided into a training set, a validation set, and a test set at a ratio of 7:1:2; the splits are listed in the "train_num.txt", "val_num.txt", and "test_num.txt" files. Based on this division, the official baseline accuracy of several state-of-the-art semantic segmentation methods can be found in the related article (https://spj.science.org/doi/10.34133/remotesensing.0078).

    We hope it can be used as a benchmark to promote further development of global land cover mapping and semantic segmentation algorithms.

  7. A high-quality genome assembly for Dillenia turbinata (Dilleniales)

    • search.dataone.org
    • datadryad.org
    Updated Nov 29, 2023
    Cite
    Andre Chanderbali; Xing Guo; Qiong Qiong Lin; Shi Xiao Luo; Huan Liu; Matthew Gitzendanner; Steven Smith; Douglas Soltis; Pamela Soltis (2023). A high-quality genome assembly for Dillenia turbinata (Dilleniales) [Dataset]. http://doi.org/10.5061/dryad.msbcc2g3j
    Dataset provided by
    Dryad Digital Repository
    Authors
    Andre Chanderbali; Xing Guo; Qiong Qiong Lin; Shi Xiao Luo; Huan Liu; Matthew Gitzendanner; Steven Smith; Douglas Soltis; Pamela Soltis
    Time period covered
    May 19, 2023
    Description

    Objectives: Dillenia turbinata (Dilleniaceae) is a member of the order Dilleniales, an enigmatic clade of critical importance for understanding the diversification history of flowering plants but for which genome sequences are not available. We have produced and annotated a chromosome-scale whole genome assembly for D. turbinata through the resources of the 10KP (10,000 Plants) Genomes Project. The genome assembly and associated data provided here will serve as a useful resource for comparative and evolutionary genomics research across the flowering plants.

    Data description: The D. turbinata genome was assembled from Oxford Nanopore Technologies (ONT) and whole-genome shotgun (WGS) sequences, and scaffolded into chromosome-scale pseudomolecules using Hi-C data. The genome assembly is 723,739,077 base pairs in length with a BUSCO completeness score of 97%. Twenty-eight scaffolds contain more than 99% of the assembly. The repeat-masked genome sequence is annotated with 36,967 protein-codin...

    Genome assembly and annotation: Raw nanopore reads in fastq format were assembled with Canu v2.2 (Koren et al. 2017) using an estimated genome size of 900 Mb to guide coverage parameters during the read correction, trimming, and assembly steps of the pipeline. The resulting primary assembly was polished with the WGS reads using NextPolish v1.3.1 (Hu et al. 2020), and duplicated constructs were removed by Purge Haplotigs (Roach et al. 2018). The set of deduplicated contigs was scaffolded on the basis of Hi-C reads using the Juicer pipeline (Durand et al. 2016) and 3d-dna tools (Dudchenko et al. 2017) with default parameters. Genome annotation was performed using the MAKER-P pipeline (Campbell et al. 2014) supplied with coding DNA sequences (CDS) from a Trinity (Grabherr et al. 2011) assembly of the Dillenia transcriptome reads, proteomes from four publicly available eudicot genomes (Arabidopsis thaliana, Aquilegia coerulea, Nelumbo nucifera, and Vitis vinifera), and a custom repeat library of tr...

    The included data files may be opened with MS Word (Detailed Methods.docx), MS Excel (Dillenia.genome.assembly.stats.xlsx), standard image viewer software (Dillenia.BUSCO.summaries.png), and standard text editor programs (Dillenia.genome.fasta and Dillenia.maker.predict.36967.final.gff).

  8. Genome and repeat annotation of the phased telomere-to-telomere assembly of...

    • zenodo.org
    zip
    Updated Aug 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hanane El Idrissi; Anestis Gkanogiannis; Driss Iraqi; Siham Khoulassa; Mohamed Fokar; Bouabid Badaoui; Rachid Moussadek; Rachid Mentag; Slimane Khayi (2025). Genome and repeat annotation of the phased telomere-to-telomere assembly of Moroccan argane tree (Argania spinosa) [Dataset]. http://doi.org/10.5281/zenodo.16017913
    Explore at:
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hanane El Idrissi; Anestis Gkanogiannis; Driss Iraqi; Siham Khoulassa; Mohamed Fokar; Bouabid Badaoui; Rachid Moussadek; Rachid Mentag; Slimane Khayi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Morocco
    Description

    This dataset provides the full genome annotation supporting the publication:

    "Phased T2T reference genome assembly of Moroccan Argane (Argania spinosa)"
    Hanane El Idrissi, Anestis Gkanogiannis, Driss Iraqi, Siham Khoulassa, Mohamed Fokar, Bouabid Badaoui, Rachid Moussadek, Rachid Mentag and Slimane Khayi (2025)

    We present the genome annotation files for the phased, telomere-to-telomere (T2T), chromosome-scale genome assembly of Argania spinosa, an ecologically and economically important tree endemic to Morocco. The assembly comprises two fully phased haplotypes, each organized into 11 pseudochromosomes.

    This Zenodo entry includes:

    • Structural gene annotation (GFF3) generated using the Funannotate pipeline

    • Predicted protein sequences (FASTA)

    • Repeat annotations:

      • GFF3 files for simple, complex, and combined repeat annotations

      • Soft-masked genome FASTA (simple and complex repeats masked in lowercase)

      • Hard-masked genome FASTA (repeats replaced with Ns)
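
    The list above distinguishes soft-masked FASTA (repeats in lowercase) from hard-masked FASTA (repeats replaced with Ns). As a minimal Python sketch of how the two conventions could be quantified, the function below counts lowercase bases and uppercase N characters in FASTA text. The function name masking_summary and the toy record are hypothetical, and the count of N characters is only an upper bound on hard-masked bases because assembly gaps are also written as N.

```python
def masking_summary(fasta_lines):
    """Count total, soft-masked, and N bases in an iterable of FASTA lines.

    Soft-masked bases are counted as lowercase characters; uppercase N is
    counted separately and may include assembly gaps as well as hard-masked
    repeats. Header lines starting with '>' are skipped.
    """
    total = soft_masked = n_bases = 0
    for raw_line in fasta_lines:
        line = raw_line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        soft_masked += sum(1 for base in line if base.islower())
        n_bases += line.count("N")
    return {"total": total, "soft_masked": soft_masked, "n_bases": n_bases}


if __name__ == "__main__":
    # Invented toy record, not taken from the Argania spinosa assembly.
    toy_fasta = [">chr_demo", "ACGTacgtNNNN", "acGT"]
    print(masking_summary(toy_fasta))
```

    An open text file handle can be passed directly in place of the toy list, since the function only iterates over lines.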

    Gene prediction was performed using transcript evidence (RNA-Seq from root and leaf tissues), protein homology from Ericales and SwissProt, and de novo ab initio models (AUGUSTUS, GeneMark-ES). Functional annotations were assigned using InterProScan, eggNOG, Pfam, and Gene Ontology databases. Repeat annotation was performed using RepeatModeler and RepeatMasker in a multi-round strategy incorporating both lineage-specific and de novo repeat libraries.

    NCBI BioProject: PRJNA1223813

  9. 3D-Genomics Database

    • neuinfo.org
    • scicrunch.org
    • +2 more
    Updated Oct 17, 2010
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2010). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database.

    3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is to provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies.

    The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG.

    The following eukaryotic genomes are incorporated:

    • Anopheles gambiae, protein sequences from the mosquito genome
    • Arabidopsis thaliana, protein sequences from the Arabidopsis genome
    • Caenorhabditis briggsae, protein sequences from the C. briggsae genome
    • Caenorhabditis elegans, protein sequences from the worm genome
    • Ciona intestinalis, protein sequences from the sea squirt genome
    • Danio rerio, protein sequences from the zebrafish genome
    • Drosophila melanogaster, protein sequences from the fruitfly genome
    • Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome
    • Fugu rubripes, protein sequences from the pufferfish genome
    • Guillardia theta, protein sequences from the G. theta genome
    • Homo sapiens, protein sequences from the human genome
    • Mus musculus, protein sequences from the mouse genome
    • Neurospora crassa, protein sequences from the N. crassa genome
    • Oryza sativa, protein sequences from the rice genome
    • Plasmodium falciparum, protein sequences from the P. falciparum genome
    • Rattus norvegicus, protein sequences from the rat genome
    • Saccharomyces cerevisiae, protein sequences from the yeast genome
    • Schizosaccharomyces pombe, protein sequences from the yeast genome

  10. OKI2018_I69 assembly and annotation of the genome of an individual...

    • zenodo.org
    bin
    Updated Jun 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aleksandra Bliznina (2021). OKI2018_I69 assembly and annotation of the genome of an individual Oikopleura dioica from Okinawa [Dataset]. http://doi.org/10.5281/zenodo.4604144
    Explore at:
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aleksandra Bliznina
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A chromosome-scale assembly of the Oikopleura dioica genome from Okinawa, Japan. The contig assembly was generated from long-read Nanopore data using the Canu pipeline v1.8, and polished with short Illumina MiSeq reads using Pilon v1.22. Both Nanopore and Illumina data were generated from DNA of a single O. dioica male. Hi-C chromosomal conformation capture data were used to order and orient the contigs into scaffolds using the Juicer v1.6 and 3D de novo assembly (3D-DNA) pipelines. The OKI2018_I69_1.0 assembly comprises 19 scaffolds with an N50 of 16.2 Mbp (OKI2018_I69_1.0.fa). The total assembly length is 64.3 Mbp. The five longest scaffolds represent autosomal chromosomes (chr 1 and chr 2), and sex chromosomes split into pseudo-autosomal region (PAR) and X-specific (XSR) or Y-specific (YSR) regions. One of the smaller scaffolds represents a draft assembly of the mitochondrial genome (chrUn_12). The rest of the scaffolds are highly repetitive and were marked as unplaced. The OKI2018_I69_1.0 assembly was annotated with the AUGUSTUS v3.3 and MAKER v3.01.03 pipelines. Gene predictions from these tools were refined and merged using EvidenceModeler v1.1.1. To predict UTRs and alternative isoforms, the EVM models were updated using two rounds of the PASA pipeline, resulting in 18,485 transcript models distributed among 16,936 protein-coding genes (OKI2018_I69_1.0.gene_models.gff3).
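
    The description above reports transcript and gene totals for a GFF3 file. As a minimal Python sketch of how such totals could be tallied, the function below counts the feature types in column 3 of tab-separated GFF3 lines. Whether the real file labels its features "gene" and "mRNA" is an assumption, and the demo lines are invented rather than taken from OKI2018_I69_1.0.gene_models.gff3.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 feature types (column 3) in an iterable of lines.

    Comment and directive lines are skipped, and reading stops at an
    embedded ##FASTA section if one is present.
    """
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # skip malformed lines rather than guess at their meaning
        counts[columns[2]] += 1
    return counts


if __name__ == "__main__":
    # Invented demo records: one gene with two transcript isoforms.
    demo = [
        "##gff-version 3",
        "chr_demo\tsketch\tgene\t1\t900\t.\t+\t.\tID=g1",
        "chr_demo\tsketch\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
        "chr_demo\tsketch\tmRNA\t1\t700\t.\t+\t.\tID=g1.t2;Parent=g1",
    ]
    counts = count_feature_types(demo)
    print(counts["gene"], counts["mRNA"])
```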

  11. Chromosome-scale genome assembly of the African spiny mouse (Acomys...

    • zenodo.org
    bin
    Updated Apr 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Danny Miller (2023). Chromosome-scale genome assembly of the African spiny mouse (Acomys cahirinus) [Dataset]. http://doi.org/10.5281/zenodo.7761277
    Explore at:
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danny Miller
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Genomic DNA was extracted from blood from a single male A. cahirinus animal using a Monarch HMW DNA Extraction Kit for Cells & Blood (T3050, New England Biolabs, Ipswich MA) following the manufacturer’s recommended protocol. DNA was quantified prior to library construction using the Qubit DNA HS Assay (Thermo Fisher, Waltham MA) and DNA fragment lengths were assessed using the Agilent Femto Pulse System (Santa Clara, CA). Libraries were prepared for sequencing using the Oxford Nanopore ligation kit (SQK-LSK110) following the manufacturer’s instructions, except that DNA repair and A-tailing were performed for 30 min and the ligation was allowed to continue for 1 hr. Prepared libraries were quantified using a Qubit fluorometer, and 30 fmol of the library was loaded onto a Nanopore version R.9.4.1 flow cell, which was loaded on a PromethION running MinKNOW version 21.05.20. To increase output, the flow cell was washed after approximately 24 hr of sequencing, then an additional 12 fmol of library was added to the flow cell and run for an additional 48 hr. Basecalling was performed using Guppy 5.0.12 (Oxford Nanopore) using the superior model (dna_r9.4.1_450bps_sup_prom.cfg).

    FASTQ files for assembly were extracted from unaligned BAM files using samtools (Li et al. 2009), then assembled with Flye version 2.9 using the --nano-hq flag (Kolmogorov et al. 2019). Haplotigs and overlaps in the assembly were purged using purge_dups (https://github.com/dfguan/purge_dups). The assembly was then polished using Medaka version 1.4.2 (https://github.com/nanoporetech/medaka) followed by a second polishing step with Pilon version 1.24 (Walker et al. 2014). Assembly statistics at each step were generated using Quast (Gurevich et al. 2013) and BUSCO (Simão et al. 2015) (Table S2). The primary contigs assembled from the Nanopore data were anchored to chromosomes using 505,210,505 read pairs of a Hi-C library isolated from another A. cahirinus individual of unknown sex downloaded from the NCBI Short Read Archive (SRX13258644) (Wang et al. 2022). After aligning the Hi-C reads with the Arima Hi-C Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline), YaHS v1.0 (Zhou et al. 2023) was used with default error correction for scaffolding, and Juicebox v1.11.08 (Dudchenko et al. 2018) was used to generate a Hi-C contact map.

    Progressive Cactus (Armstrong et al. 2020) was used to perform a whole-genome alignment of the A. cahirinus draft assembly to the Mus musculus GRCm39 reference genome (RefSeq GCF_000001635.27_GRCm39). Comparative annotation of the draft genomes was then performed using the Comparative Annotation Toolkit (CAT) (Fiddes et al. 2018). Briefly, the M. musculus RefSeq annotation GFF was parsed and validated with the “parse_ncbi_gff3” and “validate_gff3” programs (respectively) from CAT. The M. musculus reference transcript cDNA sequences were downloaded and mapped to the M. musculus draft genome with minimap2 (Li 2018) and provided to CAT as long-read RNA-seq reads in the “[ISO_SEQ_BAM]” field of the configuration file. For A. cahirinus, bulk RNA-seq data obtained from multiple pooled organs were downloaded from NCBI SRA BioProject PRJNA342864 (Bellofiore et al. 2017) and mapped to the draft assembly with STAR (Dobin et al. 2013), then provided to CAT in the “[BAMS]” field. CpG islands were identified using the cpg_lh utility from the UCSC suite of tools (Kent et al. 2002).

  12. Lizard dataset

    • kaggle.com
    zip
    Updated Nov 27, 2021
    Cite
    Aadam (2021). Lizard dataset [Dataset]. https://www.kaggle.com/datasets/aadimator/lizard-dataset
    Explore at:
    Available download formats: zip (786544364 bytes)
    Dataset updated
    Nov 27, 2021
    Authors
    Aadam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, there is a major bottleneck in the success of such approaches because supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in the field of CPath because the generation of detailed annotations usually demands the input of a pathologist to be able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. Yet, solely relying on automatic generation of annotations will limit the accuracy and reliability of ground truth. Therefore, to help overcome the above challenges, we propose a multi-stage annotation pipeline to enable the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We will publish the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.

    Link to the dataset paper.

    Citation

    @inproceedings{graham2021lizard,
     title={Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification},
     author={Graham, Simon and Jahanifar, Mostafa and Azam, Ayesha and Nimir, Mohammed and Tsang, Yee-Wah and Dodd, Katherine and Hero, Emily and Sahota, Harvir and Tank, Atisha and Benes, Ksenija and others},
     booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
     pages={684--693},
     year={2021}
    }
    

    Acknowledgements

    We would like to acknowledge the following institutions, where the images in this dataset originated from:

    • University Hospitals Coventry and Warwickshire, United Kingdom
    • Histo Pathology Diagnostic Center, Shanghai, China
    • Ruijin Hospital, Shanghai, China
    • Xijing Hospital, Xi'an, China
    • Shanghai Songjiang District Central Hospital, Shanghai, China
    • The National Cancer Institute (NCI), United States of America

  13. Additional file 3: of Genome-wide sequencing and metabolic annotation of...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Jun 29, 2019
    Cite
    Riaño-Pachón, Diego; Fernandes, Bruna; Zaiat, Marcelo; Pradella, José; Resende, Tiago; Rocha, Isabel; Dias, Oscar; Neto, Antonio Kaupert; Costa, Gisela; Oliveira, Juliana (2019). Additional file 3: of Genome-wide sequencing and metabolic annotation of Pythium irregulare CBS 494.86: understanding Eicosapentaenoic acid production [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000168537
    Explore at:
    Dataset updated
    Jun 29, 2019
    Authors
    Riaño-Pachón, Diego; Fernandes, Bruna; Zaiat, Marcelo; Pradella, José; Resende, Tiago; Rocha, Isabel; Dias, Oscar; Neto, Antonio Kaupert; Costa, Gisela; Oliveira, Juliana
    Description

    Pipeline classification - Pipeline classification for annotation and reconstruction of genome-scale metabolic models, established according to dataset analysis. (XLSX 1790 kb)

  14. Data from: Aphidinae comparative genomics resource

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    • +2 more
    Updated Jul 17, 2024
    Cite
    Mathers, Thomas C (2024). Aphidinae comparative genomics resource [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_5908004
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Wouters, Roland H M
    Swarbreck, David
    Van Oosterhout, Cock
    Mugford, Sam T
    Botha, Anna-Maria
    Hogenhout, Saskia A
    Heavens, Darren
    Mathers, Thomas C
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we provide early access to 18 new genome assemblies, including 8 assembled to chromosome-scale, for aphids from the subfamily Aphidinae. For consistency and to aid comparative analysis, all genomes have been annotated using the same repeat masking and RNA-seq-based gene prediction pipeline. Using this pipeline we also provide new annotations for three previously published genome assemblies.

    The genome assemblies and annotations are made freely available without restriction; we only request that this Zenodo resource is cited when using the data. Raw sequence data upload to NCBI is underway, and full details of all accessions will be given in an updated version of this resource. Manuscripts describing the individual genome assemblies in detail and larger comparative genome analyses are in preparation, and we will update this resource with additional citation information as papers are published.

    Full details of all genome assemblies and annotations included in this release are given in the attached "Data_Description.pdf" document.

    Aphid species included in this release (bold type = chromosome-scale assembly):

    Aphis fabae; Aphis glycines (updated annotation); Aphis gossypii; Aphis thalictri; Aphis rumicis; Brachycaudus cardui; Brachycaudus helichrysi; Brachycaudus klugkisti; Brevicoryne brassicae; Diuraphis noxia; Macrosiphum albifrons; Metopolophium dirhodum; Myzus cerasi (updated annotation); Myzus ligustri; Myzus lythri; Myzus varians; Pentalonia nigronervosa (updated annotation); Phorodon humuli; Rhopalosiphum padi; Sitobion avenae; Sitobion miscanthi

  15. LifeDB

    • neuinfo.org
    • scicrunch.org
    • +2 more
    Updated Jun 30, 2024
    Cite
    (2024). LifeDB [Dataset]. http://identifiers.org/RRID:SCR_006899
    Explore at:
    Dataset updated
    Jun 30, 2024
    Description

    Database that integrates large-scale functional genomics assays and manual cDNA annotation with bioinformatics gene expression and protein analysis. LifeDB integrates data on full-length cDNA clones with data on the expression of the encoded proteins and their subcellular localization in mammalian cell lines. LifeDB enables the scientific community to systematically search for and select genes, proteins, and cDNAs of interest by specific database identifiers or by gene name. It supports visualization of cDNA clones and of the subcellular location of proteins, and it links results to external biological databases to provide broader functional information. LifeDB also provides an annotation pipeline that improves the mapping of clones to known human reference transcripts from the RefSeq and Ensembl databases. An advanced web interface lets researchers view the data in a more user-friendly manner. Users can search using any one of the following options, available in both "Search genes and cDNA clones" and "Search sub-cellular locations of human proteins": by keyword, by gene/transcript identifier, by plate name, by clone name, or by cellular location.

    • The "Search genes and cDNA clones" results include: Gene Name, Ensembl ID, Genomic Region, Clone name, Plate name, Plate position, Classification class, Synonymous SNPs, Non-synonymous SNPs, Number of ambiguous positions, and Alignment with reference genes.

    • The "Search sub-cellular locations of human proteins" results include: Subcellular location, Gene Name, Ensembl ID, Clone name, True localization, Images, Start tag and End tag.

    Every result page has an option to download the result data (excluding the microscopy images). Clicking the "Download results as CSV-file" link on the result page gives the user the choice to open or save the result data as a CSV (comma-separated values) file, which can then be opened in Excel or OpenOffice.

  16. Robust subset of annotations from the IBD dataset.

    • figshare.com
    xls
    Updated Dec 2, 2024
    Cite
    Baptiste Ruiz; Arnaud Belcour; Samuel Blanquart; Sylvie Buffet-Bataillon; Isabelle Le Huërou-Luron; Anne Siegel; Yann Le Cunff (2024). Robust subset of annotations from the IBD dataset. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012577.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Baptiste Ruiz; Arnaud Belcour; Samuel Blanquart; Sylvie Buffet-Bataillon; Isabelle Le Huërou-Luron; Anne Siegel; Yann Le Cunff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Robust subset of annotations from the IBD dataset.

  17. MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation –...

    • zenodo.org
    bin, zip
    Updated Nov 10, 2025
    Cite
    Marco Cremaschi; Federico Belotti; Jennifer D'Souza; Matteo Palmonari (2025). MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation – Training, Testing, and Detecting Weaknesses [Dataset]. http://doi.org/10.5281/zenodo.16562700
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marco Cremaschi; Federico Belotti; Jennifer D'Souza; Matteo Palmonari
    License

    https://www.gnu.org/licenses/agpl.txt

    Description

    MammoTab 25 is a large‑scale, richly‑annotated benchmark designed to advance research on Semantic Table Interpretation (STI) and to evaluate the reasoning abilities of modern Large Language Models (LLMs).

    • Scale and origin – The corpus contains 838930 tables automatically extracted from 63 million English‑language Wikipedia pages.

    • Comprehensive annotations – Every table is accompanied by:

      • Cell-Entity Annotation (CEA),

      • Column-Type Annotation (CTA),

      • Columns-Property Annotation (CPA),

      • Four ready‑to‑use prompt templates for LLM training and stress‑testing,

      • Fine‑grained metadata capturing column roles (Named‑Entity vs Literal), NIL flags, header/caption context, and structural statistics.

    • Challenge coverage – Tagged metadata enables users to isolate and diagnose all key STI challenges, including multi-domain tables, acronyms, aliases, typos, approximate numeric values, and true NIL mentions, making the dataset suitable for both benchmarking and error analysis.

    • Format & access – Tables are stored as CSV files; annotations are provided in separate CSVs following the SemTab format; contextual information is packed in JSON side‑cars.

    The pipeline for regenerating the dataset is openly available on GitHub at https://github.com/unimib-datAI/mammotab/.

    The documentation is available at https://unimib-datai.github.io/mammotab-docs/.

  18. Structural Annotation of Mycobacterium tuberculosis Proteome

    • plos.figshare.com
    tiff
    Updated May 30, 2023
    Cite
    Praveen Anand; Sandhya Sankaran; Sumanta Mukherjee; Kalidas Yeturu; Roman Laskowski; Anshu Bhardwaj; Raghu Bhagavat; Samir K. Brahmachari; Nagasuma Chandra (2023). Structural Annotation of Mycobacterium tuberculosis Proteome [Dataset]. http://doi.org/10.1371/journal.pone.0027044
    Explore at:
    Available download formats: tiff
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Praveen Anand; Sandhya Sankaran; Sumanta Mukherjee; Kalidas Yeturu; Roman Laskowski; Anshu Bhardwaj; Raghu Bhagavat; Samir K. Brahmachari; Nagasuma Chandra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Of the ∼4000 ORFs identified through the genome sequence of Mycobacterium tuberculosis (TB) H37Rv, experimentally determined structures are available for 312. Since knowledge of protein structures is essential to obtain a high-resolution understanding of the underlying biology, we seek to obtain a structural annotation for the genome, using computational methods. Structural models were obtained and validated for ∼2877 ORFs, covering ∼70% of the genome. Functional annotation of each protein was based on fold-based functional assignments and a novel binding site based ligand association. New algorithms for binding site detection and genome scale binding site comparison at the structural level, recently reported from the laboratory, were utilized. Besides these, the annotation covers detection of various sequence and sub-structural motifs and quaternary structure predictions based on the corresponding templates. The study provides an opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 folds. New insights about the folds that predominate in the genome, as well as the fold-combinations that make up multi-domain proteins are also obtained. 1728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses. The resource (http://proline.physics.iisc.ernet.in/Tbstructuralannotation), being one of the first to be based on structure-derived functional annotations at a genome scale, is expected to be useful for better understanding of TB and for application in drug discovery. The reported annotation pipeline is fairly generic and can be applied to other genomes as well.

  19. ReasonPlan_PDR

    • huggingface.co
    Updated May 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Liu Xueyi (2025). ReasonPlan_PDR [Dataset]. https://huggingface.co/datasets/LiuxyIA/ReasonPlan_PDR
    Explore at:
    Dataset updated
    May 26, 2025
    Authors
    Liu Xueyi
    Description

    Dataset Card for Dataset Name

    PDR is the official dataset used in the paper https://huggingface.co/papers/2505.20024.

      Dataset Details
    

    PDR is a large-scale instruction dataset tailored for closed-loop planning, which contains 203,353 training samples and 11,047 testing samples. Using an automated annotation pipeline, PDR captures the complete decision reasoning process in training scenarios on the Bench2Drive, including the following stages: Scene Understanding, Traffic Sign… See the full description on the dataset page: https://huggingface.co/datasets/LiuxyIA/ReasonPlan_PDR.

  20. FANTASIA V3 - LookUp Table - UniProt July 2025 - Experimental Evidence code

    • zenodo.org
    csv, tar
    Updated Jul 29, 2025
    Cite
    Francisco M. Perez-Canales; Ana Rojas (2025). FANTASIA V3 - LookUp Table - UniProt July 2025 - Experimental Evidence code [Dataset]. http://doi.org/10.5281/zenodo.16582433
    Explore at:
    Available download formats: csv, tar
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco M. Perez-Canales; Ana Rojas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📘 FANTASIA V3 - LookUp Table (UniProt July 2025)

    Experimental Evidence Code Only

    Overview

    This is a PostgreSQL database backup using the pgvector extension to store high-dimensional protein embeddings. It contains precomputed embeddings and functional annotations from the UniProt July 2025 release, including only entries supported by experimental evidence.

    This lookup table was generated using version v2.0.0 of the Protein Information System (PIS), an integrated biological information system designed for the automated extraction, processing, and management of protein-related data. PIS consolidates information from UniProt, PDB, and GOA, allowing efficient retrieval and organization of sequences, structures, and annotations.

    The resulting database is designed for compatibility with FANTASIA V3, an advanced pipeline for large-scale functional annotation of proteins using state-of-the-art Protein Language Models (PLMs). While the lookup table is stored in a vector database for persistence, FANTASIA loads the relevant data into memory at runtime to enable high-speed annotation.

    FANTASIA uses precomputed deep learning embeddings to perform nearest-neighbor searches in embedding space and transfer Gene Ontology (GO) terms from experimentally annotated proteins to query sequences.
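
    The nearest-neighbor transfer described above can be illustrated with a small sketch. This is not FANTASIA's implementation: the function names, the use of cosine distance, the `k` parameter, and the toy two-dimensional embeddings are all assumptions made for illustration, and real embeddings from the models listed below have far more dimensions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_go_terms(query, lookup, k=1):
    """Collect GO terms from the k reference proteins nearest to the query.

    `lookup` is a list of (protein_id, embedding, go_terms) tuples, standing
    in for rows loaded from the lookup table (hypothetical layout).
    """
    ranked = sorted(lookup, key=lambda entry: cosine_distance(query, entry[1]))
    terms = set()
    for _protein_id, _embedding, go_terms in ranked[:k]:
        terms.update(go_terms)
    return terms


# Toy reference set with made-up identifiers and GO terms.
toy_lookup = [
    ("P1", [1.0, 0.0], {"GO:0000001"}),
    ("P2", [0.0, 1.0], {"GO:0000002"}),
    ("P3", [0.9, 0.1], {"GO:0000003"}),
]
print(sorted(transfer_go_terms([1.0, 0.05], toy_lookup, k=2)))
# → ['GO:0000001', 'GO:0000003']
```

    A brute-force sort is used here only for clarity; a vector index such as the pgvector extension mentioned in the overview serves the same purpose at scale.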

    Dataset Details

    • Total proteins: 127,546

    • Total sequences: 124,397

    • Total embeddings: 621,849

    • Total GO annotations: 627,932

    • Included evidence codes (Gene Ontology, experimental only):

      • EXP – Inferred from Experiment

      • IDA – Inferred from Direct Assay

      • IPI – Inferred from Physical Interaction

      • IMP – Inferred from Mutant Phenotype

      • IGI – Inferred from Genetic Interaction

      • IEP – Inferred from Expression Pattern

      • TAS – Traceable Author Statement

      • IC – Inferred by Curator
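
    Restricting annotations to the evidence codes listed above amounts to a simple membership filter. The sketch below assumes annotations are available as (protein_id, go_term, evidence_code) tuples; that layout and the names `INCLUDED_CODES` and `keep_included` are hypothetical and do not describe the database schema.

```python
# Evidence codes listed in the dataset description.
INCLUDED_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def keep_included(annotations):
    """Keep only annotations whose evidence code is in INCLUDED_CODES."""
    return [a for a in annotations if a[2] in INCLUDED_CODES]


# Made-up example rows; IEA (electronic annotation) is not in the list above.
rows = [
    ("P1", "GO:0000001", "IDA"),
    ("P2", "GO:0000002", "IEA"),
    ("P3", "GO:0000003", "TAS"),
]
print(keep_included(rows))
# → [('P1', 'GO:0000001', 'IDA'), ('P3', 'GO:0000003', 'TAS')]
```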

    Included Embedding Models

    • ESM-2 (650M parameters)
      A transformer-based protein language model trained on UniRef50 using masked language modeling. It captures structural and functional features directly from raw sequences without requiring MSAs. ESM-2 is widely used for contact map prediction, unsupervised learning, and representation extraction.

    • ProtT5-XL-UniRef50 (~1.2B parameters)
      A large-scale encoder-decoder model using the T5 architecture, trained on UniRef50 via masked span prediction. It generates high-dimensional sequence representations that perform well across structure and function prediction tasks.

    • ProstT5 (~1.2B parameters)
      A multi-modal extension of ProtT5, trained to predict both sequence and coarse-grained 3Di structural states. Useful for downstream applications like contact prediction, functional annotation, and classification.

    • Ankh3-Large (620M parameters)
      An encoder-only T5-style model trained with masked span prediction. Optimized for fast inference, it encodes both semantic and structural protein information and can replace ProtT5 in many ML pipelines.

    • ESM3c (Cambrian 600M)
      Part of the new ESM C model family, trained on UniRef, MGnify, and JGI datasets. With rotary embeddings and 36 layers, it offers enhanced performance for masked language modeling, producing high-quality structural and functional embeddings without alignments.

    Missing Proteins

    A small subset of proteins could not be processed on the Finisterrae III (CESGA) supercomputer due to memory limitations with 40 GB A100 GPUs.

    The file missing_proteins.csv lists all affected UniProt identifiers. These entries are excluded from the final lookup table.

Cite
Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy (2025). Functional annotation for 15 diverse arthropod genomes [Dataset]. http://doi.org/10.15482/USDA.ADC/1522860

Data from: Functional annotation for 15 diverse arthropod genomes

Related Article
Explore at:
Available download formats: application/x-gzip
Dataset updated
Nov 22, 2025
Dataset provided by
Ag Data Commons
Authors
Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy
License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

We present the annotation results of 15 arthropod proteomes using an open source, open access and containerized pipeline for genome-scale functional annotation of insect proteomes and apply it to a diverse range of arthropod species. You can find more information about the pipeline at our readthedocs site. The files for each genome include GOanna, InterproScan and KOBAS predictions. Arthropod genomes selected for this study and their assembly and annotation statistics.

• Apis Mellifera (honey bee)
• Drosophila melanogaster (fruit fly)
• Tribolium castaneum (red flour beetle)
• Latrodectus hesperus (Western black widow spider)
• Limnephilus lunatus (caddisfly)
• Oncopeltus fasciatus (Large milkweed bug)
• Homalodisca vitripennis (Glassy-winged sharpshooter)
• Eurytemora affinis (calanoid copepod)
• Agrilus planipennis (emerald ash borer)
• Copidosoma floridanum (parasitoid wasp)
• Athalia rosae (turnip sawfly)
• Ceratitis capitata (Mediterranean fruit fly)
• Cimex lectularius (Cimicidae bed bug)
• Varroa destructor (parasitic mite)
• Diaphorina citri (Asian citrus psyllid)

Resources in this dataset:

• Resource Title: Cimex lectularius (Cimicidae bed bug) annotation. File Name: CLEC.tar.gz. Resource Description: Functional annotation for Clec-OGSv1.2 protein set
• Resource Title: Tribolium castaneum (red flour beetle) annotation. File Name: TCAS.tar.gz. Resource Description: Functional annotation for TCAS_OGS_v3 protein set
• Resource Title: Drosophila melanogaster (fruit fly) annotation. File Name: DMEL.tar.gz. Resource Description: Functional annotation for DMEL_r6.38 protein set
• Resource Title: Varroa destructor (parasitic mite) annotation. File Name: VDES.tar.gz. Resource Description: Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1)
• Resource Title: Oncopeltus fasciatus (Large milkweed bug) annotation. File Name: ONCFAS.tar.gz. Resource Description: Functional annotation for oncfas_OGSv1.2 protein set
• Resource Title: Apis Mellifera (honey bee) annotation. File Name: AMEL.tar.gz. Resource Description: Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1)
• Resource Title: Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File Name: HVIT.tar.gz. Resource Description: Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1)
• Resource Title: Limnephilus lunatus (caddisfly) annotation. File Name: LLUN.tar.gz. Resource Description: Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1)
• Resource Title: Latrodectus hesperus (Western black widow spider) annotation. File Name: LHES.tar.gz. Resource Description: Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1)
• Resource Title: Eurytemora affinis (calanoid copepod) annotation. File Name: EAFF.tar.gz. Resource Description: Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1)
• Resource Title: Copidosoma floridanum (parasitoid wasp) annotation. File Name: CFLO.tar.gz. Resource Description: Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1)
• Resource Title: Ceratitis capitata (Mediterranean fruit fly) annotation. File Name: CCAP.tar.gz. Resource Description: Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2)
• Resource Title: Athalia rosae (turnip sawfly) annotation. File Name: AROS.tar.gz. Resource Description: Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1)
• Resource Title: Agrilus planipennis (emerald ash borer) annotation. File Name: APLA.tar.gz. Resource Description: Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1)
