47 datasets found

Data from: Functional annotation for 15 diverse arthropod genomes
agdatacommons.nal.usda.gov
catalog.data.gov
application/x-gzip
Updated Nov 22, 2025
Cite
Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy (2025). Functional annotation for 15 diverse arthropod genomes [Dataset]. http://doi.org/10.15482/USDA.ADC/1522860
Available download formats
application/x-gzip
Unique identifier
https://doi.org/10.15482/USDA.ADC/1522860
Dataset updated
Nov 22, 2025
Dataset provided by
Ag Data Commons
Authors
Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We present annotation results for 15 arthropod proteomes, generated with an open-source, open-access, containerized pipeline for genome-scale functional annotation of insect proteomes and applied here to a diverse range of arthropod species. You can find more information about the pipeline at our readthedocs site. The files for each genome include GOanna, InterProScan and KOBAS predictions.

Arthropod genomes selected for this study and their assembly and annotation statistics.

Apis mellifera (honey bee)
Drosophila melanogaster (fruit fly)
Tribolium castaneum (red flour beetle)
Latrodectus hesperus (Western black widow spider)
Limnephilus lunatus (caddisfly)
Oncopeltus fasciatus (Large milkweed bug)
Homalodisca vitripennis (Glassy-winged sharpshooter)
Eurytemora affinis (calanoid copepod)
Agrilus planipennis (emerald ash borer)
Copidosoma floridanum (parasitoid wasp)
Athalia rosae (turnip sawfly)
Ceratitis capitata (Mediterranean fruit fly)
Cimex lectularius (Cimicidae bed bug)
Varroa destructor (parasitic mite)
Diaphorina citri (Asian citrus psyllid)

Resources in this dataset:

Resource Title: Cimex lectularius (Cimicidae bed bug) annotation. File Name: CLEC.tar.gz. Resource Description: Functional annotation for Clec-OGSv1.2 protein set
Resource Title: Tribolium castaneum (red flour beetle) annotation. File Name: TCAS.tar.gz. Resource Description: Functional annotation for TCAS_OGS_v3 protein set
Resource Title: Drosophila melanogaster (fruit fly) annotation. File Name: DMEL.tar.gz. Resource Description: Functional annotation for DMEL_r6.38 protein set
Resource Title: Varroa destructor (parasitic mite) annotation. File Name: VDES.tar.gz. Resource Description: Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1)
Resource Title: Oncopeltus fasciatus (Large milkweed bug) annotation. File Name: ONCFAS.tar.gz. Resource Description: Functional annotation for oncfas_OGSv1.2 protein set
Resource Title: Apis mellifera (honey bee) annotation. File Name: AMEL.tar.gz. Resource Description: Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1)
Resource Title: Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File Name: HVIT.tar.gz. Resource Description: Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1)
Resource Title: Limnephilus lunatus (caddisfly) annotation. File Name: LLUN.tar.gz. Resource Description: Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1)
Resource Title: Latrodectus hesperus (Western black widow spider) annotation. File Name: LHES.tar.gz. Resource Description: Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1)
Resource Title: Eurytemora affinis (calanoid copepod) annotation. File Name: EAFF.tar.gz. Resource Description: Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1)
Resource Title: Copidosoma floridanum (parasitoid wasp) annotation. File Name: CFLO.tar.gz. Resource Description: Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1)
Resource Title: Ceratitis capitata (Mediterranean fruit fly) annotation. File Name: CCAP.tar.gz. Resource Description: Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2)
Resource Title: Athalia rosae (turnip sawfly) annotation. File Name: AROS.tar.gz. Resource Description: Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1)
Resource Title: Agrilus planipennis (emerald ash borer) annotation. File Name: APLA.tar.gz. Resource Description: Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1)
Supplementary materials for "Robustness analysis of metabolic predictions in...
zenodo.org
data-staging.niaid.nih.gov
xml, zip
Updated Apr 1, 2021
Cite
Elham Karimi; Enora Geslain; Arnaud Belcour; Clémence Frioux; Méziane Aïte; Anne Siegel; Erwan Corre; Simon M. Dittami (2021). Supplementary materials for "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines" [Dataset]. http://doi.org/10.5281/zenodo.4436003
Available download formats
zip, xml
Unique identifier
https://doi.org/10.5281/zenodo.4436003
Dataset updated
Apr 1, 2021
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Elham Karimi; Enora Geslain; Arnaud Belcour; Clémence Frioux; Méziane Aïte; Anne Siegel; Erwan Corre; Simon M. Dittami
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary materials for the revised version of the PeerJ preprint "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines".
Genome Annotation of the Biting Midges Culicoides sonorensis and Culicoides...
researchdata.edu.au
data.csiro.au
datadownload
Updated Nov 17, 2025
Cite
Gunjan Pandey; Debbie Eagles; Stacey Lynch; Prasad Paradkar; Tom Walsh; Rahul Rane; Leon Court; Melissa Klein; Asif Ahmed; Khandaker Asif Ahmed (2025). Genome Annotation of the Biting Midges Culicoides sonorensis and Culicoides stellifer Generated Using the EGAPx-alpha Pipeline [Dataset]. http://doi.org/10.25919/1TF9-BN03
Available download formats
datadownload
Unique identifier
https://doi.org/10.25919/1TF9-BN03
Dataset updated
Nov 17, 2025
Dataset provided by
CSIRO, https://www.csiro.au/
Authors
Gunjan Pandey; Debbie Eagles; Stacey Lynch; Prasad Paradkar; Tom Walsh; Rahul Rane; Leon Court; Melissa Klein; Asif Ahmed; Khandaker Asif Ahmed
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Sep 10, 2022 - Oct 10, 2025
Description
This collection contains gene annotation datasets for Culicoides sonorensis and Culicoides stellifer (Diptera: Ceratopogonidae), two biting midge species of veterinary significance. Culicoides sonorensis is a confirmed vector of Bluetongue and Epizootic Hemorrhagic Disease viruses in North America, while C. stellifer has been implicated in Orbivirus transmission in the southeastern United States. The reference genome assemblies for these species are publicly available at NCBI under accession numbers GCA_047716325.1 (C. sonorensis) and GCA_040583785.1 (C. stellifer).

Gene models for both assemblies were predicted using the EGAPx-alpha pipeline (egapx:0.3.2-alpha). The same C. sonorensis RNA-Seq dataset (ERR2171964 – ERR2171978) served as transcriptomic evidence for model training in both cases, because no C. stellifer RNA-Seq data are currently available (last checked 17/11/2025). The dataset includes gene model coordinates (GFF), transcript nucleotide sequences (FNA), and predicted protein sequences (FAA).
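The RNA-Seq evidence is given as a run-accession range. The sketch below expands such a range into individual accessions, for example to build a download list. It assumes the range is contiguous and inclusive, which the description implies but does not state; the function name `expand_accession_range` is hypothetical.

```python
from typing import List


def expand_accession_range(first: str, last: str) -> List[str]:
    """Expand an inclusive accession range, e.g. ERR2171964 to ERR2171978.

    Assumption: every integer between the two endpoints is a valid run.
    """
    prefix = first.rstrip("0123456789")
    if last.rstrip("0123456789") != prefix:
        raise ValueError("accessions must share the same prefix")
    width = len(first) - len(prefix)  # keep any zero padding
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    if end < start:
        raise ValueError("range end precedes range start")
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


runs = expand_accession_range("ERR2171964", "ERR2171978")
print(len(runs), runs[0], runs[-1])
```

Under the contiguity assumption, the range covers 15 runs.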

These curated annotation datasets were generated to support the forthcoming manuscript: Chromosome-scale genome of Culicoides brevitarsis highlights genetic basis of vector competency. They provide consistent annotation resources for cross-species comparative analyses among Culicoides midges.

Lineage:

Culicoides sonorensis:
Genome assembly: GCA_047716325.1 (idCulSono.KS.ABADRU.1.0.female)
Isolate: Kansas colony
WGS project: JBLLJK01
Submitter: Ag100Pest Initiative (USDA-ARS ABADRU)
Release date: 12 Feb 2025
Annotation pipeline: EGAPx-alpha (egapx:0.3.2-alpha)
Transcript evidence: C. sonorensis RNA-Seq ERR2171964 – ERR2171978

Culicoides stellifer:
Genome assembly: GCA_040583785.1 (c_stellifer_primary030_purged)
Submitter: University of Guelph
Release date: 10 Jul 2024
Annotation pipeline: EGAPx-alpha (egapx:0.3.2-alpha)
Transcript evidence: Shared C. sonorensis RNA-Seq dataset (ERR2171964 – ERR2171978) used for training and gene structure validation
Data from: A chromosome-scale high-contiguity genome assembly of the...
data.niaid.nih.gov
search.dataone.org
zip
Updated Jan 18, 2023
Cite
Sven Winter; René Meißner; Carola Greve; Alexander Ben Hamadou; Petr Horin; Stefan Prost; Pamela A. Burger (2023). A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus) [Dataset]. http://doi.org/10.5061/dryad.xksn02vkr
Available download formats
zip
Unique identifier
https://doi.org/10.5061/dryad.xksn02vkr
Dataset updated
Jan 18, 2023
Dataset provided by
University of Oulu
University of Veterinary Medicine Vienna
LOEWE Centre for Translational Biodiversity Genomics
University of Veterinary Science Brno
Authors
Sven Winter; René Meißner; Carola Greve; Alexander Ben Hamadou; Petr Horin; Stefan Prost; Pamela A. Burger
License
https://spdx.org/licenses/CC0-1.0.html
Description
The cheetah (Acinonyx jubatus, SCHREBER 1775) is a large felid and is considered the fastest land animal. Historically, it inhabited open grassland across Africa, the Arabian Peninsula, and southwestern Asia; however, only small and fragmented populations remain today. Here, we present a de novo genome assembly of the cheetah based on PacBio continuous long reads and Hi-C proximity ligation data. The final assembly (VMU_Ajub_asm_v1.0) has a total length of 2.38 Gb, of which 99.7% is anchored into the expected 19 chromosome-scale scaffolds. The contig and scaffold N50 values of 96.8 Mb and 144.4 Mb, respectively, a BUSCO completeness of 95.4% and a k-mer completeness of 98.4% emphasize the high quality of the assembly. Furthermore, annotation of the assembly identified 23,622 genes and a repeat content of 40.4%. This new highly contiguous and chromosome-scale assembly will greatly benefit conservation and evolutionary genomic analyses and will be a valuable resource, e.g., to gain a detailed understanding of the function and diversity of immune response genes in felids.

Methods

The presented data is related to the eponymous publication "A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus)" soon to be published in the Journal of Heredity. Any questions regarding this dataset or the publication can be addressed to the corresponding authors, Sven Winter (sven.winter@vetmeduni.ac.at) and Pamela Burger (pamela.burger@vetmeduni.ac.at).

Assembly: The assembly was generated with Flye v. 2.9 from one PacBio CLR library sequenced on one SMRTCell on a Sequel IIe. It included one iteration of long-read polishing followed by one iteration of short-read polishing with pilon v.1.23 using trimmed standard Illumina short reads generated on the Illumina Novaseq 6000 platform. Subsequently, the contigs of the polished assembly were anchored into chromosome-scale scaffolds with YaHS v.1.1 using publicly available Hi-C data for the cheetah (SRR8616936, SRR8616937) that were prepared following the Arima Hi-C mapping pipeline (https://github.com/VGP/vgp-assembly/blob/master/pipeline/salsa/arima_mapping_pipeline.sh). Finally, two iterations of gap-closing were performed with TGS-GapCloser v. 1.1.1 using a different random subset of PacBio reads (25%) for each iteration.

Annotation:

Repeat annotation: To improve gene prediction, we first identified and masked the repeats in the assembly. A de novo repeat library was generated with RepeatModeler v.2.0.1 and combined with a Felidae-specific repeat library from RepBase. This custom repeat library was then used with RepeatMasker v.4.1.0 to hard-mask interspersed repeats and soft-mask simple repeats.

Gene annotation: We predicted genes in the masked assembly based on homology using the GeMoMa pipeline v. 1.7.1 and the following reference assemblies and annotation files: Homo sapiens (GCF_000001405.40), Mus musculus (GCF_000001635.27), Lynx canadensis (GCF_007474595.2), Canis lupus familiaris (GCF_014441545.1), Prionailurus bengalensis (GCF_016509475.1), Leopardus geoffroyi (GCF_018350155.1), Felis catus (GCF_018350175.1), Panthera tigris (GCF_018350195.1), and Panthera leo (GCF_018350215.1). We functionally annotated the predicted proteins using InterProScan v.5.50.84 and a BLASTP v.2.11.0 search against the Swiss-Prot database (release 2021-02). For more details on assembly quality assessment and comparative analyses to other Felidae assemblies, please read the original manuscript.

This dataset comprises the following files:

VMU_Ajub_asm_v1.0.fasta (final unmasked assembly, also available at GenBank under accession GCA_027475565.1)
VMU_Ajub_asm_v1.0.fasta.masked (final assembly with all repeats hard-masked)
VMU_Ajub_asm_v1.0.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
VMU_Ajub_asm_v1.0.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta (final assembly with all interspersed repeats hard-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta (final assembly with all interspersed repeats hard-masked and simple repeats soft-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
consensi.fa.classified (de novo repeat library for the final assembly VMU_Ajub_asm_v1.0.fasta generated with RepeatModeler2)
Ajub_assembly_commands.txt (list of all commands used to generate the assembly and all related analyses)
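The description quotes contig and scaffold N50 values. As a reminder of what that statistic measures, here is a minimal sketch that computes N50 from sequence lengths read out of a FASTA file. It is not the authors' procedure (their commands are in Ajub_assembly_commands.txt); the helper names `sequence_lengths` and `n50` are hypothetical.

```python
from typing import Iterable, List


def sequence_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # None until the first '>' header has been seen
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequences given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return ordered[-1]  # unreachable for non-empty input, kept for clarity


toy = [">s1", "ACGTAC", ">s2", "GG"]
print(n50(sequence_lengths(toy)))
```

Applied to the records of VMU_Ajub_asm_v1.0.fasta, this would give a scaffold-level N50; a contig-level value would additionally require splitting scaffolds at runs of N, which the sketch does not do.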
Additional file 3: of A De-Novo Genome Analysis Pipeline (DeNoGAP) for...
springernature.figshare.com
xlsx
Updated May 30, 2023
Cite
Shalabh Thakur; David Guttman (2023). Additional file 3: of A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies [Dataset]. http://doi.org/10.6084/m9.figshare.c.3628187_D2.v1
Available download formats
xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3628187_D2.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare, http://figshare.com/
Authors
Shalabh Thakur; David Guttman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
List of genomes used for development and testing of the DeNoGAP pipeline. (XLSX 25 kb)
Data from: Globe230k: A Benchmark Dense-Pixel Annotation Dataset for Global...
zenodo.org
txt, zip
Updated Jul 7, 2024
Cite
Qian Shi; Da He; Zhengyu Liu; Xiaoping Liu; Jingqian Xue (2024). Globe230k: A Benchmark Dense-Pixel Annotation Dataset for Global Land Cover Mapping [Dataset]. http://doi.org/10.5281/zenodo.10435661
Available download formats
txt, zip
Unique identifier
https://doi.org/10.5281/zenodo.10435661
Dataset updated
Jul 7, 2024
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Qian Shi; Da He; Zhengyu Liu; Xiaoping Liu; Jingqian Xue
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We (Intelligent Mining and Analysis of Remote Sensing big data, IMARS) created a large-scale annotated dataset (Globe230k) for land use/land cover (LULC) mapping, annotated on Google Earth images at 1 m spatial resolution. Globe230k was annotated by numerous experts and by students majoring in surveying and mapping after the necessary training, through visual interpretation of very high-resolution images as well as in-situ field surveys, under the guidance of an organized annotation pipeline. Globe230k has three advantages:

1) Large scale: Globe230k includes 232,819 annotated images of size 512x512 at a spatial resolution of 1 m, with more than 3x10^10 annotated pixels, and covers 10 first-level categories.

2) Rich diversity: the annotated images are sampled from regions worldwide, with a coverage area of over 60,000 km², indicating high variability and diversity. In addition, to ensure category balance, rare categories such as wetland and ice/snow are intentionally given a higher chance of being sampled.

3) Multi-modal: Globe230k contains not only RGB bands but also other important features for Earth system research, such as the normalized difference vegetation index (NDVI), a digital elevation model (DEM), and vertical-vertical polarization (VV) and vertical-horizontal polarization (VH) bands, which can facilitate multi-modal data fusion research. Because the multi-modal data are large (DEM 1.91G, NDVI 164G, VVVH 372G), they are stored on Baidu Yunpan. The download link is https://pan.baidu.com/s/12AKbiqOXSf4fnm7mYkCE0g?pwd=230k and the extraction code is 230k.
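The scale figures quoted in points 1 and 2 can be cross-checked from the patch count, patch size, and resolution. The short calculation below does that; the area estimate assumes the patches do not overlap, which the description does not state.

```python
# Figures taken from the dataset description.
NUM_PATCHES = 232_819
PATCH_SIDE_PX = 512
RESOLUTION_M = 1.0  # metres per pixel

# Total annotated pixels across all patches.
total_pixels = NUM_PATCHES * PATCH_SIDE_PX ** 2

# Ground area, assuming non-overlapping patches (an assumption, not a stated fact).
area_km2 = total_pixels * RESOLUTION_M ** 2 / 1_000_000

print(f"{total_pixels:,} pixels, about {area_km2:,.0f} km^2")
```

Both results are consistent with the stated "more than 3x10^10 annotated pixels" and "over 60,000 km²".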

The image patches and their corresponding annotated patches are stored in "image_patch.zip" and "label_patch.zip", respectively. The RGB images are in ".jpg" format, 512x512 in size, with pixel values ranging from 0 to 255. The annotated patches are in ".png" format, also 512x512 in size, with pixel values ranging from 1 to 10, representing 1#cropland, 2#forest, 3#grass, 4#shrubland, 5#wetland, 6#water, 7#tundra, 8#impervious, 9#bareland, 10#ice/snow. The corresponding DEM, NDVI and VVVH patches are all in ".tif" format, 512x512 in size (because the DEM, NDVI and VVVH patches have different resolutions, they are all resized to the same scale as the image patch).
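The label encoding above maps directly onto a lookup table. The following sketch turns a flat sequence of label-pixel values into per-class fractions. Loading the ".png" into such a sequence is left out, since the choice of image library is not specified by the dataset; `GLOBE230K_CLASSES` and `class_fractions` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable

# Class IDs and names as listed in the dataset description.
GLOBE230K_CLASSES = {
    1: "cropland", 2: "forest", 3: "grass", 4: "shrubland", 5: "wetland",
    6: "water", 7: "tundra", 8: "impervious", 9: "bareland", 10: "ice/snow",
}


def class_fractions(label_pixels: Iterable[int]) -> Dict[str, float]:
    """Return the share of pixels per class name for one label patch."""
    counts = Counter(label_pixels)
    unknown = set(counts) - set(GLOBE230K_CLASSES)
    if unknown:
        raise ValueError(f"unexpected label values: {sorted(unknown)}")
    total = sum(counts.values())
    return {GLOBE230K_CLASSES[v]: n / total for v, n in sorted(counts.items())}


print(class_fractions([1, 1, 2, 6]))
```

Rejecting values outside 1 to 10 is a design choice of this sketch; it surfaces corrupted or differently encoded label files early instead of silently miscounting them.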

The 232,819 pairs are officially divided into a training set, a validation set, and a test set at a ratio of 7:1:2; the split can be found in the "train_num.txt", "val_num.txt" and "test_num.txt" files. Based on this division, the official baseline accuracy of several state-of-the-art semantic segmentation methods can be found in the related article (https://spj.science.org/doi/10.34133/remotesensing.0078).

We hope it can be used as a benchmark to promote further development of global land cover mapping and semantic segmentation algorithms.
A high-quality genome assembly for Dillenia turbinata (Dilleniales)
search.dataone.org
datadryad.org
Updated Nov 29, 2023
Cite
Andre Chanderbali; Xing Guo; Qiong Qiong Lin; Shi Xiao Luo; Huan Liu; Matthew Gitzendanner; Steven Smith; Douglas Soltis; Pamela Soltis (2023). A high-quality genome assembly for Dillenia turbinata (Dilleniales) [Dataset]. http://doi.org/10.5061/dryad.msbcc2g3j
Unique identifier
https://doi.org/10.5061/dryad.msbcc2g3j
Dataset updated
Nov 29, 2023
Dataset provided by
Dryad Digital Repository
Authors
Andre Chanderbali; Xing Guo; Qiong Qiong Lin; Shi Xiao Luo; Huan Liu; Matthew Gitzendanner; Steven Smith; Douglas Soltis; Pamela Soltis
Time period covered
May 19, 2023
Description
Objectives: Dillenia turbinata (Dilleniaceae) is a member of the order Dilleniales, an enigmatic clade of critical importance for understanding the diversification history of flowering plants but for which genome sequences are not available. We have produced and annotated a chromosome-scale whole genome assembly for D. turbinata through the resources of the 10KP (10,000 Plants) Genomes Project. The genome assembly and associated data provided here will serve as a useful resource for comparative and evolutionary genomics research across the flowering plants.

Data description: The D. turbinata genome was assembled from Oxford Nanopore Technology (ONT) and whole-genome shotgun (WGS) sequences, and scaffolded into chromosome-scale pseudomolecules using Hi-C data. The genome assembly is 723,739,077 base pairs in length with a BUSCO completeness score of 97%. Twenty-eight scaffolds contain more than 99% of the assembly. The repeat-masked genome sequence is annotated with 36,967 protein-codin...

Genome assembly and annotation: Raw nanopore reads in fastq format were assembled with Canu v2.2 (Koren et al. 2017) using an estimated genome size of 900 Mb to guide coverage parameters during the read correction, trimming, and assembly steps of the pipeline. The resulting primary assembly was polished with the WGS reads using NextPolish v1.3.1 (Hu et al. 2020), and duplicated constructs were removed by Purge Haplotigs (Roach et al. 2018). The set of deduplicated contigs was scaffolded on the basis of Hi-C reads using the Juicer pipeline (Durand et al. 2016) and 3d-dna tools (Dudchenko et al. 2017) with default parameters. Genome annotation was performed using the MAKER-P pipeline (Campbell et al. 2014) supplied with coding DNA sequences (CDS) from a Trinity (Grabherr et al. 2011) assembly of the Dillenia transcriptome reads, proteomes from four publicly available eudicot genomes (Arabidopsis thaliana, Aquilegia coerulea, Nelumbo nucifera, and Vitis vinifera), and a custom repeat library of tr...

The included data files may be opened with MS Word (Detailed Methods.docx), MS Excel (Dillenia.genome.assembly.stats.xlsx), standard image viewer software (Dillenia.BUSCO.summaries.png), and standard text editor programs (Dillenia.genome.fasta and Dillenia.maker.predict.36967.final.gff).
Genome and repeat annotation of the phased telomere-to-telomere assembly of...
zenodo.org
zip
Updated Aug 17, 2025
Cite
Hanane El Idrissi; Anestis Gkanogiannis; Driss Iraqi; Siham Khoulassa; Mohamed Fokar; Bouabid Badaoui; Rachid Moussadek; Rachid Mentag; Slimane Khayi (2025). Genome and repeat annotation of the phased telomere-to-telomere assembly of Moroccan argane tree (Argania spinosa) [Dataset]. http://doi.org/10.5281/zenodo.16017913
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.16017913
Dataset updated
Aug 17, 2025
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Hanane El Idrissi; Anestis Gkanogiannis; Driss Iraqi; Siham Khoulassa; Mohamed Fokar; Bouabid Badaoui; Rachid Moussadek; Rachid Mentag; Slimane Khayi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Morocco
Description
This dataset provides the full genome annotation supporting the publication:

"Phased T2T reference genome assembly of Moroccan Argane (Argania spinosa)"
Hanane El Idrissi, Anestis Gkanogiannis, Driss Iraqi, Siham Khoulassa, Mohamed Fokar, Bouabid Badaoui, Rachid Moussadek, Rachid Mentag and Slimane Khayi (2025)

We present the genome annotation files for the phased, telomere-to-telomere (T2T), chromosome-scale genome assembly of Argania spinosa, an ecologically and economically important tree endemic to Morocco. The assembly comprises two fully phased haplotypes, each organized into 11 pseudochromosomes.

This Zenodo entry includes:

Structural gene annotation (GFF3) generated using the Funannotate pipeline

Predicted protein sequences (FASTA)

Repeat annotations:

GFF3 files for simple, complex, and combined repeat annotations

Soft-masked genome FASTA (simple and complex repeats masked in lowercase)

Hard-masked genome FASTA (repeats replaced with Ns)
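The two masked FASTA variants listed above differ only in how repeats are marked: lowercase bases for soft-masking, Ns for hard-masking. The sketch below illustrates that relationship on a plain sequence string. It is an illustration of the convention, not the procedure used to build the files in this entry; `hard_mask` and `soft_masked_fraction` are hypothetical helpers.

```python
def hard_mask(soft_masked_seq: str) -> str:
    """Replace every soft-masked (lowercase) base with 'N'."""
    return "".join("N" if base.islower() else base for base in soft_masked_seq)


def soft_masked_fraction(soft_masked_seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase)."""
    if not soft_masked_seq:
        return 0.0
    masked = sum(1 for base in soft_masked_seq if base.islower())
    return masked / len(soft_masked_seq)


seq = "ACgtTTna"
print(hard_mask(seq), soft_masked_fraction(seq))
```

Soft-masking keeps the underlying sequence available to tools that can use it, while hard-masking discards it, so the conversion only works in the direction shown.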

Gene prediction was performed using transcript evidence (RNA-Seq from root and leaf tissues), protein homology from Ericales and SwissProt, and de novo ab initio models (AUGUSTUS, GeneMark-ES). Functional annotations were assigned using InterProScan, eggNOG, Pfam, and Gene Ontology databases. Repeat annotation was performed using RepeatModeler and RepeatMasker in a multi-round strategy incorporating both lineage-specific and de novo repeat libraries.

NCBI BioProject: PRJNA1223813
3D-Genomics Database
neuinfo.org
scicrunch.org
+2 more
Updated Oct 17, 2010
Cite
(2010). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
Unique identifier
https://identifiers.org/RRID:SCR_007430
Dataset updated
Oct 17, 2010
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is: to provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies.

The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG.

The following eukaryotic genomes are incorporated:

Anopheles gambiae, protein sequences from the mosquito genome
Arabidopsis thaliana, protein sequences from the Arabidopsis genome
Caenorhabditis briggsae, protein sequences from the C. briggsae genome
Caenorhabditis elegans, protein sequences from the worm genome
Ciona intestinalis, protein sequences from the sea squirt genome
Danio rerio, protein sequences from the zebrafish genome
Drosophila melanogaster, protein sequences from the fruitfly genome
Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome
Fugu rubripes, protein sequences from the pufferfish genome
Guillardia theta, protein sequences from the G. theta genome
Homo sapiens, protein sequences from the human genome
Mus musculus, protein sequences from the mouse genome
Neurospora crassa, protein sequences from the N. crassa genome
Oryza sativa, protein sequences from the rice genome
Plasmodium falciparum, protein sequences from the P. falciparum genome
Rattus norvegicus, protein sequences from the rat genome
Saccharomyces cerevisiae, protein sequences from the yeast genome
Schizosaccharomyces pombe, protein sequences from the yeast genome
OKI2018_I69 assembly and annotation of the genome of an individual...
zenodo.org
bin
Updated Jun 3, 2021
Cite
Aleksandra Bliznina (2021). OKI2018_I69 assembly and annotation of the genome of an individual Oikopleura dioica from Okinawa [Dataset]. http://doi.org/10.5281/zenodo.4604144
Available download formats
bin
Unique identifier
https://doi.org/10.5281/zenodo.4604144
Dataset updated
Jun 3, 2021
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Aleksandra Bliznina
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A chromosome-scale assembly of the Oikopleura dioica genome from Okinawa, Japan. The contig assembly was generated from long-read Nanopore data using the Canu pipeline v1.8, and polished with short Illumina MiSeq reads using Pilon v1.22. Both Nanopore and Illumina data were generated from DNA of a single O. dioica male. Hi-C chromosomal conformation capture data were used to order and orient the contigs into scaffolds using the Juicer v1.6 and 3D de novo assembly (3D-DNA) pipelines. The OKI2018_I69_1.0 assembly comprises 19 scaffolds with an N50 of 16.2 Mbp (OKI2018_I69_1.0.fa). The total assembly length is 64.3 Mbp. The five longest scaffolds represent autosomal chromosomes (chr 1 and chr 2), and sex chromosomes split into pseudo-autosomal region (PAR) and X-specific (XSR) or Y-specific (YSR) regions. One of the smaller scaffolds represents a draft assembly of the mitochondrial genome (chrUn_12). The remaining scaffolds are highly repetitive and were marked as unplaced. The OKI2018_I69_1.0 assembly was annotated with the AUGUSTUS v3.3 and MAKER v3.01.03 pipelines. Gene predictions from these tools were refined and merged using EvidenceModeler v1.1.1. To predict UTRs and alternative isoforms, the EVM models were updated using two rounds of the PASA pipeline, resulting in 18,485 transcript models distributed among 16,936 protein-coding genes (OKI2018_I69_1.0.gene_models.gff3).
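Counts such as the transcript and gene totals above can be checked against a GFF3 file by tallying the feature-type column. This is a minimal sketch that relies only on the GFF3 layout (nine tab-separated columns, type in column 3). Which type labels the file actually uses for genes and transcripts (for example `gene` and `mRNA`) is an assumption to verify; `count_feature_types` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff3_lines: Iterable[str]) -> Counter:
    """Tally column 3 (feature type) over the annotation lines of a GFF3 file."""
    counts: Counter = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


toy = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tsrc\tmRNA\t1\t700\t.\t+\t.\tID=t2;Parent=g1",
]
print(count_feature_types(toy))
```

On the real file, the same function could be called on an open file handle for OKI2018_I69_1.0.gene_models.gff3 and the relevant entries compared with the numbers in the description.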
Chromosome-scale genome assembly of the African spiny mouse (Acomys...
zenodo.org
bin
Updated Apr 1, 2023
Cite
Danny Miller (2023). Chromosome-scale genome assembly of the African spiny mouse (Acomys cahirinus) [Dataset]. http://doi.org/10.5281/zenodo.7761277
Available download formats
bin
Unique identifier
https://doi.org/10.5281/zenodo.7761277
Dataset updated
Apr 1, 2023
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Danny Miller
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Genomic DNA was extracted from blood from a single male A. cahirinus animal using a Monarch HMW DNA Extraction Kit for Cells & Blood (T3050, New England Biolabs, Ipswich MA) following the manufacturer’s recommended protocol. DNA was quantified prior to library construction using the Qubit DNA HS Assay (Thermo Fisher, Waltham MA), and DNA fragment lengths were assessed using the Agilent Femto Pulse System (Santa Clara, CA). Libraries were prepared for sequencing using the Oxford Nanopore ligation kit (SQK-LSK110) following the manufacturer’s instructions, except that DNA repair and A-tailing were performed for 30 min and the ligation was allowed to continue for 1 hr. Prepared libraries were quantified using a Qubit fluorometer, and 30 fmol of the library was loaded onto a Nanopore R9.4.1 flow cell on a PromethION running MinKNOW version 21.05.20. To increase output, the flow cell was washed after approximately 24 hr of sequencing, then an additional 12 fmol of library was added to the flow cell and run for an additional 48 hr. Basecalling was performed using Guppy 5.0.12 (Oxford Nanopore) with the super-accuracy ("sup") model (dna_r9.4.1_450bps_sup_prom.cfg).

FASTQ files for assembly were extracted from unaligned BAM files using samtools (Li et al. 2009) and assembled with Flye version 2.9 using the --nano-hq flag (Kolmogorov et al. 2019). Haplotigs and overlaps in the assembly were purged using purge_dups (https://github.com/dfguan/purge_dups). The assembly was then polished using Medaka version 1.4.2 (https://github.com/nanoporetech/medaka), followed by a second polishing step with Pilon version 1.24 (Walker et al. 2014). Assembly statistics at each step were generated using Quast (Gurevich et al. 2013) and BUSCO (Simão et al. 2015) (Table S2).

The primary contigs assembled from the Nanopore data were anchored to chromosomes using 505,210,505 read pairs of a Hi-C library isolated from another A. cahirinus individual of unknown sex, downloaded from the NCBI Short Read Archive (SRX13258644) (Wang et al. 2022). After aligning the Hi-C reads with the Arima Hi-C Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline), YaHS v1.0 (Zhou et al. 2023) was used with default error correction for scaffolding, and Juicebox v1.11.08 (Dudchenko et al. 2018) was used to generate a Hi-C contact map.

Progressive Cactus (Armstrong et al. 2020) was used to perform a whole-genome alignment of the A. cahirinus draft assembly to the Mus musculus GRCm39 reference genome (RefSeq GCF_000001635.27_GRCm39). Comparative annotation of the draft genomes was then performed using the Comparative Annotation Toolkit (CAT) (Fiddes et al. 2018). Briefly, the M. musculus RefSeq annotation GFF was parsed and validated with the “parse_ncbi_gff3” and “validate_gff3” programs, respectively, from CAT. The M. musculus reference transcript cDNA sequences were downloaded and mapped to the M. musculus draft genome with minimap2 (Li 2018) and provided to CAT as long-read RNA-seq reads in the “[ISO_SEQ_BAM]” field of the configuration file. For A. cahirinus, bulk RNA-seq data obtained from multiple pooled organs were downloaded from NCBI SRA BioProject PRJNA342864 (Bellofiore et al. 2017), mapped to the draft assembly with STAR (Dobin et al. 2013), and provided to CAT in the “[BAMS]” field. CpG islands were identified using the cpg_lh utility from the UCSC suite of tools (Kent et al. 2002).
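The description above reports assembly statistics generated with Quast at each step. One contiguity figure that Quast reports is N50, the contig length at which the longest contigs together cover at least half of the assembly. The following is a minimal sketch of that calculation, not the dataset authors' code; the function name `n50` and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    # Walk the contigs from longest to shortest, accumulating length.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines N50.
        if running >= half_total:
            return length


# Toy example: total length 54, so half is 27; 10 + 9 + 8 reaches 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

Comparing this value before and after purging, polishing, and scaffolding is one simple way to see how each step changes contiguity.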
Lizard dataset
kaggle.com
zip
Updated Nov 27, 2021
Cite
Aadam (2021). Lizard dataset [Dataset]. https://www.kaggle.com/datasets/aadimator/lizard-dataset
Explore at:
Available download formats: zip (786544364 bytes)
Dataset updated
Nov 27, 2021
Authors
Aadam
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, there is a major bottleneck in the success of such approaches because supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in the field of CPath because the generation of detailed annotations usually demands the input of a pathologist to be able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. Yet, solely relying on automatic generation of annotations will limit the accuracy and reliability of ground truth. Therefore, to help overcome the above challenges, we propose a multi-stage annotation pipeline to enable the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We will publish the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.

Link to the dataset paper.

Citation

@inproceedings{graham2021lizard, title={Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification}, author={Graham, Simon and Jahanifar, Mostafa and Azam, Ayesha and Nimir, Mohammed and Tsang, Yee-Wah and Dodd, Katherine and Hero, Emily and Sahota, Harvir and Tank, Atisha and Benes, Ksenija and others}, booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, pages={684--693}, year={2021} }

Acknowledgements

We would like to acknowledge the following institutions, where the images in this dataset originated from:

University Hospitals Coventry and Warwickshire, United Kingdom

Histo Pathology Diagnostic Center, Shanghai, China

Ruijin Hospital, Shanghai, China

Xijing Hospital, Xi'an, China

Shanghai Songjiang District Central Hospital, Shanghai, China

The National Cancer Institute (NCI), United States of America
Additional file 3: of Genome-wide sequencing and metabolic annotation of...
datasetcatalog.nlm.nih.gov
springernature.figshare.com
Updated Jun 29, 2019
Cite
Riaño-Pachón, Diego; Fernandes, Bruna; Zaiat, Marcelo; Pradella, José; Resende, Tiago; Rocha, Isabel; Dias, Oscar; Neto, Antonio Kaupert; Costa, Gisela; Oliveira, Juliana (2019). Additional file 3: of Genome-wide sequencing and metabolic annotation of Pythium irregulare CBS 494.86: understanding Eicosapentaenoic acid production [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000168537
Explore at:
Dataset updated
Jun 29, 2019
Authors
Riaño-Pachón, Diego; Fernandes, Bruna; Zaiat, Marcelo; Pradella, José; Resende, Tiago; Rocha, Isabel; Dias, Oscar; Neto, Antonio Kaupert; Costa, Gisela; Oliveira, Juliana
Description
Pipeline classification - Pipeline classification for annotation and reconstruction of genome-scale metabolic models, established according to dataset analysis. (XLSX 1790 kb)
Data from: Aphidinae comparative genomics resource
nde-dev.biothings.io
data.niaid.nih.gov
Updated Jul 17, 2024
Cite
Mathers, Thomas C (2024). Aphidinae comparative genomics resource [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_5908004
Explore at:
Dataset updated
Jul 17, 2024
Dataset provided by
Wouters, Roland H M
Swarbreck, David
Van Oosterhout, Cock
Mugford, Sam T
Botha, Anna-Maria
Hogenhout, Saskia A
Heavens, Darren
Mathers, Thomas C
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here we provide early access to 18 new genome assemblies, including 8 assembled to chromosome-scale, for aphids from the subfamily Aphidinae. For consistency and to aid comparative analysis, all genomes have been annotated using the same repeat masking and RNA-seq-based gene prediction pipeline. Using this pipeline we also provide new annotations for three previously published genome assemblies.

The genome assemblies and annotations are made freely available without restriction, we only request that this Zenodo resource is cited when using the data. Raw sequence data upload to NCBI is underway and full details of all accessions will be given in an updated version of this resource. Manuscripts are in preparation describing the individual genome assemblies in detail and larger comparative genome analyses and we will update this resource with additional citation information as papers are published.

Full details of all genome assemblies and annotations included in this release are given in the attached "Data_Description.pdf" document.

Aphid species included in this release (bold type = chromosome-scale assembly):

Aphis fabae
Aphis glycines (updated annotation)
Aphis gossypii
Aphis thalictri
Aphis rumicis
Brachycaudus cardui
Brachycaudus helichrysi
Brachycaudus klugkisti
Brevicoryne brassicae
Diuraphis noxia
Macrosiphum albifrons
Metopolophium dirhodum
Myzus cerasi (updated annotation)
Myzus ligustri
Myzus lythri
Myzus varians
Pentalonia nigronervosa (updated annotation)
Phorodon humuli
Rhopalosiphum padi
Sitobion avenae
Sitobion miscanthi
LifeDB
neuinfo.org
scicrunch.org
Updated Jun 30, 2024
Cite
(2024). LifeDB [Dataset]. http://identifiers.org/RRID:SCR_006899
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006899
Dataset updated
Jun 30, 2024
Description
Database that integrates large-scale functional genomics assays and manual cDNA annotation with bioinformatics gene expression and protein analysis. LifeDB integrates data on full-length cDNA clones with data on the expression of the encoded proteins and their subcellular localization in a mammalian cell line. It lets the scientific community systematically search for and select genes, proteins, and cDNAs of interest by specific database identifiers or by gene name, and it visualizes cDNA clones and the subcellular location of proteins. It also links results to external biological databases to provide broader functional information. LifeDB also provides an annotation pipeline that improves the mapping of clones to known human reference transcripts from the RefSeq and Ensembl databases. An advanced web interface lets researchers view the data in a more user-friendly manner. Both search modes, "Search genes and cDNA clones" and "Search sub-cellular locations of human proteins", offer the same options: by keyword, by gene/transcript identifier, by plate name, by clone name, and by cellular location.

The "Search genes and cDNA clones" results include: Gene Name, Ensemble ID, Genomic Region, Clone name, Plate name, Plate position, Classification class, Synonymous SNPs, Non-synonymous SNPs, Number of ambiguous positions, and Alignment with reference genes.

The "Search sub-cellular locations of human proteins" results include: Subcellular location, Gene Name, Ensemble ID, Clone name, True localization, Images, Start tag, and End tag.

Every result page has an option to download the result data (excluding the microscopy images). Clicking the "Download results as CSV-file" link on the result page lets the user open or save the result data as a CSV (comma-separated values) file, which can then be opened in Excel or OpenOffice.
Robust subset of annotations from the IBD dataset.
figshare.com
xls
Updated Dec 2, 2024
Cite
Baptiste Ruiz; Arnaud Belcour; Samuel Blanquart; Sylvie Buffet-Bataillon; Isabelle Le Huërou-Luron; Anne Siegel; Yann Le Cunff (2024). Robust subset of annotations from the IBD dataset. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012577.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1012577.t003
Dataset updated
Dec 2, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Baptiste Ruiz; Arnaud Belcour; Samuel Blanquart; Sylvie Buffet-Bataillon; Isabelle Le Huërou-Luron; Anne Siegel; Yann Le Cunff
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Robust subset of annotations from the IBD dataset.
MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation –...
zenodo.org
bin, zip
Updated Nov 10, 2025
Cite
Marco Cremaschi; Federico Belotti; Jennifer D'Souza; Matteo Palmonari (2025). MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation – Training, Testing, and Detecting Weaknesses [Dataset]. http://doi.org/10.5281/zenodo.16562700
Explore at:
Available download formats: bin, zip
Unique identifier
https://doi.org/10.5281/zenodo.16562700
Dataset updated
Nov 10, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marco Cremaschi; Federico Belotti; Jennifer D'Souza; Matteo Palmonari
License
https://www.gnu.org/licenses/agpl.txt
Description
MammoTab 25 is a large‑scale, richly‑annotated benchmark designed to advance research on Semantic Table Interpretation (STI) and to evaluate the reasoning abilities of modern Large Language Models (LLMs).

Scale and origin – The corpus contains 838930 tables automatically extracted from 63 million English‑language Wikipedia pages.

Comprehensive annotations – Every table is accompanied by:

Cell-Entity Annotation (CEA),

Column-Type Annotation (CTA),

Columns-Property Annotation (CPA),

Four ready‑to‑use prompt templates for LLM training and stress‑testing,

Fine‑grained metadata capturing column roles (Named‑Entity vs Literal), NIL flags, header/caption context, and structural statistics.

Challenge coverage – Tagged metadata enables users to isolate and diagnose all key STI challenges, including multi-domain tables, acronyms, aliases, typos, approximate numeric values, and true NIL mentions, making the dataset suitable for both benchmarking and error analysis.

Format & access – Tables are stored as CSV files; annotations are provided in separate CSVs following the SemTab format; contextual information is packed in JSON side‑cars.

The pipeline for regenerating the dataset is openly available on GitHub at https://github.com/unimib-datAI/mammotab/.

The documentation is available at https://unimib-datai.github.io/mammotab-docs/.
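The "Format & access" note above says annotations ship as separate CSVs in the SemTab format. The following is a minimal sketch of reading a Cell-Entity Annotation (CEA) file under the assumption that each row holds a table id, a row index, a column index, and an entity URI, in that order. That column order, the helper name `load_cea`, and the sample rows are illustrative assumptions, so check the dataset documentation for the actual layout.

```python
import csv
import io
from collections import defaultdict


def load_cea(handle):
    """Group cell-entity annotations by table id (hypothetical helper).

    Assumes headerless rows of: table_id, row, column, entity_uri.
    """
    by_table = defaultdict(dict)
    for table_id, row, col, entity in csv.reader(handle):
        # Key each annotation by its (row, column) cell coordinates.
        by_table[table_id][(int(row), int(col))] = entity
    return dict(by_table)


# Toy input standing in for an annotation file opened with open(path, newline="").
sample = io.StringIO(
    '"T1","1","0","http://www.wikidata.org/entity/Q64"\n'
    '"T1","2","0","http://www.wikidata.org/entity/Q90"\n'
)
cea = load_cea(sample)
print(cea["T1"][(1, 0)])
```

Keying annotations by cell coordinates makes it straightforward to join them with the table CSVs, which are stored separately.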
Structural Annotation of Mycobacterium tuberculosis Proteome
plos.figshare.com
tiff
Updated May 30, 2023
Cite
Praveen Anand; Sandhya Sankaran; Sumanta Mukherjee; Kalidas Yeturu; Roman Laskowski; Anshu Bhardwaj; Raghu Bhagavat; Samir K. Brahmachari; Nagasuma Chandra (2023). Structural Annotation of Mycobacterium tuberculosis Proteome [Dataset]. http://doi.org/10.1371/journal.pone.0027044
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0027044
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Praveen Anand; Sandhya Sankaran; Sumanta Mukherjee; Kalidas Yeturu; Roman Laskowski; Anshu Bhardwaj; Raghu Bhagavat; Samir K. Brahmachari; Nagasuma Chandra
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Of the ∼4000 ORFs identified through the genome sequence of Mycobacterium tuberculosis (TB) H37Rv, experimentally determined structures are available for 312. Since knowledge of protein structures is essential to obtain a high-resolution understanding of the underlying biology, we seek to obtain a structural annotation for the genome, using computational methods. Structural models were obtained and validated for ∼2877 ORFs, covering ∼70% of the genome. Functional annotation of each protein was based on fold-based functional assignments and a novel binding site based ligand association. New algorithms for binding site detection and genome scale binding site comparison at the structural level, recently reported from the laboratory, were utilized. Besides these, the annotation covers detection of various sequence and sub-structural motifs and quaternary structure predictions based on the corresponding templates. The study provides an opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 folds. New insights about the folds that predominate in the genome, as well as the fold-combinations that make up multi-domain proteins are also obtained. 1728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses. The resource (http://proline.physics.iisc.ernet.in/Tbstructuralannotation), being one of the first to be based on structure-derived functional annotations at a genome scale, is expected to be useful for better understanding of TB and for application in drug discovery. The reported annotation pipeline is fairly generic and can be applied to other genomes as well.
ReasonPlan_PDR
huggingface.co
Updated May 26, 2025
Cite
Liu Xueyi (2025). ReasonPlan_PDR [Dataset]. https://huggingface.co/datasets/LiuxyIA/ReasonPlan_PDR
Explore at:
Dataset updated
May 26, 2025
Authors
Liu Xueyi
Description
Dataset Card for Dataset Name

PDR is the official dataset used in paper https://huggingface.co/papers/2505.20024.

Dataset Details

PDR is a large-scale instruction dataset tailored for closed-loop planning, which contains 203,353 training samples and 11,047 testing samples. Using an automated annotation pipeline, PDR captures the complete decision reasoning process in training scenarios on the Bench2Drive, including the following stages: Scene Understanding, Traffic Sign… See the full description on the dataset page: https://huggingface.co/datasets/LiuxyIA/ReasonPlan_PDR.
FANTASIA V3 - LookUp Table - UniProt July 2025 - Experimental Evidence code
zenodo.org
csv, tar
Updated Jul 29, 2025
Cite
Francisco M. Perez-Canales; Ana Rojas (2025). FANTASIA V3 - LookUp Table - UniProt July 2025 - Experimental Evidence code [Dataset]. http://doi.org/10.5281/zenodo.16582433
Explore at:
Available download formats: csv, tar
Unique identifier
https://doi.org/10.5281/zenodo.16582433
Dataset updated
Jul 29, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Francisco M. Perez-Canales; Ana Rojas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
📘 FANTASIA V3 - LookUp Table (UniProt July 2025)

Experimental Evidence Code Only

Overview

This is a PostgreSQL database backup using the pgvector extension to store high-dimensional protein embeddings. It contains precomputed embeddings and functional annotations from the UniProt July 2025 release, including only entries supported by experimental evidence.

This lookup table was generated using version v2.0.0 of the Protein Information System (PIS), an integrated biological information system designed for the automated extraction, processing, and management of protein-related data. PIS consolidates information from UniProt, PDB, and GOA, allowing efficient retrieval and organization of sequences, structures, and annotations.

The resulting database is designed for compatibility with FANTASIA V3, an advanced pipeline for large-scale functional annotation of proteins using state-of-the-art Protein Language Models (PLMs). While the lookup table is stored in a vector database for persistence, FANTASIA loads the relevant data into memory at runtime to enable high-speed annotation.

FANTASIA uses precomputed deep learning embeddings to perform nearest-neighbor searches in embedding space and transfer Gene Ontology (GO) terms from experimentally annotated proteins to query sequences.
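The nearest-neighbor transfer described above can be illustrated with a small, self-contained sketch. This is not FANTASIA's implementation. The cosine-similarity metric, the `k` parameter, the in-memory list layout, and the function names are all assumptions made for illustration, and the two-dimensional vectors are toy stand-ins for real protein embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_go_terms(query, lookup, k=1):
    """Collect GO terms from the k reference proteins closest to the query.

    lookup is a list of (embedding, set_of_go_terms) pairs (assumed layout).
    """
    ranked = sorted(
        lookup,
        key=lambda entry: cosine_similarity(query, entry[0]),
        reverse=True,
    )
    terms = set()
    for _, go_terms in ranked[:k]:
        terms |= set(go_terms)
    return terms


# Toy lookup table: two reference proteins with made-up embeddings.
lookup = [
    ([1.0, 0.0], {"GO:0003677"}),
    ([0.0, 1.0], {"GO:0005524"}),
]
print(transfer_go_terms([0.9, 0.1], lookup))  # prints {'GO:0003677'}
```

A production system would typically use a vector index rather than sorting every reference entry per query; the linear scan here only keeps the example short.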

Dataset Details

Total proteins: 127,546

Total sequences: 124,397

Total embeddings: 621,849

Total GO annotations: 627,932

Included evidence codes (Gene Ontology, experimental only):

EXP – Inferred from Experiment

IDA – Inferred from Direct Assay

IPI – Inferred from Physical Interaction

IMP – Inferred from Mutant Phenotype

IGI – Inferred from Genetic Interaction

IEP – Inferred from Expression Pattern

TAS – Traceable Author Statement

IC – Inferred by Curator

Included Embedding Models

ESM-2 (650M parameters)
A transformer-based protein language model trained on UniRef50 using masked language modeling. It captures structural and functional features directly from raw sequences without requiring MSAs. ESM-2 is widely used for contact map prediction, unsupervised learning, and representation extraction.

ProtT5-XL-UniRef50 (~1.2B parameters)
A large-scale encoder-decoder model using the T5 architecture, trained on UniRef50 via masked span prediction. It generates high-dimensional sequence representations that perform well across structure and function prediction tasks.

ProstT5 (~1.2B parameters)
A multi-modal extension of ProtT5, trained to predict both sequence and coarse-grained 3Di structural states. Useful for downstream applications like contact prediction, functional annotation, and classification.

Ankh3-Large (620M parameters)
An encoder-only T5-style model trained with masked span prediction. Optimized for fast inference, it encodes both semantic and structural protein information and can replace ProtT5 in many ML pipelines.

ESM3c (Cambrian 600M)
Part of the new ESM C model family, trained on UniRef, MGnify, and JGI datasets. With rotary embeddings and 36 layers, it offers enhanced performance for masked language modeling, producing high-quality structural and functional embeddings without alignments.

Missing Proteins

A small subset of proteins could not be processed on the Finisterrae III (CESGA) supercomputer due to memory limitations with 40 GB A100 GPUs.

The file missing_proteins.csv lists all affected UniProt identifiers. These entries are excluded from the final lookup table.

Cite

Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy (2025). Functional annotation for 15 diverse arthropod genomes [Dataset]. http://doi.org/10.15482/USDA.ADC/1522860

Data from: Functional annotation for 15 diverse arthropod genomes

Explore at:

application/x-gzipAvailable download formats

Unique identifier

https://doi.org/10.15482/USDA.ADC/1522860

Dataset updated

Nov 22, 2025

Dataset provided by

Ag Data Commons

Authors

Surya Saha; Amanda M. Cooksey; Anna K. Childers; Monica F. Poelchau; Fiona M. McCarthy

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

We present the annotation results of 15 arthropod proteomes using an open source, open access and containerized pipeline for genome-scale functional annotation of insect proteomes and apply it to a diverse range of arthropod species. You can find more information about the pipeline at our readthedocs site. The files for each genome include GOanna, InterproScan and KOBAS predictions. Arthropod genomes selected for this study and their assembly and annotation statistics.

Apis mellifera (honey bee)
Drosophila melanogaster (fruit fly)
Tribolium castaneum (red flour beetle)
Latrodectus hesperus (Western black widow spider)
Limnephilus lunatus (caddisfly)
Oncopeltus fasciatus (Large milkweed bug)
Homalodisca vitripennis (Glassy-winged sharpshooter)
Eurytemora affinis (calanoid copepod)
Agrilus planipennis (emerald ash borer)
Copidosoma floridanum (parasitoid wasp)
Athalia rosae (turnip sawfly)
Ceratitis capitata (Mediterranean fruit fly)
Cimex lectularius (Cimicidae bed bug)
Varroa destructor (parasitic mite)
Diaphorina citri (Asian citrus psyllid)

Resources in this dataset:

Resource Title: Cimex lectularius (Cimicidae bed bug) annotation. File Name: CLEC.tar.gz. Resource Description: Functional annotation for Clec-OGSv1.2 protein set
Resource Title: Tribolium castaneum (red flour beetle) annotation. File Name: TCAS.tar.gz. Resource Description: Functional annotation for TCAS_OGS_v3 protein set
Resource Title: Drosophila melanogaster (fruit fly) annotation. File Name: DMEL.tar.gz. Resource Description: Functional annotation for DMEL_r6.38 protein set
Resource Title: Varroa destructor (parasitic mite) annotation. File Name: VDES.tar.gz. Resource Description: Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1)
Resource Title: Oncopeltus fasciatus (Large milkweed bug) annotation. File Name: ONCFAS.tar.gz. Resource Description: Functional annotation for oncfas_OGSv1.2 protein set
Resource Title: Apis mellifera (honey bee) annotation. File Name: AMEL.tar.gz. Resource Description: Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1)
Resource Title: Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File Name: HVIT.tar.gz. Resource Description: Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1)
Resource Title: Limnephilus lunatus (caddisfly) annotation. File Name: LLUN.tar.gz. Resource Description: Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1)
Resource Title: Latrodectus hesperus (Western black widow spider) annotation. File Name: LHES.tar.gz. Resource Description: Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1)
Resource Title: Eurytemora affinis (calanoid copepod) annotation. File Name: EAFF.tar.gz. Resource Description: Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1)
Resource Title: Copidosoma floridanum (parasitoid wasp) annotation. File Name: CFLO.tar.gz. Resource Description: Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1)
Resource Title: Ceratitis capitata (Mediterranean fruit fly) annotation. File Name: CCAP.tar.gz. Resource Description: Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2)
Resource Title: Athalia rosae (turnip sawfly) annotation. File Name: AROS.tar.gz. Resource Description: Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1)
Resource Title: Agrilus planipennis (emerald ash borer) annotation. File Name: APLA.tar.gz. Resource Description: Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1)
