CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present functional annotations for 15 arthropod proteomes, generated with an open-source, open-access, containerized pipeline for genome-scale functional annotation of insect proteomes and applied here to a diverse range of arthropod species. You can find more information about the pipeline at our readthedocs site. The files for each genome include GOanna, InterProScan and KOBAS predictions.
Arthropod genomes selected for this study and their assembly and annotation statistics.
Apis mellifera (honey bee)
Drosophila melanogaster (fruit fly)
Tribolium castaneum (red flour beetle)
Latrodectus hesperus (Western black widow spider)
Limnephilus lunatus (caddisfly)
Oncopeltus fasciatus (Large milkweed bug)
Homalodisca vitripennis (Glassy-winged sharpshooter)
Eurytemora affinis (calanoid copepod)
Agrilus planipennis (emerald ash borer)
Copidosoma floridanum (parasitoid wasp)
Athalia rosae (turnip sawfly)
Ceratitis capitata (Mediterranean fruit fly)
Cimex lectularius (Cimicidae bed bug)
Varroa destructor (parasitic mite)
Diaphorina citri (Asian citrus psyllid)
Resources in this dataset:
Resource Title: Cimex lectularius (Cimicidae bed bug) annotation. File Name: CLEC.tar.gz. Resource Description: Functional annotation for Clec-OGSv1.2 protein set.
Resource Title: Tribolium castaneum (red flour beetle) annotation. File Name: TCAS.tar.gz. Resource Description: Functional annotation for TCAS_OGS_v3 protein set.
Resource Title: Drosophila melanogaster (fruit fly) annotation. File Name: DMEL.tar.gz. Resource Description: Functional annotation for DMEL_r6.38 protein set.
Resource Title: Varroa destructor (parasitic mite) annotation. File Name: VDES.tar.gz. Resource Description: Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1).
Resource Title: Oncopeltus fasciatus (Large milkweed bug) annotation. File Name: ONCFAS.tar.gz. Resource Description: Functional annotation for oncfas_OGSv1.2 protein set.
Resource Title: Apis mellifera (honey bee) annotation. File Name: AMEL.tar.gz. Resource Description: Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1).
Resource Title: Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File Name: HVIT.tar.gz. Resource Description: Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1).
Resource Title: Limnephilus lunatus (caddisfly) annotation. File Name: LLUN.tar.gz. Resource Description: Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1).
Resource Title: Latrodectus hesperus (Western black widow spider) annotation. File Name: LHES.tar.gz. Resource Description: Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1).
Resource Title: Eurytemora affinis (calanoid copepod) annotation. File Name: EAFF.tar.gz. Resource Description: Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1).
Resource Title: Copidosoma floridanum (parasitoid wasp) annotation. File Name: CFLO.tar.gz. Resource Description: Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1).
Resource Title: Ceratitis capitata (Mediterranean fruit fly) annotation. File Name: CCAP.tar.gz. Resource Description: Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2).
Resource Title: Athalia rosae (turnip sawfly) annotation. File Name: AROS.tar.gz. Resource Description: Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1).
Resource Title: Agrilus planipennis (emerald ash borer) annotation. File Name: APLA.tar.gz. Resource Description: Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1).
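Each resource is distributed as a gzipped tar archive. As a minimal sketch of how one might inspect such an archive before extracting it, the Python snippet below lists the regular files it contains. The helper name list_archive_members is hypothetical, and the internal layout of the archives is not documented in this record, so the snippet only reports what it finds.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the sorted names of regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


if __name__ == "__main__":
    # "CLEC.tar.gz" is one of the file names listed in this record; the
    # contents of the archive are not described, so nothing is assumed here.
    archive = Path("CLEC.tar.gz")
    if archive.exists():
        for name in list_archive_members(archive):
            print(name)
```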
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary materials for the revised version of the PeerJ preprint "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains gene annotation datasets for Culicoides sonorensis and Culicoides stellifer (Diptera: Ceratopogonidae), two biting midge species of veterinary significance. Culicoides sonorensis is a confirmed vector of Bluetongue and Epizootic Hemorrhagic Disease viruses in North America, while C. stellifer has been implicated in Orbivirus transmission in the southeastern United States. The reference genome assemblies for these species are publicly available at NCBI under accession numbers GCA_047716325.1 (C. sonorensis) and GCA_040583785.1 (C. stellifer).
Gene models for both assemblies were predicted using the EGAPx-alpha pipeline (egapx:0.3.2-alpha). The same C. sonorensis RNA-Seq dataset (ERR2171964 – ERR2171978) was used as transcriptomic evidence for model training in both species, because no C. stellifer RNA-Seq data are currently available (last checked: 17/11/2025). The dataset includes gene model coordinates (GFF), transcript nucleotide sequences (FNA), and predicted protein sequences (FAA).
These curated annotation datasets were generated to support the forthcoming manuscript: Chromosome-scale genome of Culicoides brevitarsis highlights genetic basis of vector competency. They provide consistent annotation resources for cross-species comparative analyses among Culicoides midges.
Lineage:
Culicoides sonorensis:
Genome assembly: GCA_047716325.1 (idCulSono.KS.ABADRU.1.0.female)
Isolate: Kansas colony
WGS project: JBLLJK01
Submitter: Ag100Pest Initiative (USDA-ARS ABADRU)
Release date: 12 Feb 2025
Annotation pipeline: EGAPx-alpha (egapx:0.3.2-alpha)
Transcript evidence: C. sonorensis RNA-Seq ERR2171964 – ERR2171978
Culicoides stellifer:
Genome assembly: GCA_040583785.1 (c_stellifer_primary030_purged)
Submitter: University of Guelph
Release date: 10 Jul 2024
Annotation pipeline: EGAPx-alpha (egapx:0.3.2-alpha)
Transcript evidence: Shared C. sonorensis RNA-Seq dataset (ERR2171964 – ERR2171978) used for training and gene structure validation
https://spdx.org/licenses/CC0-1.0.html
The cheetah (Acinonyx jubatus, SCHREBER 1775) is a large felid and is considered the fastest land animal. Historically, it inhabited open grassland across Africa, the Arabian Peninsula, and southwestern Asia; however, only small and fragmented populations remain today. Here, we present a de novo genome assembly of the cheetah based on PacBio continuous long reads and Hi-C proximity ligation data. The final assembly (VMU_Ajub_asm_v1.0) has a total length of 2.38 Gb, of which 99.7% is anchored into the expected 19 chromosome-scale scaffolds. The contig and scaffold N50 values of 96.8 Mb and 144.4 Mb, respectively, a BUSCO completeness of 95.4%, and a k-mer completeness of 98.4% emphasize the high quality of the assembly. Furthermore, annotation of the assembly identified 23,622 genes and a repeat content of 40.4%. This new highly contiguous and chromosome-scale assembly will greatly benefit conservation and evolutionary genomic analyses and will be a valuable resource, e.g., to gain a detailed understanding of the function and diversity of immune response genes in felids.
Methods
The presented data relate to the eponymous publication "A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus)", soon to be published in the Journal of Heredity. Any questions regarding this dataset or the publication can be addressed to the corresponding authors, Sven Winter (sven.winter@vetmeduni.ac.at) and Pamela Burger (pamela.burger@vetmeduni.ac.at).
Assembly: The assembly was generated from one PacBio CLR library sequenced on one SMRT Cell on a Sequel IIe using Flye v.2.9, including one iteration of long-read polishing, followed by one iteration of short-read polishing with Pilon v.1.23 using trimmed standard Illumina short reads generated on the Illumina NovaSeq 6000 platform. Subsequently, the contigs of the polished assembly were anchored into chromosome-scale scaffolds with YaHS v.1.1 using publicly available Hi-C data for the cheetah (SRR8616936, SRR8616937) that were prepared following the Arima Hi-C mapping pipeline (https://github.com/VGP/vgp-assembly/blob/master/pipeline/salsa/arima_mapping_pipeline.sh). Finally, two iterations of gap-closing were performed with TGS-GapCloser v.1.1.1 using a different random subset of PacBio reads (25%) for each iteration.
Annotation:
Repeat annotation: To improve gene prediction, we first identified and masked the repeats in the assembly. A de novo repeat library was generated with RepeatModeler v.2.0.1 and combined with a Felidae-specific repeat library from RepBase. This custom repeat library was then used with RepeatMasker v.4.1.0 to hard-mask interspersed repeats and soft-mask simple repeats.
Gene annotation: We predicted genes in the masked assembly based on homology using the GeMoMa pipeline v.1.7.1 and the following reference assemblies and annotation files: Homo sapiens (GCF_000001405.40), Mus musculus (GCF_000001635.27), Lynx canadensis (GCF_007474595.2), Canis lupus familiaris (GCF_014441545.1), Prionailurus bengalensis (GCF_016509475.1), Leopardus geoffroyi (GCF_018350155.1), Felis catus (GCF_018350175.1), Panthera tigris (GCF_018350195.1), and Panthera leo (GCF_018350215.1). We functionally annotated the predicted proteins using InterProScan v.5.50.84 and a BLASTP v.2.11.0 search against the Swiss-Prot database (release 2021-02). For more details on assembly quality assessment and comparative analyses to other Felidae assemblies, please read the original manuscript.
This dataset comprises the following files:
VMU_Ajub_asm_v1.0.fasta (final unmasked assembly, also available at GenBank under accession GCA_027475565.1)
VMU_Ajub_asm_v1.0.fasta.masked (final assembly with all repeats hard-masked)
VMU_Ajub_asm_v1.0.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
VMU_Ajub_asm_v1.0.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta (final assembly with all interspersed repeats hard-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta (final assembly with all interspersed repeats hard-masked and simple repeats soft-masked)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
consensi.fa.classified (de novo repeat library for the final assembly VMU_Ajub_asm_v1.0.fasta generated with RepeatModeler2)
Ajub_assembly_commands.txt (list of all commands used to generate the assembly and all related analyses)
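The contig and scaffold N50 values quoted in this record follow the usual definition: the length L such that sequences of length L or longer together cover at least half of the total assembly length. The sketch below shows that calculation in Python on FASTA input. It is an illustration only, not the authors' workflow (their commands are in Ajub_assembly_commands.txt), and the function names fasta_lengths and n50 are hypothetical.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = 0
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: store the length of the previous one.
            if seen_header:
                lengths.append(current)
            seen_header = True
            current = 0
        elif line:
            current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")
```

For a real assembly one would pass an open file handle, for example `n50(fasta_lengths(open("VMU_Ajub_asm_v1.0.fasta")))`, which gives the scaffold N50 in base pairs.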
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of genomes used for development and testing of the DeNoGAP pipeline. (XLSX 25 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We (Intelligent Mining and Analysis of Remote Sensing big data, IMARS) created a large-scale annotated dataset (Globe230k) for land use/land cover (LULC) mapping, annotated on Google Earth imagery at 1 m spatial resolution. Globe230k was annotated by experts and by students majoring in surveying and mapping, after the necessary training, through visual interpretation of very high-resolution images as well as in-situ field surveys, following an organized annotation pipeline. Globe230k has three advantages:
1) Large scale: Globe230k includes 232,819 annotated images of size 512×512 at 1 m spatial resolution, with more than 3×10^10 annotated pixels, and it covers 10 first-level categories.
2) Rich diversity: the annotated images are sampled from regions worldwide, covering an area of over 60,000 km², which gives high variability and diversity. In addition, to keep the categories balanced, we intentionally sampled rare categories such as wetland and ice/snow more often.
3) Multi-modal: Globe230k contains not only RGB bands but also other important features for Earth system research, such as the normalized difference vegetation index (NDVI), digital elevation model (DEM), vertical-vertical polarization (VV) bands, and vertical-horizontal polarization (VH) bands, which can facilitate research on multi-modal data fusion. Because of the large size of the multi-modal data (DEM 1.91 GB, NDVI 164 GB, VVVH 372 GB), these datasets are stored on Baidu Yunpan. The download link is https://pan.baidu.com/s/12AKbiqOXSf4fnm7mYkCE0g?pwd=230k and the extraction code is 230k.
The image patches and their corresponding annotated patches are stored in "image_patch.zip" and "label_patch.zip", respectively. The RGB images are in ".jpg" format, with a size of 512×512 and pixel values ranging from 0 to 255. The annotated patches are in ".png" format, also with a size of 512×512, and pixel values ranging from 1 to 10, which represent: 1 = cropland, 2 = forest, 3 = grass, 4 = shrubland, 5 = wetland, 6 = water, 7 = tundra, 8 = impervious, 9 = bareland, 10 = ice/snow. The corresponding DEM, NDVI and VVVH patches are all in ".tif" format, with a size of 512×512 (because the DEM, NDVI and VVVH data have different native resolutions, they are all resized to the same scale as the image patch).
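The class codes listed above can be turned into a small lookup table when summarizing a label patch. The Python sketch below computes the fraction of pixels per class from a flat sequence of label values. The names GLOBE230K_CLASSES and class_fractions are hypothetical, and how the pixel values are read from the ".png" file is left to whichever image library is in use.

```python
from collections import Counter

# Class codes as described for the annotated patches (pixel values 1-10).
GLOBE230K_CLASSES = {
    1: "cropland",
    2: "forest",
    3: "grass",
    4: "shrubland",
    5: "wetland",
    6: "water",
    7: "tundra",
    8: "impervious",
    9: "bareland",
    10: "ice/snow",
}


def class_fractions(pixel_values):
    """Return {class name: fraction of pixels} for a flat iterable of labels."""
    counts = Counter(pixel_values)
    unknown = set(counts) - set(GLOBE230K_CLASSES)
    if unknown:
        # Values outside 1-10 are not described in the dataset documentation.
        raise ValueError(f"unexpected label values: {sorted(unknown)}")
    total = sum(counts.values())
    return {
        GLOBE230K_CLASSES[value]: count / total
        for value, count in sorted(counts.items())
    }
```

Rejecting unknown values rather than ignoring them is a deliberate choice here, so that a patch with an unexpected encoding is noticed early.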
The 232,819 pairs are officially divided into a training set, a validation set, and a test set in a 7:1:2 ratio; the split can be found in the "train_num.txt", "val_num.txt" and "test_num.txt" files. Based on this division, the official baseline accuracy of several state-of-the-art semantic segmentation models can be found in the related article (https://spj.science.org/doi/10.34133/remotesensing.0078).
We hope it can be used as a benchmark to promote further development of global land cover mapping and semantic segmentation algorithms.
Objectives: Dillenia turbinata (Dilleniaceae) is a member of the order Dilleniales, an enigmatic clade of critical importance for understanding the diversification history of flowering plants but for which genome sequences are not available. We have produced and annotated a chromosome-scale whole genome assembly for D. turbinata through the resources of the 10KP (10,000 Plants) Genomes Project. The genome assembly and associated data provided here will serve as a useful resource for comparative and evolutionary genomics research across the flowering plants.
Data description: The D. turbinata genome was assembled from Oxford Nanopore Technology (ONT) and whole-genome shotgun (WGS) sequences, and scaffolded into chromosome-scale pseudomolecules using Hi-C data. The genome assembly is 723,739,077 base pairs in length with a BUSCO completeness score of 97%. Twenty-eight scaffolds contain more than 99% of the assembly. The repeat-masked genome sequence is annotated with 36,967 protein-codin...
Genome assembly and annotation: Raw nanopore reads in fastq format were assembled with Canu v2.2 (Koren et al. 2017) using an estimated genome size of 900 Mb to guide coverage parameters during the read correction, trimming, and assembly steps of the pipeline. The resulting primary assembly was polished with the WGS reads using NextPolish v1.3.1 (Hu et al. 2020), and duplicated constructs were removed by Purge Haplotigs (Roach et al. 2018). The set of deduplicated contigs was scaffolded on the basis of Hi-C reads using the Juicer pipeline (Durand et al. 2016) and 3d-dna tools (Dudchenko et al. 2017) with default parameters. Genome annotation was performed using the MAKER-P pipeline (Campbell et al. 2014) supplied with coding DNA sequences (CDS) from a Trinity (Grabherr et al. 2011) assembly of the Dillenia transcriptome reads, proteomes from four publicly available eudicot genomes (Arabidopsis thaliana, Aquilegia coerulea, Nelumbo nucifera, and Vitis vinifera), and a custom repeat library of tr...
The included data files may be opened with MS Word (Detailed Methods.docx), MS Excel (Dillenia.genome.assembly.stats.xlsx), standard image viewer software (Dillenia.BUSCO.summaries.png), and standard text editor programs (Dillenia.genome.fasta and Dillenia.maker.predict.36967.final.gff).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides the full genome annotation supporting the publication:
"Phased T2T reference genome assembly of Moroccan Argane (Argania spinosa)"
Hanane El Idrissi, Anestis Gkanogiannis, Driss Iraqi, Siham Khoulassa, Mohamed Fokar, Bouabid Badaoui, Rachid Moussadek, Rachid Mentag and Slimane Khayi (2025)
We present the genome annotation files for the phased, telomere-to-telomere (T2T), chromosome-scale genome assembly of Argania spinosa, an ecologically and economically important tree endemic to Morocco. The assembly comprises two fully phased haplotypes, each organized into 11 pseudochromosomes.
This Zenodo entry includes:
Structural gene annotation (GFF3) generated using the Funannotate pipeline
Predicted protein sequences (FASTA)
Repeat annotations:
GFF3 files for simple, complex, and combined repeat annotations
Soft-masked genome FASTA (simple and complex repeats masked in lowercase)
Hard-masked genome FASTA (repeats replaced with Ns)
Gene prediction was performed using transcript evidence (RNA-Seq from root and leaf tissues), protein homology from Ericales and SwissProt, and de novo ab initio models (AUGUSTUS, GeneMark-ES). Functional annotations were assigned using InterProScan, eggNOG, Pfam, and Gene Ontology databases. Repeat annotation was performed using RepeatModeler and RepeatMasker in a multi-round strategy incorporating both lineage-specific and de novo repeat libraries.
NCBI BioProject: PRJNA1223813
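The soft-masked and hard-masked FASTA files described in this entry differ only in how repeats are encoded: lowercase bases versus Ns. As a rough illustration of that relationship, the Python sketch below converts soft-masked sequence lines to hard-masked ones. This is not a statement of how the authors produced their files, and the function names hard_mask and hard_mask_fasta are hypothetical.

```python
def hard_mask(sequence):
    """Replace lowercase (soft-masked) bases with N; keep everything else."""
    return "".join("N" if base.islower() else base for base in sequence)


def hard_mask_fasta(lines):
    """Yield FASTA lines with sequence lines hard-masked and headers untouched."""
    for line in lines:
        if line.startswith(">"):
            # Header lines may contain lowercase text and must not be masked.
            yield line
        else:
            yield hard_mask(line)
```

Line endings pass through unchanged because newline characters are not lowercase letters, so the generator can be applied directly to an open text file and written out line by line.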
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is: "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies."
The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG.
The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C. briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G. theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N. crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P. falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A chromosome-scale assembly of the Oikopleura dioica genome from Okinawa, Japan. The contig assembly was generated from long-read Nanopore data using the Canu pipeline v1.8 and polished with short Illumina MiSeq reads using Pilon v1.22. Both Nanopore and Illumina data were generated from DNA of a single O. dioica male. Hi-C chromosomal conformation capture data were used to order and orient the contigs into scaffolds using the Juicer v1.6 and 3D de novo assembly (3D-DNA) pipelines. The OKI2018_I69_1.0 assembly comprises 19 scaffolds with an N50 of 16.2 Mbp (OKI2018_I69_1.0.fa). The total assembly length is 64.3 Mbp. The five longest scaffolds represent autosomal chromosomes (chr 1 and chr 2) and sex chromosomes split into a pseudo-autosomal region (PAR) and X-specific (XSR) or Y-specific (YSR) regions. One of the smaller scaffolds represents a draft assembly of the mitochondrial genome (chrUn_12). The remaining scaffolds are highly repetitive and were marked as unplaced. The OKI2018_I69_1.0 assembly was annotated with the AUGUSTUS v3.3 and MAKER v3.01.03 pipelines. Gene predictions from these tools were refined and merged using EvidenceModeler v1.1.1. To predict UTRs and alternative isoforms, the EVM models were updated using two rounds of the PASA pipeline, resulting in 18,485 transcript models distributed among 16,936 protein-coding genes (OKI2018_I69_1.0.gene_models.gff3).
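The transcript and gene counts quoted above can be cross-checked against a GFF3 file by tallying the feature type in its third column. The Python sketch below does that for any iterable of GFF3 lines. It is only an illustration: the helper name count_feature_types is hypothetical, and which type names the gene-model file actually uses (for example "gene" or "mRNA") is an assumption to verify against the file itself.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by type (column 3), skipping comments and FASTA."""
    counts = Counter()
    for line in lines:
        # An embedded "##FASTA" section ends the annotation part of the file.
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            # GFF3 feature lines have exactly nine tab-separated columns.
            continue
        counts[columns[2]] += 1
    return counts
```

Typical use would be `count_feature_types(open("OKI2018_I69_1.0.gene_models.gff3"))`, followed by a comparison of the relevant entries with the numbers stated in the record.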
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genomic DNA was extracted from blood from a single male A. cahirinus animal using a Monarch HMW DNA Extraction Kit for Cells & Blood (T3050, New England Biolabs, Ipswich MA) following the manufacturer’s recommended protocol. DNA was quantified prior to library construction using the Qubit DNA HS Assay (Thermo Fisher, Waltham MA) and DNA fragment lengths were assessed using the Agilent Femto Pulse System (Santa Clara, CA). Libraries were prepared for sequencing using the Oxford Nanopore ligation kit (SQK-LSK110) following the manufacturer’s instructions, except that DNA repair and A-tailing were performed for 30 min and the ligation was allowed to continue for 1 hr. Prepared libraries were quantified using a Qubit fluorometer, and 30 fmol of the library was loaded onto a Nanopore R9.4.1 flow cell and run on a PromethION running MinKNOW version 21.05.20. To increase output, the flow cell was washed after approximately 24 hr of sequencing, then an additional 12 fmol of library was added to the flow cell and run for an additional 48 hr.
Basecalling was performed using Guppy 5.0.12 (Oxford Nanopore) using the superior model (dna_r9.4.1_450bps_sup_prom.cfg). FASTQ files for assembly were extracted from unaligned bam files using samtools (Li et al. 2009) and then assembled with Flye version 2.9 using the --nano-hq flag (Kolmogorov et al. 2019). Haplotigs and overlaps in the assembly were purged using purge_dups (https://github.com/dfguan/purge_dups). The assembly was then polished using Medaka version 1.4.2 (https://github.com/nanoporetech/medaka) followed by a second polishing step with pilon version 1.24 (Walker et al. 2014). Assembly statistics at each step were generated using Quast (Gurevich et al. 2013) and BUSCO (Simão et al. 2015) (Table S2).
The primary contigs assembled from the Nanopore data were anchored to chromosomes using 505,210,505 read pairs of a Hi-C library isolated from another A. cahirinus individual of unknown sex, downloaded from the NCBI Short Read Archive (SRX13258644) (Wang et al. 2022). After aligning the Hi-C reads with the Arima Hi-C Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline), YaHS v1.0 (Zhou et al. 2023) was used with default error correction for scaffolding, and Juicebox v1.11.08 (Dudchenko et al. 2018) was used to generate a Hi-C contact map.
Progressive Cactus (Armstrong et al. 2020) was used to perform a whole-genome alignment of the A. cahirinus draft assembly to the Mus musculus GRCm39 reference genome (RefSeq GCF_000001635.27_GRCm39). Comparative annotation of the draft genomes was then performed using the Comparative Annotation Toolkit (CAT) (Fiddes et al. 2018). Briefly, the M. musculus RefSeq annotation GFF was parsed and validated with the “parse_ncbi_gff3” and “validate_gff3” programs (respectively) from CAT. The M. musculus reference transcript cDNA sequences were downloaded and mapped to the M. musculus draft genome with minimap2 (Li 2018) and provided to CAT as long-read RNA-seq reads in the “[ISO_SEQ_BAM]” field of the configuration file. For A. cahirinus, bulk RNA-seq data obtained from multiple pooled organs were downloaded from NCBI SRA BioProject PRJNA342864 (Bellofiore et al. 2017), mapped to the draft assembly with STAR (Dobin et al. 2013), and then provided to CAT in the “[BAMS]” field. CpG islands were identified using the cpg_lh utility from the UCSC suite of tools (Kent et al. 2002).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, there is a major bottleneck in the success of such approaches because supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in the field of CPath because the generation of detailed annotations usually demands the input of a pathologist to be able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. Yet, solely relying on automatic generation of annotations will limit the accuracy and reliability of ground truth. Therefore, to help overcome the above challenges, we propose a multi-stage annotation pipeline to enable the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We will publish the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.
@inproceedings{graham2021lizard,
title={Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification},
author={Graham, Simon and Jahanifar, Mostafa and Azam, Ayesha and Nimir, Mohammed and Tsang, Yee-Wah and Dodd, Katherine and Hero, Emily and Sahota, Harvir and Tank, Atisha and Benes, Ksenija and others},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={684--693},
year={2021}
}
We would like to acknowledge the following institutions, where the images in this dataset originated from:
Pipeline classification: Pipeline classification for annotation and reconstruction of genome-scale metabolic models, established according to dataset analysis. (XLSX 1790 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we provide early access to 18 new genome assemblies, including 8 assembled to chromosome-scale, for aphids from the subfamily Aphidinae. For consistency and to aid comparative analysis, all genomes have been annotated using the same repeat masking and RNA-seq-based gene prediction pipeline. Using this pipeline we also provide new annotations for three previously published genome assemblies.
The genome assemblies and annotations are made freely available without restriction; we only request that this Zenodo resource is cited when using the data. Raw sequence data upload to NCBI is underway, and full details of all accessions will be given in an updated version of this resource. Manuscripts describing the individual genome assemblies in detail and larger comparative genome analyses are in preparation, and we will update this resource with additional citation information as papers are published.
Full details of all genome assemblies and annotations included in this release are given in the attached "Data_Description.pdf" document.
Aphid species included in this release (bold type = chromosome-scale assembly):
Aphis fabae
Aphis glycines (updated annotation)
Aphis gossypii
Aphis thalictri
Aphis rumicis
Brachycaudus cardui
Brachycaudus helichrysi
Brachycaudus klugkisti
Brevicoryne brassicae
Diuraphis noxia
Macrosiphum albifrons
Metopolophium dirhodum
Myzus cerasi (updated annotation)
Myzus ligustri
Myzus lythri
Myzus varians
Pentalonia nigronervosa (updated annotation)
Phorodon humuli
Rhopalosiphum padi
Sitobion avenae
Sitobion miscanthi
Database that integrates large-scale functional genomics assays and manual cDNA annotation with bioinformatics gene expression and protein analysis. LifeDB integrates data on full-length cDNA clones with data on the expression of the encoded proteins and their subcellular localization in mammalian cell lines. LifeDB enables the scientific community to systematically search and select genes, proteins and cDNAs of interest by specific database identifiers as well as by gene name. It allows users to visualize cDNA clones and the subcellular location of proteins. It also links the results to external biological databases in order to provide broader functional information. LifeDB also provides an annotation pipeline which facilitates an improved mapping of clones to known human reference transcripts from the RefSeq database and the Ensembl database. An advanced web interface enables researchers to view the data in a more user-friendly manner. Users can search using any one of the following options, available both in "Search genes and cDNA clones" and in "Search sub-cellular locations of human proteins": By Keyword, By gene/transcript identifier, By plate name, By clone name, By cellular location.
* The "Search genes and cDNA clones" results include: Gene Name, Ensembl ID, Genomic Region, Clone name, Plate name, Plate position, Classification class, Synonymous SNPs, Non-synonymous SNPs, Number of ambiguous positions, and Alignment with reference genes.
* The "Search sub-cellular locations of human proteins" results include: Subcellular location, Gene Name, Ensembl ID, Clone name, True localization, Images, Start tag and End tag.
Every result page has an option to download the result data (excluding the microscopy images). On clicking the "Download results as CSV-file" link on the result page, the user is given a choice to open or save the result data as a CSV (Comma Separated Values) file. The CSV file can then be opened using Excel or OpenOffice.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Robust subset of annotations from the IBD dataset.
https://www.gnu.org/licenses/agpl.txt
MammoTab 25 is a large‑scale, richly‑annotated benchmark designed to advance research on Semantic Table Interpretation (STI) and to evaluate the reasoning abilities of modern Large Language Models (LLMs).
Scale and origin – The corpus contains 838,930 tables automatically extracted from 63 million English‑language Wikipedia pages.
Comprehensive annotations – Every table is accompanied by:
Cell-Entity Annotation (CEA),
Column-Type Annotation (CTA),
Columns-Property Annotation (CPA),
Four ready‑to‑use prompt templates for LLM training and stress‑testing,
Fine‑grained metadata capturing column roles (Named‑Entity vs Literal), NIL flags, header/caption context, and structural statistics.
Challenge coverage – Tagged metadata enables users to isolate and diagnose all key STI challenges, including multi-domain tables, acronyms, aliases, typos, approximate numeric values, and true NIL mentions, making the dataset suitable for both benchmarking and error analysis.
Format & access – Tables are stored as CSV files; annotations are provided in separate CSVs following the SemTab format; contextual information is packed in JSON side‑cars.
The pipeline for regenerating the dataset is openly available on GitHub at https://github.com/unimib-datAI/mammotab/.
The documentation is available at https://unimib-datai.github.io/mammotab-docs/.
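To make the table-plus-annotation layout concrete, the following sketch pairs the cells of one table CSV with Cell-Entity Annotation rows. It assumes headerless CEA rows laid out as table_id, row, column, entity; SemTab target layouts have varied between editions, so the column order should be verified against the MammoTab documentation. The table content and identifiers are illustrative only.

```python
import csv
import io


def attach_cea(table_csv, cea_csv, table_id):
    """Return {(row, col): (cell_text, entity)} for one table.

    Assumes headerless CEA rows laid out as: table_id, row, col, entity.
    Row and column indices are taken to be zero-based positions in the
    table CSV, with the header counted as row 0 (an assumption).
    """
    cells = list(csv.reader(io.StringIO(table_csv)))
    linked = {}
    for rec in csv.reader(io.StringIO(cea_csv)):
        if len(rec) < 4 or rec[0] != table_id:
            continue
        row, col = int(rec[1]), int(rec[2])
        # Ignore annotations that point outside the table.
        if row < len(cells) and col < len(cells[row]):
            linked[(row, col)] = (cells[row][col], rec[3])
    return linked


# Illustrative data, not taken from the dataset.
TABLE = "City,Country\nMilan,Italy\n"
CEA = "T1,1,0,Q490\nT1,1,1,Q38\nT2,1,0,Q64\n"

if __name__ == "__main__":
    print(attach_cea(TABLE, CEA, "T1"))
```

Keeping annotations in a dictionary keyed by cell position makes it straightforward to spot cells with no annotation, which is where NIL handling would come in.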
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Of the ∼4000 ORFs identified through the genome sequence of Mycobacterium tuberculosis (TB) H37Rv, experimentally determined structures are available for 312. Since knowledge of protein structures is essential to obtain a high-resolution understanding of the underlying biology, we seek to obtain a structural annotation for the genome, using computational methods. Structural models were obtained and validated for ∼2877 ORFs, covering ∼70% of the genome. Functional annotation of each protein was based on fold-based functional assignments and a novel binding site based ligand association. New algorithms for binding site detection and genome scale binding site comparison at the structural level, recently reported from the laboratory, were utilized. Besides these, the annotation covers detection of various sequence and sub-structural motifs and quaternary structure predictions based on the corresponding templates. The study provides an opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 folds. New insights about the folds that predominate in the genome, as well as the fold-combinations that make up multi-domain proteins are also obtained. 1728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses. The resource (http://proline.physics.iisc.ernet.in/Tbstructuralannotation), being one of the first to be based on structure-derived functional annotations at a genome scale, is expected to be useful for better understanding of TB and for application in drug discovery. The reported annotation pipeline is fairly generic and can be applied to other genomes as well.
Dataset Card for Dataset Name
PDR is the official dataset used in the paper https://huggingface.co/papers/2505.20024.
Dataset Details
PDR is a large-scale instruction dataset tailored for closed-loop planning, containing 203,353 training samples and 11,047 testing samples. Using an automated annotation pipeline, PDR captures the complete decision reasoning process in training scenarios on Bench2Drive, including the following stages: Scene Understanding, Traffic Sign… See the full description on the dataset page: https://huggingface.co/datasets/LiuxyIA/ReasonPlan_PDR.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental Evidence Code Only
This is a PostgreSQL database backup using the pgvector extension to store high-dimensional protein embeddings. It contains precomputed embeddings and functional annotations from the UniProt July 2025 release, including only entries supported by experimental evidence.
This lookup table was generated using version v2.0.0 of the Protein Information System (PIS), an integrated biological information system designed for the automated extraction, processing, and management of protein-related data. PIS consolidates information from UniProt, PDB, and GOA, allowing efficient retrieval and organization of sequences, structures, and annotations.
The resulting database is designed for compatibility with FANTASIA V3, an advanced pipeline for large-scale functional annotation of proteins using state-of-the-art Protein Language Models (PLMs). While the lookup table is stored in a vector database for persistence, FANTASIA loads the relevant data into memory at runtime to enable high-speed annotation.
FANTASIA uses precomputed deep learning embeddings to perform nearest-neighbor searches in embedding space and transfer Gene Ontology (GO) terms from experimentally annotated proteins to query sequences.
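The nearest-neighbor transfer described here can be sketched in a few lines. This is not FANTASIA's implementation: the cosine distance metric, the `k` and `max_distance` parameters, the toy three-dimensional vectors, and the protein identifiers are all assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def transfer_go_terms(query, reference, k=1, max_distance=0.5):
    """Collect GO terms from the k nearest reference embeddings.

    reference: list of (protein_id, embedding, set_of_go_terms).
    Neighbors farther than max_distance contribute nothing.
    """
    ranked = sorted(reference, key=lambda rec: cosine_distance(query, rec[1]))
    terms = set()
    for _, embedding, go_terms in ranked[:k]:
        if cosine_distance(query, embedding) <= max_distance:
            terms |= go_terms
    return terms


# Hypothetical lookup table; real embeddings have hundreds of dimensions.
REFERENCE = [
    ("P1", [1.0, 0.0, 0.0], {"GO:0003677"}),
    ("P2", [0.0, 1.0, 0.0], {"GO:0005524"}),
    ("P3", [0.9, 0.1, 0.0], {"GO:0006355"}),
]

if __name__ == "__main__":
    print(transfer_go_terms([1.0, 0.05, 0.0], REFERENCE, k=2))
```

A distance cutoff of this kind keeps queries with no close neighbor from inheriting unrelated terms; how a production pipeline sets or tunes such a cutoff is outside what this sketch shows.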
Total proteins: 127,546
Total sequences: 124,397
Total embeddings: 621,849
Total GO annotations: 627,932
Included evidence codes (Gene Ontology, experimental only):
EXP – Inferred from Experiment
IDA – Inferred from Direct Assay
IPI – Inferred from Physical Interaction
IMP – Inferred from Mutant Phenotype
IGI – Inferred from Genetic Interaction
IEP – Inferred from Expression Pattern
TAS – Traceable Author Statement
IC – Inferred by Curator
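A filter over the evidence codes listed above might look like the sketch below. It assumes annotations arrive as tab-separated GAF lines, where the evidence code is the seventh column; whether PIS filters at this stage or in this way is not stated in the description, and the sample accessions are hypothetical.

```python
# Evidence codes retained in this lookup table, as listed in the description.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def filter_gaf_lines(lines, allowed=EXPERIMENTAL_CODES):
    """Return (object_id, go_id, evidence) for GAF lines with an allowed code.

    GAF columns used here: 2 = DB object ID, 5 = GO ID, 7 = evidence code.
    Comment lines starting with '!' are skipped.
    """
    kept = []
    for line in lines:
        if line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] in allowed:
            kept.append((fields[1], fields[4], fields[6]))
    return kept


# Hypothetical GAF-style lines (truncated to the first seven columns).
SAMPLE_GAF = [
    "!gaf-version: 2.2\n",
    "UniProtKB\tX00001\tgeneA\tenables\tGO:0003677\tPMID:1\tIDA\n",
    "UniProtKB\tX00002\tgeneB\tenables\tGO:0005524\tGO_REF:1\tIEA\n",
    "UniProtKB\tX00003\tgeneC\tinvolved_in\tGO:0006355\tPMID:2\tIMP\n",
]

if __name__ == "__main__":
    for record in filter_gaf_lines(SAMPLE_GAF):
        print(record)
```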
ESM-2 (650M parameters)
A transformer-based protein language model trained on UniRef50 using masked language modeling. It captures structural and functional features directly from raw sequences without requiring MSAs. ESM-2 is widely used for contact map prediction, unsupervised learning, and representation extraction.
ProtT5-XL-UniRef50 (~1.2B parameters)
A large-scale encoder-decoder model using the T5 architecture, trained on UniRef50 via masked span prediction. It generates high-dimensional sequence representations that perform well across structure and function prediction tasks.
ProstT5 (~1.2B parameters)
A multi-modal extension of ProtT5, trained to predict both sequence and coarse-grained 3Di structural states. Useful for downstream applications like contact prediction, functional annotation, and classification.
Ankh3-Large (620M parameters)
An encoder-only T5-style model trained with masked span prediction. Optimized for fast inference, it encodes both semantic and structural protein information and can replace ProtT5 in many ML pipelines.
ESM3c (Cambrian 600M)
Part of the new ESM C model family, trained on UniRef, MGnify, and JGI datasets. With rotary embeddings and 36 layers, it offers enhanced performance for masked language modeling, producing high-quality structural and functional embeddings without alignments.
A small subset of proteins could not be processed on the Finisterrae III (CESGA) supercomputer due to memory limitations with 40 GB A100 GPUs.
The file missing_proteins.csv lists all affected UniProt identifiers. These entries are excluded from the final lookup table.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present functional annotation results for 15 arthropod proteomes, produced with an open-source, open-access, containerized pipeline for genome-scale functional annotation of insect proteomes, which we apply to a diverse range of arthropod species. More information about the pipeline is available at our readthedocs site. The files for each genome include GOanna, InterProScan and KOBAS predictions. Arthropod genomes selected for this study and their assembly and annotation statistics.
Apis mellifera (honey bee)
Drosophila melanogaster (fruit fly)
Tribolium castaneum (red flour beetle)
Latrodectus hesperus (Western black widow spider)
Limnephilus lunatus (caddisfly)
Oncopeltus fasciatus (Large milkweed bug)
Homalodisca vitripennis (Glassy-winged sharpshooter)
Eurytemora affinis (calanoid copepod)
Agrilus planipennis (emerald ash borer)
Copidosoma floridanum (parasitoid wasp)
Athalia rosae (turnip sawfly)
Ceratitis capitata (Mediterranean fruit fly)
Cimex lectularius (Cimicidae bed bug)
Varroa destructor (parasitic mite)
Diaphorina citri (Asian citrus psyllid)
Resources in this dataset:
Resource Title: Cimex lectularius (Cimicidae bed bug) annotation. File Name: CLEC.tar.gz. Resource Description: Functional annotation for Clec-OGSv1.2 protein set
Resource Title: Tribolium castaneum (red flour beetle) annotation. File Name: TCAS.tar.gz. Resource Description: Functional annotation for TCAS_OGS_v3 protein set
Resource Title: Drosophila melanogaster (fruit fly) annotation. File Name: DMEL.tar.gz. Resource Description: Functional annotation for DMEL_r6.38 protein set
Resource Title: Varroa destructor (parasitic mite) annotation. File Name: VDES.tar.gz. Resource Description: Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1)
Resource Title: Oncopeltus fasciatus (Large milkweed bug) annotation. File Name: ONCFAS.tar.gz. Resource Description: Functional annotation for oncfas_OGSv1.2 protein set
Resource Title: Apis mellifera (honey bee) annotation. File Name: AMEL.tar.gz. Resource Description: Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1)
Resource Title: Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File Name: HVIT.tar.gz. Resource Description: Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1)
Resource Title: Limnephilus lunatus (caddisfly) annotation. File Name: LLUN.tar.gz. Resource Description: Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1)
Resource Title: Latrodectus hesperus (Western black widow spider) annotation. File Name: LHES.tar.gz. Resource Description: Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1)
Resource Title: Eurytemora affinis (calanoid copepod) annotation. File Name: EAFF.tar.gz. Resource Description: Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1)
Resource Title: Copidosoma floridanum (parasitoid wasp) annotation. File Name: CFLO.tar.gz. Resource Description: Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1)
Resource Title: Ceratitis capitata (Mediterranean fruit fly) annotation. File Name: CCAP.tar.gz. Resource Description: Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2)
Resource Title: Athalia rosae (turnip sawfly) annotation. File Name: AROS.tar.gz. Resource Description: Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1)
Resource Title: Agrilus planipennis (emerald ash borer) annotation. File Name: APLA.tar.gz. Resource Description: Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1)
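Since each resource is distributed as a gzip-compressed tar archive, a quick way to see what an archive contains before extracting it is to list its members. The sketch below uses Python's standard `tarfile` module; the internal file names in the demo archive are hypothetical, as the description does not state how the GOanna, InterProScan and KOBAS outputs are named inside each archive.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return sorted names of regular files inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_demo_archive():
    """Create a tiny in-memory .tar.gz with hypothetical file names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("CLEC/goanna.tsv", "CLEC/interproscan.tsv"):
            payload = b"placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_archive_members(build_demo_archive()))
```

For a downloaded file, the same function could be called as `list_archive_members(open("CLEC.tar.gz", "rb"))`.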