60 datasets found
  1. Data from: Functional annotation for 15 diverse arthropod genomes

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Functional annotation for 15 diverse arthropod genomes [Dataset]. https://catalog.data.gov/dataset/functional-annotation-for-15-diverse-arthropod-genomes-2c303
    Dataset provided by
    Agricultural Research Service
    Description

    We present the annotation results for 15 arthropod proteomes, produced with an open-source, open-access, containerized pipeline for genome-scale functional annotation of insect proteomes and applied here to a diverse range of arthropod species. You can find more information about the pipeline at our readthedocs site. The files for each genome include GOanna, InterProScan and KOBAS predictions.

    Arthropod genomes selected for this study and their assembly and annotation statistics:
    • Apis mellifera (honey bee)
    • Drosophila melanogaster (fruit fly)
    • Tribolium castaneum (red flour beetle)
    • Latrodectus hesperus (Western black widow spider)
    • Limnephilus lunatus (caddisfly)
    • Oncopeltus fasciatus (Large milkweed bug)
    • Homalodisca vitripennis (Glassy-winged sharpshooter)
    • Eurytemora affinis (calanoid copepod)
    • Agrilus planipennis (emerald ash borer)
    • Copidosoma floridanum (parasitoid wasp)
    • Athalia rosae (turnip sawfly)
    • Ceratitis capitata (Mediterranean fruit fly)
    • Cimex lectularius (Cimicidae bed bug)
    • Varroa destructor (parasitic mite)
    • Diaphorina citri (Asian citrus psyllid)

    Resources in this dataset:
    • Cimex lectularius (Cimicidae bed bug) annotation. File name: CLEC.tar.gz. Functional annotation for Clec-OGSv1.2 protein set.
    • Tribolium castaneum (red flour beetle) annotation. File name: TCAS.tar.gz. Functional annotation for TCAS_OGS_v3 protein set.
    • Drosophila melanogaster (fruit fly) annotation. File name: DMEL.tar.gz. Functional annotation for DMEL_r6.38 protein set.
    • Varroa destructor (parasitic mite) annotation. File name: VDES.tar.gz. Functional annotation for NCBI Varroa destructor Annotation Release 100 protein set based on Vdes_3.0 genome (GCA_002443255.1).
    • Oncopeltus fasciatus (Large milkweed bug) annotation. File name: ONCFAS.tar.gz. Functional annotation for oncfas_OGSv1.2 protein set.
    • Apis mellifera (honey bee) annotation. File name: AMEL.tar.gz. Functional annotation for OGSv3.3 protein set from Amel_4.5 genome (GCA_000002195.1).
    • Homalodisca vitripennis (Glassy-winged sharpshooter) annotation. File name: HVIT.tar.gz. Functional annotation for HVIT-BCM_version_0.5.3 protein set based on Hvit_1.0 genome (GCA_000696855.1).
    • Limnephilus lunatus (caddisfly) annotation. File name: LLUN.tar.gz. Functional annotation for LLUN-BCM_version_0.5.3 protein set from Llun_1.0 genome (GCA_000648945.1).
    • Latrodectus hesperus (Western black widow spider) annotation. File name: LHES.tar.gz. Functional annotation for LHES-BCM_version_0.5.3 protein set from Lhes_1.0 genome (GCA_000697925.1).
    • Eurytemora affinis (calanoid copepod) annotation. File name: EAFF.tar.gz. Functional annotation for EAFF-BCM_version_0.5.3 protein set from Eaff_1.0 genome (GCA_000591075.1).
    • Copidosoma floridanum (parasitoid wasp) annotation. File name: CFLO.tar.gz. Functional annotation for CFLO-BCM_version_0.5.3 protein set based on Cflo_1.0 genome (GCA_000648655.1).
    • Ceratitis capitata (Mediterranean fruit fly) annotation. File name: CCAP.tar.gz. Functional annotation for Ccap-OGSv1 protein set based on Ccap_1.1 assembly (GCA_000347755.2).
    • Athalia rosae (turnip sawfly) annotation. File name: AROS.tar.gz. Functional annotation for AROS-BCM_version_0.5.3 protein set based on Aros_1.0 genome (GCA_000344095.1).
    • Agrilus planipennis (emerald ash borer) annotation. File name: APLA.tar.gz. Functional annotation for APLA-BCM_version_0.5.3 protein set based on Apla_1.0 genome (GCA_000699045.1).
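    Each resource above is distributed as a gzip-compressed tar archive. The sketch below shows one way to list the files inside such an archive before extracting it, using Python's standard tarfile module. The internal layout of the real archives is not described in this listing, so the member names in the demo archive are hypothetical placeholders.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return the sorted names of regular files inside a .tar.gz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_demo_archive():
    """Build a tiny in-memory archive; the member names are hypothetical."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("CLEC/goanna.tsv", "CLEC/interproscan.tsv", "CLEC/kobas.tsv"):
            payload = b"placeholder\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


members = list_archive_members(build_demo_archive())
print(members)
```

    For a downloaded file, the same function can be called with an opened binary file handle, for example open("CLEC.tar.gz", "rb").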

  2. Supplementary materials for "Robustness analysis of metabolic predictions in...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 1, 2021
    Cite
    Dittami, Simon M. (2021). Supplementary materials for "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4436002
    Dataset provided by
    Belcour, Arnaud
    Geslain, Enora
    Aïte, Méziane
    Siegel, Anne
    Dittami, Simon M.
    Corre, Erwan
    Frioux, Clémence
    Karimi, Elham
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary materials for the revised version of the PeerJ preprint "Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines".

  3. ViDAS Dataset

    • paperswithcode.com
    Updated Sep 30, 2024
    Cite
    (2024). ViDAS Dataset [Dataset]. https://paperswithcode.com/dataset/vidas
    Description

    The dataset contains 100 videos covering different scenarios with varying danger levels (rated on a scale of 0-10). The videos were annotated by 18 human annotators using our annotation pipeline to represent human perception, and each video is accompanied by a Vision Language Model summary. Together these serve as benchmarks for testing LLMs' perception of danger.

  4. Data from: A chromosome-scale genome assembly of the okapi (Okapia...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Jul 5, 2022
    Cite
    Sven Winter; Raphael T. F. Coimbra; Philippe Helsen; Axel Janke (2022). A chromosome-scale genome assembly of the okapi (Okapia johnstoni) [Dataset]. http://doi.org/10.5061/dryad.37pvmcvp3
    Dataset provided by
    Senckenberg Biodiversity and Climate Research Centre
    Centre for Research and Conservation, Royal Zoological Society of Antwerp
    Authors
    Sven Winter; Raphael T. F. Coimbra; Philippe Helsen; Axel Janke
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The okapi (Okapia johnstoni), or forest giraffe, is the only species in its genus and the only extant sister group of the giraffe within the family Giraffidae. The species is one of the remaining large vertebrates surrounded by mystery because of its elusive behavior as well as the armed conflicts in the region where it occurs, making it difficult to study. Deforestation puts the okapi under constant anthropogenic pressure, and it is currently listed as "Endangered" on the IUCN Red List. Here, we present the first annotated de novo okapi genome assembly based on PacBio continuous long reads, polished with short reads, and anchored into chromosome-scale scaffolds using Hi-C proximity ligation sequencing. The final assembly (TBG_Okapi_asm_v1) has a length of 2.39 Gbp, of which 98% are represented by 28 scaffolds >3.9 Mbp. The contig N50 of 61 Mbp and scaffold N50 of 102 Mbp, together with a BUSCO score of 94.7% and 23,412 annotated genes, underline the high quality of the assembly. This chromosome-scale genome assembly is a valuable resource for future conservation of the species and comparative genomic studies among the giraffids and other ruminants.

    Methods

    Genome assembly: We assembled the genome of the okapi from PacBio CLR reads using WTDBG2 v. 2.5 (Ruan & Li, 2019) with the preset for PacBio Sequel reads (flag '-x sq'), followed by three iterations of long-read polishing with racon v.1.4.3 (Vaser et al., 2017) and three iterations of short-read polishing with pilon v.1.23 (Walker et al., 2014). The assembly was scaffolded into chromosome-scale scaffolds with the Dovetail Genomics HiRise pipeline (Putnam et al., 2016) using publicly available Hi-C data (SRR8616855, SRR8616856). Subsequently, three iterations of gap-closing were performed using TGS-GapCloser v.1.1.1 (Xu et al., 2020). The resulting final assembly can be found under the filename: TBG_Okapi_asm_v1.fasta.

    Genome annotation: Prior to gene annotation, we used RepeatModeler v. 2.0.1 to generate a de novo repeat library. This library was combined with a Cetartiodactyla-specific (Flynn et al., 2020) library from RepBase (Bao et al., 2015) and used as a custom repeat library for masking repeats with RepeatMasker v.4.1.0 (http://www.repeatmasker.org/RMDownload.html). Interspersed repeats were hard-masked while simple repeats were soft-masked. The masked assembly file can be found under the filename: TBG_Okapi_asm_v1_hardmaskedTEs_softmaskedSR.fasta.

    After repeat masking, we used the GeMoMa pipeline v.1.7.1 (Keilwagen et al., 2016, 2018) for homology-based gene prediction with the alignment tool MMSeqs2 (Steinegger & Söding, 2017). As references we used the assemblies and annotations of the following ten mammalian species from GenBank: Bos taurus (GCF_002263795.1), Homo sapiens (GCF_000001405.39), Mus musculus (GCF_000001635.27), Sus scrofa (GCF_000003025.6), Camelus dromedarius (GCF_000803125.2), Equus caballus (GCF_002863925.1), Ovis aries (GCF_002742125.1), Tursiops truncatus (GCF_011762595.1), Cervus hanglu yarkandensis (GCA_010411085.1), and Capra hircus (GCF_001704415.1).

    Subsequently, the predicted genes were annotated by a BLASTP v.2.11.0+ (Zhang et al., 2000) search against the Swiss-Prot database (release 2021-01) with an e-value cutoff of 10^-6. We further annotated Gene Ontology terms, motifs, and domains using InterProScan v.5.50.84 (Jones et al., 2014; Quevillon et al., 2005). The annotation results (gff3, CDS fasta, proteins fasta) can be found under the filenames:
    • TBG_Okapi_asm_v1_annotation_all.fun.gff
    • TBG_Okapi_asm_v1_annotation_CDS.fun.fasta
    • TBG_Okapi_asm_v1_annotation_proteins.fun.fasta

    For detailed methods and additional results, please read the linked publication in the Journal of Heredity.
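    The contig and scaffold N50 values quoted in this description follow the standard definition: the length L such that sequences of length L or longer together make up at least half of the total assembly. A minimal sketch of that calculation is given below. The toy lengths are invented for illustration and are not taken from the okapi assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together cover at least half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Invented toy lengths: total 300, so half of the total is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

    In practice the lengths would be read from the assembly FASTA (one length per contig or scaffold) rather than typed in by hand.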

  5. A high-quality genome assembly for Dillenia turbinata (Dilleniales)

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Nov 29, 2023
    Cite
    Andre Chanderbali; Xing Guo; Qiong Qiong Lin; Shi Xiao Luo; Huan Liu; Matthew Gitzendanner; Steven Smith; Douglas Soltis; Pamela Soltis (2023). A high-quality genome assembly for Dillenia turbinata (Dilleniales) [Dataset]. http://doi.org/10.5061/dryad.msbcc2g3j
    Dataset provided by
    Dryad Digital Repository
    Authors
    Andre Chanderbali; Xing Guo; Qiong Qiong Lin; Shi Xiao Luo; Huan Liu; Matthew Gitzendanner; Steven Smith; Douglas Soltis; Pamela Soltis
    Time period covered
    May 19, 2023
    Description

    Objectives: Dillenia turbinata (Dilleniaceae) is a member of the order Dilleniales, an enigmatic clade of critical importance for understanding the diversification history of flowering plants but for which genome sequences are not available. We have produced and annotated a chromosome-scale whole genome assembly for D. turbinata through the resources of the 10KP (10,000 Plants) Genomes Project. The genome assembly and associated data provided here will serve as a useful resource for comparative and evolutionary genomics research across the flowering plants.

    Data description: The D. turbinata genome was assembled from Oxford Nanopore Technology (ONT) and whole-genome shotgun (WGS) sequences, and scaffolded into chromosome-scale pseudomolecules using Hi-C data. The genome assembly is 723,739,077 base pairs in length with a BUSCO completeness score of 97%. Twenty-eight scaffolds contain more than 99% of the assembly. The repeat-masked genome sequence is annotated with 36,967 protein-codin...

    Genome assembly and annotation: Raw nanopore reads in fastq format were assembled with Canu v2.2 (Koren et al. 2017) using an estimated genome size of 900 Mb to guide coverage parameters during the read correction, trimming, and assembly steps of the pipeline. The resulting primary assembly was polished with the WGS reads using NextPolish v1.3.1 (Hu et al. 2020), and duplicated constructs were removed by Purge Haplotigs (Roach et al. 2018). The set of deduplicated contigs was scaffolded on the basis of Hi-C reads using the Juicer pipeline (Durand et al. 2016) and 3d-dna tools (Dudchenko et al. 2017) with default parameters. Genome annotation was performed using the MAKER-P pipeline (Campbell et al. 2014) supplied with coding DNA sequences (CDS) from a Trinity (Grabherr et al. 2011) assembly of the Dillenia transcriptome reads, proteomes from four publicly available eudicot genomes (Arabidopsis thaliana, Aquilegia coerulea, Nelumbo nucifera, and Vitis vinifera), and a custom repeat library of tr...

    The included data files may be opened with MS Word (Detailed Methods.docx), MS Excel (Dillenia.genome.assembly.stats.xlsx), standard image viewer software (Dillenia.BUSCO.summaries.png), and standard text editor programs (Dillenia.genome.fasta and Dillenia.maker.predict.36967.final.gff).
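    Annotation files such as the .gff file listed above are tab-separated text with nine columns per feature line, the third of which gives the feature type. The sketch below counts feature types in GFF-formatted text, which is one simple way to sanity-check a gene count against a description. It is an illustration only: the two-gene example is invented, and the real file may use feature type names other than "gene".

```python
import io
from collections import Counter


def count_feature_types(gff_handle):
    """Count feature types (column 3) in GFF-formatted text."""
    counts = Counter()
    for line in gff_handle:
        # Skip comment/directive lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        counts[columns[2]] += 1
    return counts


# Invented two-gene example, not taken from the Dillenia annotation.
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "chr1\tmaker\tgene\t1500\t2400\t.\t-\t.\tID=g2\n"
)
feature_counts = count_feature_types(toy_gff)
print(feature_counts["gene"], feature_counts["mRNA"])
```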

  6. Data from: Globe230k: A Benchmark Dense-Pixel Annotation Dataset for Global...

    • zenodo.org
    • explore.openaire.eu
    txt, zip
    Updated Jul 7, 2024
    Cite
    Qian Shi; Da He; Zhengyu Liu; Xiaoping Liu; Jingqian Xue (2024). Globe230k: A Benchmark Dense-Pixel Annotation Dataset for Global Land Cover Mapping [Dataset]. http://doi.org/10.5281/zenodo.10435661
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Qian Shi; Da He; Zhengyu Liu; Xiaoping Liu; Jingqian Xue
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We (Intelligent Mining and Analysis of Remote Sensing big data, IMARS) created a large-scale annotated dataset, Globe230k, for land use/land cover (LULC) mapping, annotated on Google Earth imagery at 1 m spatial resolution. Globe230k was annotated by numerous experts and students majoring in surveying and mapping, after the necessary training, through visual interpretation of very high-resolution images as well as in-situ field surveys, under the guidance of an organized annotation pipeline. Globe230k has three advantages:

    1) Large scale: Globe230k includes 232,819 annotated images of size 512x512 at a spatial resolution of 1 m, with more than 3x10^10 annotated pixels, covering 10 first-level categories.

    2) Rich diversity: the annotated images are sampled from regions worldwide, covering an area of over 60,000 km^2, which indicates high variability and diversity. In addition, to ensure category balance, we intentionally gave rare categories, such as wetland and ice/snow, a higher chance of being sampled.

    3) Multi-modal: Globe230k contains not only RGB bands but also other important features for Earth system research, such as the normalized difference vegetation index (NDVI), digital elevation model (DEM), vertical-vertical polarization (VV) bands and vertical-horizontal polarization (VH) bands, which can facilitate research on multi-modal data fusion. Due to the large size of the multi-modal data (DEM 1.91G, NDVI 164G, VVVH 372G), these datasets are stored on Baidu Yunpan; the download link is https://pan.baidu.com/s/12AKbiqOXSf4fnm7mYkCE0g?pwd=230k and the extraction code is 230k.

    The image patches and their corresponding annotated patches are stored in "image_patch.zip" and "label_patch.zip", respectively. The RGB images are in ".jpg" format, 512x512 in size, with pixel values ranging from 0 to 255. The annotated patches are in ".png" format, also 512x512 in size, with pixel values ranging from 1 to 10, which represent 1#cropland, 2#forest, 3#grass, 4#shrubland, 5#wetland, 6#water, 7#tundra, 8#impervious, 9#bareland and 10#ice/snow, respectively. The corresponding DEM, NDVI and VVVH patches are all in ".tif" format, 512x512 in size (because the DEM, NDVI and VVVH patches have different resolutions, they are all resized to the same scale as the image patch).

    The 232,819 pairs are officially divided into a training set, a validation set and a test set at a ratio of 7:1:2; the split can be found in the "train_num.txt", "val_num.txt" and "test_num.txt" files. Based on this division, the official baseline accuracy of several state-of-the-art semantic segmentation methods can be found in the related article (https://spj.science.org/doi/10.34133/remotesensing.0078).

    We hope it can be used as a benchmark to promote further development of global land cover mapping and semantic segmentation algorithms.
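    The label encoding and the size figures given in this description can be written down as a small lookup table plus one line of arithmetic. The sketch below is an illustration, not code shipped with the dataset; the constant and function names are hypothetical. It maps the pixel values 1-10 to their category names and checks that 232,819 patches of 512x512 pixels do exceed the stated 3x10^10 annotated pixels.

```python
# Pixel value -> first-level category, as listed in the description.
GLOBE230K_CLASSES = {
    1: "cropland",
    2: "forest",
    3: "grass",
    4: "shrubland",
    5: "wetland",
    6: "water",
    7: "tundra",
    8: "impervious",
    9: "bareland",
    10: "ice/snow",
}

NUM_PATCHES = 232_819  # annotated image/label pairs
PATCH_SIDE = 512       # patches are 512x512 pixels


def label_name(pixel_value):
    """Translate a label-patch pixel value (1-10) into its category name."""
    return GLOBE230K_CLASSES[pixel_value]


# Total annotated pixels if every pixel of every patch carries a label.
total_pixels = NUM_PATCHES * PATCH_SIDE * PATCH_SIDE

print(label_name(5))
print(total_pixels > 3e10)
```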

  7. Additional file 3: of A De-Novo Genome Analysis Pipeline (DeNoGAP) for...

    • springernature.figshare.com
    xlsx
    Updated May 30, 2023
    Cite
    Shalabh Thakur; David Guttman (2023). Additional file 3: of A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies [Dataset]. http://doi.org/10.6084/m9.figshare.c.3628187_D2.v1
    Dataset provided by
    figshare
    Authors
    Shalabh Thakur; David Guttman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of genomes used for development and testing of the DeNoGAP pipeline. (XLSX 25 kb)

  8. FC309 genome assembly and annotation files

    • search.dataone.org
    Updated Mar 4, 2025
    Cite
    Kevin Dorn (2025). FC309 genome assembly and annotation files [Dataset]. http://doi.org/10.5061/dryad.wstqjq2x5
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kevin Dorn
    Description

    Sugar beet (Beta vulgaris L.) is a global source of table sugar and animal fodder. Here we report a highly contiguous, haplotype-phased genome assembly and annotation for sugar beet line FC309. Both assembled haplomes for FC309 represent the largest and most contiguous assembled beet genomes reported to date, and their gene annotation sets capture over 1,500 additional protein-coding loci compared to prior beet genome annotations. These new genomic resources were used to identify novel quantitative trait loci (QTL) for Fusarium yellows resistance from the FC309 genetic background using an F2 mapping-by-sequencing approach. The highest QTL signals were detected on Chromosome 3, spanning approximately 10 Mbp in both haplomes. A parallel transcriptome profiling experiment identified candidate genes within the Chromosome 3 QTL with plausible roles in disease response, including NBS-LRR genes with expression trends supporting a role in resistance. Investigation of genetic variants in t...

    High molecular weight DNA was isolated from a single plant from the FC309 sugar beet line for PacBio HiFi sequencing. Young, dark-treated leaf tissue was collected from the same plant for Dovetail Omni-C proximity ligation library preparation and Illumina sequencing. PacBio HiFi and Dovetail Omni-C reads were assembled using the software package Hifiasm to produce a phased contig-level assembly. The two phased contig-level assemblies were scaffolded using the Dovetail HiRise method to break mis-joined contigs and anchor/orient contigs into pseudochromosomes. The final assemblies, named USDA_Bvulg_FC309_v1.1 (haplome 1) and USDA_Bvulg_FC309_v1.2 (haplome 2), were independently annotated using the GenSAS genome annotation pipeline to identify and mask repetitive regions of the genome and to identify protein-coding loci.

    FC309 genome assembly and annotation files

    https://doi.org/10.5061/dryad.wstqjq2x5

    Description of the data and file structure

    Genome assembly and annotation files for sugar beet line FC309 developed as described in Todd et al. "A fully phased, chromosome-scale genome of sugar beet line FC309 enables the discovery of Fusarium yellows resistance QTL" published in DNA Research.

    Files and variables

    File: FC309.zip

    Description: Two directories containing genome assembly and annotation files for FC309 haplome 1 (v1.1.0) and FC309 haplome 2 (v1.2.0). Each haplome directory contains the assembly file (in FASTA format, USDA_Bvulg_FC309_v1.X.0.fasta), genome annotation file (GFF3 format, Masked-FC309v1.X.0-extra-contigs-publish.gff3) which contains the genomic coordinates of final protein coding loci, and both nucleotide (Masked-FC309v1.X.0-extra-contigs-publish.CDS.fna and Masked-FC309v1.X.0-extra-contigs-publish.genes.fna) and ...

  9. Data from: A chromosome-scale high-contiguity genome assembly of the...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jan 18, 2023
    Cite
    Sven Winter; René Meißner; Carola Greve; Alexander Ben Hamadou; Petr Horin; Stefan Prost; Pamela A. Burger (2023). A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus) [Dataset]. http://doi.org/10.5061/dryad.xksn02vkr
    Dataset provided by
    University of Oulu
    University of Veterinary Science Brno
    University of Veterinary Medicine Vienna
    LOEWE Centre for Translational Biodiversity Genomics
    Authors
    Sven Winter; René Meißner; Carola Greve; Alexander Ben Hamadou; Petr Horin; Stefan Prost; Pamela A. Burger
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The cheetah (Acinonyx jubatus, SCHREBER 1775) is a large felid and is considered the fastest land animal. Historically, it inhabited open grassland across Africa, the Arabian Peninsula, and southwestern Asia; however, only small and fragmented populations remain today. Here, we present a de novo genome assembly of the cheetah based on PacBio continuous long reads and Hi-C proximity ligation data. The final assembly (VMU_Ajub_asm_v1.0) has a total length of 2.38 Gb, of which 99.7% are anchored into the expected 19 chromosome-scale scaffolds. The contig and scaffold N50 values of 96.8 Mb and 144.4 Mb, respectively, a BUSCO completeness of 95.4% and a k-mer completeness of 98.4% emphasize the high quality of the assembly. Furthermore, annotation of the assembly identified 23,622 genes and a repeat content of 40.4%. This new highly contiguous and chromosome-scale assembly will greatly benefit conservation and evolutionary genomic analyses and will be a valuable resource, e.g., to gain a detailed understanding of the function and diversity of immune response genes in felids.

    Methods

    The presented data is related to the eponymous publication "A chromosome-scale high-contiguity genome assembly of the threatened cheetah (Acinonyx jubatus)" soon to be published in the Journal of Heredity. Any questions regarding this dataset or the publication can be addressed to the corresponding authors, Sven Winter (sven.winter@vetmeduni.ac.at) and Pamela Burger (pamela.burger@vetmeduni.ac.at).

    Assembly: The assembly was generated from one PacBio CLR library sequenced on one SMRTCell on a Sequel IIe using Flye v. 2.9, including one iteration of long-read polishing, followed by one iteration of short-read polishing with pilon v.1.23 using trimmed standard Illumina short reads generated on the Illumina Novaseq 6000 platform. Subsequently, the contigs of the polished assembly were anchored into chromosome-scale scaffolds with YaHS v.1.1 using publicly available Hi-C data for the cheetah (SRR8616936, SRR8616937) that were prepared following the Arima Hi-C mapping pipeline (https://github.com/VGP/vgp-assembly/blob/master/pipeline/salsa/arima_mapping_pipeline.sh). Finally, two iterations of gap-closing were performed with TGS-GapCloser v. 1.1.1 using a different random subset of PacBio reads (25%) for each iteration.

    Annotation:

    Repeat annotation: To improve gene prediction, we first identified and masked the repeats in the assembly. A de novo repeat library was generated with RepeatModeler v.2.0.1 and combined with a Felidae-specific repeat library from RepBase. This custom repeat library was then used with RepeatMasker v.4.1.0 to hard-mask interspersed repeats and soft-mask simple repeats.

    Gene annotation: We predicted genes in the masked assembly based on homology using the GeMoMa pipeline v. 1.7.1 and the following reference assemblies and annotation files: Homo sapiens (GCF_000001405.40), Mus musculus (GCF_000001635.27), Lynx canadensis (GCF_007474595.2), Canis lupus familiaris (GCF_014441545.1), Prionailurus bengalensis (GCF_016509475.1), Leopardus geoffroyi (GCF_018350155.1), Felis catus (GCF_018350175.1), Panthera tigris (GCF_018350195.1), and Panthera leo (GCF_018350215.1). We functionally annotated the predicted proteins using InterProScan v.5.50.84 and a BLASTP v.2.11.0 search against the Swiss-Prot database (release 2021-02). For more details on assembly quality assessment and comparative analyses to other Felidae assemblies, please read the original manuscript.

    This dataset comprises the following files:
    • VMU_Ajub_asm_v1.0.fasta (final unmasked assembly, also available at GenBank under accession GCA_027475565.1)
    • VMU_Ajub_asm_v1.0.fasta.masked (final assembly with all repeats hard-masked)
    • VMU_Ajub_asm_v1.0.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
    • VMU_Ajub_asm_v1.0.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0.fasta.masked)
    • VMU_Ajub_asm_v1.0_hardmaskedTE.fasta (final assembly with all interspersed repeats hard-masked)
    • VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
    • VMU_Ajub_asm_v1.0_hardmaskedTE.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE.fasta)
    • VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta (final assembly with all interspersed repeats hard-masked and simple repeats soft-masked)
    • VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.out (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
    • VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta.tbl (RepeatMasker output for the fully hard-masked assembly VMU_Ajub_asm_v1.0_hardmaskedTE_softmaskedSR.fasta)
    • consensi.fa.classified (de novo repeat library for the final assembly VMU_Ajub_asm_v1.0.fasta generated with RepeatModeler2)
    • Ajub_assembly_commands.txt (list of all commands used to generate the assembly and all related analyses)
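    The masked FASTA files listed above follow the usual convention in which hard-masked bases are replaced by N and soft-masked bases are written in lowercase. As a rough illustration of that distinction, the sketch below tallies both kinds of masking in FASTA text. It assumes that convention holds for these files, and note that N characters from ordinary assembly gaps would be counted as hard-masked too. The toy sequence is invented.

```python
import io


def masking_fractions(fasta_handle):
    """Return (hard_masked, soft_masked) fractions of all bases.

    Hard-masked: bases written as N/n. Soft-masked: other lowercase bases.
    Assembly gaps written as N are indistinguishable from hard masking here.
    """
    total = hard = soft = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        hard += sum(1 for base in line if base in "Nn")
        soft += sum(1 for base in line if base.islower() and base != "n")
    if total == 0:
        raise ValueError("no sequence data found")
    return hard / total, soft / total


# Invented toy record: 32 bases, 8 written as N and 8 in lowercase.
toy_fasta = io.StringIO(
    ">scaffold_1\n"
    "ACGTNNNNacgtACGT\n"
    "NNNNACGTacgtACGT\n"
)
hard_fraction, soft_fraction = masking_fractions(toy_fasta)
print(hard_fraction, soft_fraction)
```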

  10. Structural Annotation of Mycobacterium tuberculosis Proteome

    • plos.figshare.com
    tiff
    Updated May 30, 2023
    Cite
    Praveen Anand; Sandhya Sankaran; Sumanta Mukherjee; Kalidas Yeturu; Roman Laskowski; Anshu Bhardwaj; Raghu Bhagavat; Samir K. Brahmachari; Nagasuma Chandra (2023). Structural Annotation of Mycobacterium tuberculosis Proteome [Dataset]. http://doi.org/10.1371/journal.pone.0027044
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Praveen Anand; Sandhya Sankaran; Sumanta Mukherjee; Kalidas Yeturu; Roman Laskowski; Anshu Bhardwaj; Raghu Bhagavat; Samir K. Brahmachari; Nagasuma Chandra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Of the ∼4000 ORFs identified through the genome sequence of Mycobacterium tuberculosis (TB) H37Rv, experimentally determined structures are available for 312. Since knowledge of protein structures is essential to obtain a high-resolution understanding of the underlying biology, we seek to obtain a structural annotation for the genome, using computational methods. Structural models were obtained and validated for ∼2877 ORFs, covering ∼70% of the genome. Functional annotation of each protein was based on fold-based functional assignments and a novel binding site based ligand association. New algorithms for binding site detection and genome scale binding site comparison at the structural level, recently reported from the laboratory, were utilized. Besides these, the annotation covers detection of various sequence and sub-structural motifs and quaternary structure predictions based on the corresponding templates. The study provides an opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 folds. New insights about the folds that predominate in the genome, as well as the fold-combinations that make up multi-domain proteins are also obtained. 1728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses. The resource (http://proline.physics.iisc.ernet.in/Tbstructuralannotation), being one of the first to be based on structure-derived functional annotations at a genome scale, is expected to be useful for better understanding of TB and for application in drug discovery. The reported annotation pipeline is fairly generic and can be applied to other genomes as well.

  11. Annotation of genes encoding enzymes across marine phytoplankton genomes

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Apr 17, 2023
    Cite
    Naaman Omar; Brian Beardsall; Katherine Fleury; Esther Ataikiru; Douglas Campbell (2023). Annotation of genes encoding enzymes across marine phytoplankton genomes [Dataset]. http://doi.org/10.5061/dryad.kh1893284
    Dataset provided by
    University of Calgary
    Mount Allison University
    Dalhousie University
    Authors
    Naaman Omar; Brian Beardsall; Katherine Fleury; Esther Ataikiru; Douglas Campbell
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Phytoplankton cells span a large size range, from picoplankton (<2 µm), nanoplankton (2 to 20 µm), microplankton (20 to 200 µm) to macroplankton (200 to <2000 µm). Cell size interacts with multiple selective pressures, including cellular metabolic rate, light absorption, nutrient uptake, cell nutrient quotas, trophic interactions and diffusional exchanges with the environment. Beyond simple size, cells of different shapes differ in surface area to volume ratio. For example, more elongated cells, such as pennate diatoms, have a larger surface area to volume ratio compared to more rounded cells, such as centric diatoms, of equivalent biovolume, which can in turn influence diffusional exchanges between cells and their environment. We assembled metadata on diverse marine phytoplankters, in parallel with genomic or transcriptomic data annotations to identify genes encoding enzymes, to facilitate analyses of genomic patterns of encoded enzymes across diverse taxa, sizes, growth forms and origins of strains.

    Methods

    MetaData.csv contains information on the site and latitude of origin, cell size, genome size, presence of flagella and colony formation assembled for 146 diverse marine phytoplankters from citations listed in MetaData.csv. We used an automated pipeline implemented through Snakemake to pass gene sequences from the downloaded genomes and/or transcriptomes from the 146 phytoplankters, in .fasta format, to the eggNOG 5.0 database. We then used eggNOG-Mapper 2.0.6 and the DIAMOND algorithm to annotate potential orthologs in each analyzed genome or transcriptome, using the following parameters: seed_ortholog_evalue = 0.001, seed_ortholog_score = 60, tax_scope = "auto", go_evidence = "non-electronic", query_cover = 20 and subject_cover = 0.

    The output of automatically annotated orthologs, from each genome or transcriptome, from the bioinformatic pipeline was compiled into combinedHits.csv. Definitions of annotations and variable names are listed in CombinedHitsDataDictionary.csv. Definitions of variable names from MetaData.csv are listed in MetaDataDictionary.csv.
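    The size classes named in the description can be expressed as a small lookup, which may help when grouping rows of MetaData.csv by cell size. This is only a sketch: the function name `size_class` is hypothetical, and the treatment of the shared boundaries (2, 20 and 200 µm are assigned to the larger class) is an assumption, since the description does not say which class a boundary value belongs to.

```python
def size_class(diameter_um: float) -> str:
    """Map a cell diameter in micrometres to the size class named in the description.

    Assumption: boundary values (2, 20, 200 µm) fall into the larger class.
    """
    if diameter_um <= 0:
        raise ValueError("diameter must be positive")
    if diameter_um < 2:
        return "picoplankton"
    if diameter_um < 20:
        return "nanoplankton"
    if diameter_um < 200:
        return "microplankton"
    if diameter_um < 2000:
        return "macroplankton"
    # Sizes of 2000 µm and above are outside the range given in the description.
    return "unclassified"


if __name__ == "__main__":
    # Hypothetical diameters, not values taken from MetaData.csv.
    for d in (0.8, 5.0, 60.0, 450.0):
        print(d, size_class(d))
```

    The column in MetaData.csv that holds cell size is defined in MetaDataDictionary.csv; it is not named here because the description does not give it.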

  12. Data from: The genome of the Xingu Scale-backed Antbird (Willisornis vidua...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 25, 2020
    Cite
    Else K. Mikkelsen; Jason Weir (2020). The genome of the Xingu Scale-backed Antbird (Willisornis vidua nigrigula) reveals lineage-specific adaptations [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9cq
    Available download formats: zip
    Dataset updated
    Aug 25, 2020
    Dataset provided by
    Dryad
    Authors
    Else K. Mikkelsen; Jason Weir
    Time period covered
    2020
    Description

    This dataset provides the genome assembly and annotation of Willisornis poecilinotus vidua from the manuscript "The genome of the common scale-backed Antbird (Willisornis poecilinotus) reveals lineage-specific adaptations". It contains the following files:

    1) GFF format annotations of Willisornis vidua nigrigula, Hypocnemis ochrogyna, and Rhegmatorhina melanosticta

    These files are named "Willisornis_vidua.genome_annotation.gff", "Hypocnemis_ochrogyna.genome_annotation.gff", and "Rhegmatorhina_melanosticta.genome_annotation.gff"
    These GFF files contain annotation information for the locations of protein-coding genes in the genome, as well as locations of repeat-masked sequences.
    They also contain the genomic locations of alignments to known protein-coding genes used for protein prediction during the Maker2 pipeline.
    They also contain functional annotation information listing GO terms and functions predicted for the proteins by Interproscan.
    It also contains the loc...
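    For readers who want to inspect GFF files such as those listed above, the following is a minimal sketch of reading the nine tab-separated GFF columns and tallying feature types. It assumes GFF3-style `key=value` attributes; the function names and the example records are hypothetical and are not taken from the dataset files.

```python
from collections import Counter


def parse_gff_line(line):
    """Split one GFF feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    # Column 9 holds semicolon-separated key=value pairs in GFF3.
    attributes = {}
    for item in attrs.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def count_feature_types(lines):
    """Count how many records of each feature type appear in an iterable of GFF lines."""
    counts = Counter()
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            counts[record["type"]] += 1
    return counts


# Made-up records for illustration only.
EXAMPLE = [
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=hypothetical",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "scaffold_1\trepeatmasker\tmatch\t1200\t1500\t.\t-\t.\tID=rep1",
]

if __name__ == "__main__":
    print(count_feature_types(EXAMPLE))
```

    To run it on a real file, pass an open file handle instead of `EXAMPLE`, for instance `count_feature_types(open(path))`.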
    
  13. Data from: Large-scale fungal strain sequencing unravels the molecular...

    • explore.openaire.eu
    • search.dataone.org
    • +1 more
    Updated Mar 19, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David Peris; Dabao Sun Lu; Vilde Bruhn Kinneberg; Ine-Susanne Methlie; Malin Stapnes Dahl; Håvard Kauserud; Inger Skrede; Timothy Y. James (2022). Large-scale fungal strain sequencing unravels the molecular diversity in mating loci maintained by long-term balancing selection [Dataset]. http://doi.org/10.5061/dryad.fxpnvx0t4
    Dataset updated
    Mar 19, 2022
    Authors
    David Peris; Dabao Sun Lu; Vilde Bruhn Kinneberg; Ine-Susanne Methlie; Malin Stapnes Dahl; Håvard Kauserud; Inger Skrede; Timothy Y. James
    Description

    Large-scale fungal strain sequencing unravels the molecular diversity in mating loci maintained by long-term balancing selection. Publication in PLOS Genetics, 2022. https://doi.org/10.5061/dryad.fxpnvx0t4

    Additional information is described in the dedicated GitHub page.

    Information about data in the Dryad repository

    • iWGS_SPAdes_Assemblies.tar.gz: Compressed file with genome assemblies for individuals sequenced by Illumina technology. Additional information about these assemblies can be found in Supplementary Table 1 of the manuscript.
    • CrossingPictures.rar: Compressed picture files related to experimental crosses.
    • IndividualGeneAlignments_trimmed.zip: Compressed file with trimmed alignments for the coding sequences (CDS) and amino acid sequences (aa).
    • MATA.zip: Assembled MATA regions (.fas) and annotations (.gff) for each specimen.
    • MATB.zip: Assembled MATB regions (.fas) and annotations (.gff) for each specimen.
    • SourceData.rar: Compressed file with raw data to generate figures and tables in the manuscript:
        • IQTree_logFiles.tar.gz: IQTree log files with information to replicate the phylogenetic reconstruction represented in iTOL.
        • AllvsAll_distances.meg: Converted Average Nucleotide Identity (ANI) used for reconstructing a Neighbour-Joining tree.
        • BUSCO_MAT_info.csv: BUSCO annotation statistics and location on TA10106M1 genome.
        • dxy.csv: Absolute divergence statistic for BUSCO and MAT genes.
        • Fst.csv: Relative divergence statistic for BUSCO and MAT genes.
        • MKT.csv: Multilocus Hudson–Kreitman–Aguadé (HKA) test performed with HKAdirect 0.7b.
        • paml.csv: Average number of synonymous substitutions per synonymous sites (dS) and non-synonymous substitutions per non-synonymous sites (dN) for BUSCO and MAT genes.
        • pi.csv: Nucleotide diversity values for BUSCO and MAT genes.
        • tajimaD.csv: Tajima’s D values for BUSCO and MAT genes.
    • Annotation_Tabietinum_10106M1.zip: Compressed file with annotation files for TA10106M individual.
        • TA10106M1_BUSCO.gff: annotation file with the coordinates of BUSCO genes for the genome TA10106M1.
        • TA10106M1_nuclearV2.gff: MAKER pipeline annotation file with the coordinates of genes, CDS and other features for the genome TA10106M1. It also includes Interproscan, Blastp and KEGG (GenomeMaple KAAS) annotations.
        • TA10106M1_RepeatMasker.gff: annotation file with the coordinates of features annotated by RepeatMasker.

    Balancing selection, an evolutionary force that retains genetic diversity, has been detected in multiple genes and organisms, such as the sexual mating loci in fungi. However, quantifying the strength of balancing selection and defining the mating-related genes requires a large number of strains. In tetrapolar basidiomycete fungi, sexual type is determined by two unlinked loci, MATA and MATB. Genes in both loci define mating type identity, control successful mating and completion of the life cycle. These loci are usually highly diverse. Previous studies have speculated, based on culture crosses, that species of the non-model genus Trichaptum (Hymenochaetales, Basidiomycota) possess a tetrapolar mating system, with multiple alleles. Here, we sequenced a hundred and eighty strains of three Trichaptum species. We characterized the chromosomal location of MATA and MATB, the molecular structure of MAT regions and their allelic richness. The sequencing effort was sufficient to molecularly characterize multiple MAT alleles segregating before the speciation event of Trichaptum species. Analyses suggested that long-term balancing selection has generated trans-species polymorphisms. Mating sequences were classified in different allelic classes based on an amino acid identity (AAI) threshold supported by phylogenetics. 17,550 mating types were predicted based on the allelic classes. In vitro crosses allowed us to support the degree of allelic divergence needed for successful mating. Even with the high amount of divergence, key amino acids in functional domains are conserved. We conclude that the genetic diversity of mating loci in Trichaptum is due to long-term balancing selection, with limited recombination and duplication activity. The large number of sequenced strains highlighted the importance of sequencing multiple individuals from different species to detect the mating-related genes, the mechanisms generating diversity and the evolutionary forces maintaining them.

    Additional information can be found in the dedicated GitHub webpage: https://perisd.github.io/TriMAT/

    Description of methods included in the manuscript

  14. Lizard dataset

    • kaggle.com
    Updated Dec 6, 2021
    Cite
    Aadam (2021). Lizard dataset [Dataset]. https://www.kaggle.com/datasets/aadimator/lizard-dataset/data
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 6, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aadam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, there is a major bottleneck in the success of such approaches because supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in the field of CPath because the generation of detailed annotations usually demands the input of a pathologist to be able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. Yet, solely relying on automatic generation of annotations will limit the accuracy and reliability of ground truth. Therefore, to help overcome the above challenges, we propose a multi-stage annotation pipeline to enable the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We will publish the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.

    Link to the dataset paper.

    Citation

    @inproceedings{graham2021lizard,
     title={Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification},
     author={Graham, Simon and Jahanifar, Mostafa and Azam, Ayesha and Nimir, Mohammed and Tsang, Yee-Wah and Dodd, Katherine and Hero, Emily and Sahota, Harvir and Tank, Atisha and Benes, Ksenija and others},
     booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
     pages={684--693},
     year={2021}
    }
    

    Acknowledgements

    We would like to acknowledge the following institutions, where the images in this dataset originated from:

    • University Hospitals Coventry and Warwickshire, United Kingdom
    • Histo Pathology Diagnostic Center, Shanghai, China
    • Ruijin Hospital, Shanghai, China
    • Xijing Hospital, Xi'an, China
    • Shanghai Songjiang District Central Hospital, Shanghai, China
    • The National Cancer Institute (NCI), United States of America

  15. 3D-Genomics Database

    • rrid.site
    • scicrunch.org
    • +1 more
    Updated May 24, 2025
    Cite
    (2025). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430/resolver?q=*&i=rrid
    Dataset updated
    May 24, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is: to provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies.

    The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG.

    The following eukaryotic genomes are incorporated:

    • Anopheles gambiae: protein sequences from the mosquito genome
    • Arabidopsis thaliana: protein sequences from the Arabidopsis genome
    • Caenorhabditis briggsae: protein sequences from the C. briggsae genome
    • Caenorhabditis elegans: protein sequences from the worm genome
    • Ciona intestinalis: protein sequences from the sea squirt genome
    • Danio rerio: protein sequences from the zebrafish genome
    • Drosophila melanogaster: protein sequences from the fruitfly genome
    • Encephalitozoon cuniculi: protein sequences from the E. cuniculi genome
    • Fugu rubripes: protein sequences from the pufferfish genome
    • Guillardia theta: protein sequences from the G. theta genome
    • Homo sapiens: protein sequences from the human genome
    • Mus musculus: protein sequences from the mouse genome
    • Neurospora crassa: protein sequences from the N. crassa genome
    • Oryza sativa: protein sequences from the rice genome
    • Plasmodium falciparum: protein sequences from the P. falciparum genome
    • Rattus norvegicus: protein sequences from the rat genome
    • Saccharomyces cerevisiae: protein sequences from the yeast genome
    • Schizosaccharomyces pombe: protein sequences from the yeast genome

  16. Large-scale proteogenomics characterization of the Mycobacterium...

    • ebi.ac.uk
    Updated Jun 13, 2023
    Cite
    Eduardo Vieira de Souza (2023). Large-scale proteogenomics characterization of the Mycobacterium tuberculosis hidden proteome [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD042958
    Dataset updated
    Jun 13, 2023
    Authors
    Eduardo Vieira de Souza
    Variables measured
    Proteomics
    Description

    Traditional genome annotation methods exclude Open Reading Frames shorter than 300 codons (smORFs), which leaves a substantial portion of the proteome overlooked. Proteogenomics is a multi-omics approach that merges genomics, transcriptomics and proteomics to identify proteoforms and unannotated proteins from Mass Spectrometry data. Here, we employed our recently developed proteogenomics pipeline to aid genome annotation and identify hundreds of novel microproteins encoded by smORFs in the genome of Mycobacterium tuberculosis (Mtb). To avoid limitations regarding sensitivity, we used 680 Mass Spectrometry experiments in a large-scale approach, which let us classify the findings by different degrees of confidence using our machine learning model. After integrating the results with RNA-Seq datasets, we explore the biological relevance of the novel sequences and show that they are differentially expressed upon starvation and antibiotic treatment, and are co-expressed with many annotated genes that are vital for bacterial virulence. Moreover, some smORFs are located inside essential genomic segments and could be attractive targets for the development of new drugs. Altogether, our results should improve the current annotation of the Mtb proteome and guide follow-up studies that examine these microproteins thoroughly.

  17. Robust subset of taxa from the IBD dataset.

    • figshare.com
    xls
    Updated Dec 2, 2024
    Cite
    Baptiste Ruiz; Arnaud Belcour; Samuel Blanquart; Sylvie Buffet-Bataillon; Isabelle Le Huërou-Luron; Anne Siegel; Yann Le Cunff (2024). Robust subset of taxa from the IBD dataset. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012577.t004
    Available download formats: xls
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Baptiste Ruiz; Arnaud Belcour; Samuel Blanquart; Sylvie Buffet-Bataillon; Isabelle Le Huërou-Luron; Anne Siegel; Yann Le Cunff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The composition of the gut microbiota is a known factor in various diseases and has proven to be a strong basis for automatic classification of disease state. A need for a better understanding of microbiota data on the functional scale has since been voiced, as it would enhance these approaches’ biological interpretability. In this paper, we have developed a computational pipeline for integrating the functional annotation of the gut microbiota into an automatic classification process and facilitating downstream interpretation of its results. The process takes as input taxonomic composition data, which can be built from 16S or whole genome sequencing, and links each component to its functional annotations through interrogation of the UniProt database. A functional profile of the gut microbiota is built from this basis. Both profiles, microbial and functional, are used to train Random Forest classifiers to discern unhealthy from control samples. SPARTA ensures full reproducibility and exploration of inherent variability by extending state-of-the-art methods in three dimensions: increased number of trained random forests, selection of important variables with an iterative process, repetition of full selection process from different seeds. This process shows that the translation of the microbiota into functional profiles gives non-significantly different performances when compared to microbial profiles on 5 of 6 datasets. This approach’s main contribution however stems from its interpretability rather than its performance: through repetition, it also outputs a robust subset of discriminant variables. These selections were shown to be more consistent than those obtained by a state-of-the-art method, and their contents were validated through a manual bibliographic research. The interconnections between selected taxa and functional annotations were also analyzed and revealed that important annotations emerge from the cumulated influence of non-selected taxa.

  18. Annotated patches of whole oilseed rape (Brassica napus) plant images...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Evangeline Corcoran (2024). Annotated patches of whole oilseed rape (Brassica napus) plant images created using the MapReader pipeline [Dataset]. http://doi.org/10.5281/zenodo.12721801
    Dataset updated
    Jul 11, 2024
    Authors
    Evangeline Corcoran
    Description

    Background and Dataset Creation

    Patches derived from whole images of oilseed rape (Brassica napus) plants from the 'Collection of side view and top view RGB images of Brassica napus from a large scale, high throughput experiment' dataset and their associated annotations, which were used to train, validate and test patch classification models as described in the following paper: Corcoran, E., Hosseini, K., Siles, L., Kurup, S., and Ahnert, S. 2024. 'Automated dynamic phenotyping of whole oilseed rape (Brassica napus) plants from images collected under controlled conditions', Frontiers in Plant Science (under review).

    Patches were created and annotated using the MapReader pipeline. Please see:

    • Kasra Hosseini, Daniel C. S. Wilson, Kaspar Beelen, and Katherine McDonough. 2022. MapReader: a computer vision pipeline for the semantic exploration of maps at scale. In Proceedings of the 6th ACM SIGSPATIAL International Workshop on Geospatial Humanities (GeoHumanities '22). Association for Computing Machinery, New York, NY, USA, 8–19. https://doi.org/10.1145/3557919.3565812
    • Kasra Hosseini, Rosie Wood, Andy Smith, Katie McDonough, Daniel C.S. Wilson, Christina Last, Kalle Westerling, and Evangeline Mae Corcoran. “Living-with-machines/mapreader: End of Lwm”. Zenodo, July 27, 2023. https://doi.org/10.5281/zenodo.8189653.

    File structure

    Annotations: The 'annotations_six_label_sv_5.zip' folder contains annotations for the entire patch dataset in .csv format. These files have two columns, 'image_id' and 'label', in which:

    • 'image_id' = the path to each image patch
    • 'label' = the label assigned to each patch by the annotator, indicating which part of the plant the patch primarily contained, or whether it was part of the background.

    Labels: '0' = non-plant background, '1' = open flower, '2' = flower bud, '3' = leaf, '4' = green pod containing seed, '5' = branch.

    Patches: The 'b_napus_patch_data.zip' folder contains all patches in csv format. Each file is named in a consistent format, e.g. "patch-1580-330-1590-340-#2018-07-06_00_VIS_sv_000-0-0-0.png#.PNG", where '1580-330-1590-340' are the x and y coordinates of the patch boundary and '#2018-07-06_00_VIS_sv_000-0-0-0.png#' indicates the image in the 'Collection of side view and top view RGB images of Brassica napus from a large scale, high throughput experiment' from which the patch was derived.
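    The patch naming convention described above can be unpacked with a short regular expression. The sketch below is illustrative only: the names `PatchName` and `parse_patch_name` are hypothetical, and the coordinate order (min x, min y, max x, max y) is an assumption, since the description says only that the four numbers are the x and y coordinates of the patch boundary.

```python
import re
from typing import NamedTuple, Tuple


class PatchName(NamedTuple):
    coords: Tuple[int, int, int, int]  # assumed order: min_x, min_y, max_x, max_y
    source_image: str                  # file name of the whole-plant image


# patch-<n>-<n>-<n>-<n>-#<source image>#.<extension>
_PATCH_RE = re.compile(r"patch-(\d+)-(\d+)-(\d+)-(\d+)-#(.+)#\.png", re.IGNORECASE)


def parse_patch_name(filename: str) -> PatchName:
    """Extract the boundary coordinates and source image from a patch file name."""
    match = _PATCH_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised patch file name: {filename!r}")
    x0, y0, x1, y1 = (int(match.group(i)) for i in range(1, 5))
    return PatchName((x0, y0, x1, y1), match.group(5))


if __name__ == "__main__":
    example = "patch-1580-330-1590-340-#2018-07-06_00_VIS_sv_000-0-0-0.png#.PNG"
    print(parse_patch_name(example))
```

    Grouping patches by `source_image` would then allow all patches from one plant image to be collected together.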

  19. Data from: OKI2018_I69 assembly and annotation of the genome of an...

    • zenodo.org
    bin
    Updated Jun 3, 2021
    Cite
    Aleksandra Bliznina (2021). OKI2018_I69 assembly and annotation of the genome of an individual Oikopleura dioica from Okinawa [Dataset]. http://doi.org/10.5281/zenodo.4604144
    Available download formats: bin
    Dataset updated
    Jun 3, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aleksandra Bliznina
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A chromosome-scale assembly of the Oikopleura dioica genome from Okinawa, Japan. The contig assembly was generated from long-read Nanopore data using the Canu pipeline v1.8, and polished with short Illumina MiSeq reads using Pilon v1.22. Both Nanopore and Illumina data were generated from DNA of a single O. dioica male. Hi-C chromosomal conformation capture data were used to order and orient the contigs into scaffolds using the Juicer v1.6 and 3D de novo assembly (3D-DNA) pipelines. The OKI2018_I69_1.0 assembly comprises 19 scaffolds with an N50 of 16.2 Mbp (OKI2018_I69_1.0.fa). The total assembly length is 64.3 Mbp. The five longest scaffolds represent autosomal chromosomes (chr 1 and chr 2), and sex chromosomes split into pseudo-autosomal region (PAR) and X-specific (XSR) or Y-specific (YSR) regions. One of the smaller scaffolds represents a draft assembly of the mitochondrial genome (chrUn_12). The rest of the scaffolds are highly repetitive and were marked as unplaced. The OKI2018_I69_1.0 assembly was annotated with the AUGUSTUS v3.3 and MAKER v3.01.03 pipelines. Gene predictions from these tools were refined and merged using EvidenceModeler v1.1.1. To predict UTRs and alternative isoforms, the EVM models were updated using two rounds of the PASA pipeline, resulting in 18,485 transcript models distributed among 16,936 protein-coding genes (OKI2018_I69_1.0.gene_models.gff3).
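    The N50 statistic quoted in the description is the length of the scaffold at which the running total, taken from the longest scaffold downwards, first reaches half of the total assembly length. The sketch below shows that calculation; the function name `n50` is hypothetical and the lengths in the example are made up, not the scaffold lengths of OKI2018_I69_1.0.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and return the length at which
    the cumulative sum first reaches at least half of the total.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Hypothetical scaffold lengths for illustration.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

    With real data, the lengths would typically be read from the sequence records of the assembly FASTA file.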

  20. Motion-X++ Dataset

    • paperswithcode.com
    Updated Apr 26, 2025
    Cite
    Yuhong Zhang; Jing Lin; Ailing Zeng; Guanlin Wu; Shunlin Lu; Yurong Fu; Yuanhao Cai; Ruimao Zhang; Haoqian Wang; Lei Zhang (2025). Motion-X++ Dataset [Dataset]. https://paperswithcode.com/dataset/motion-x-1
    Dataset updated
    Apr 26, 2025
    Authors
    Yuhong Zhang; Jing Lin; Ailing Zeng; Guanlin Wu; Shunlin Lu; Yurong Fu; Yuanhao Cai; Ruimao Zhang; Haoqian Wang; Lei Zhang
    Description

    In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++’s significant benefits for generating expressive, precise, and natural motion with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoints estimation.
