100+ datasets found

The metagenome sequencing data have been deposited in the European...
datasets.ai
catalog.data.gov
Updated Aug 12, 2024
Cite
U.S. Environmental Protection Agency (2024). The metagenome sequencing data have been deposited in the European Nucleotide Archive (ENA). [Dataset]. https://datasets.ai/datasets/the-metagenome-sequencing-data-have-been-deposited-in-the-european-nucleotide-archive-ena-8ad17
Dataset updated
Aug 12, 2024
Dataset authored and provided by
U.S. Environmental Protection Agency
Description
The raw sequencing data for this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB40814 with the following BioSample numbers: SAMEA7465213 (sample DWDS A1), SAMEA7465214 (DWDS A2), SAMEA7465217 (DWDS B1), SAMEA7465218 (DWDS B2), SAMEA7465220 (DWDS C1), SAMEA7465221 (DWDS C2), SAMEA7465222 (DWDS D1), SAMEA7465223 (DWDS D2), SAMEA7465226 (DWDS E1), and SAMEA7465227 (DWDS E2).

This dataset is associated with the following publication: Gomez-Alvarez, V., S. Siponen, A. Kauppinen, A. Hokajarvi, A. Tiwari, A. Sarekoski, I.T. Miettinen, E. Torvinen, and T. Pitkanen. A comparative analysis employing a gene- and genome-centric metagenomic approach reveals changes in composition, function, and activity in waterworks with different treatment processes and source water in Finland. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 229: 119495, (2023).
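
The accessions listed in the description can be turned into a lookup table and a query URL. The following is a minimal sketch, not part of the original record: the endpoint (the ENA Portal API file report) and the field names are assumptions that should be checked against the current ENA documentation, and the helper names are hypothetical. The code only builds the URL; it does not contact ENA.

```python
from urllib.parse import urlencode

# Project and BioSample accessions copied from the dataset description.
PROJECT = "PRJEB40814"
BIOSAMPLES = {
    "SAMEA7465213": "DWDS A1",
    "SAMEA7465214": "DWDS A2",
    "SAMEA7465217": "DWDS B1",
    "SAMEA7465218": "DWDS B2",
    "SAMEA7465220": "DWDS C1",
    "SAMEA7465221": "DWDS C2",
    "SAMEA7465222": "DWDS D1",
    "SAMEA7465223": "DWDS D2",
    "SAMEA7465226": "DWDS E1",
    "SAMEA7465227": "DWDS E2",
}

# Assumption: ENA Portal API file-report endpoint; verify before relying on it.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "sample_accession", "fastq_ftp")):
    """Build (but do not fetch) a file-report URL for a project accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",          # assumed result type for raw reads
        "fields": ",".join(fields),    # assumed field names
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def sample_label(biosample):
    """Map a BioSample accession to its sample name in this study."""
    return BIOSAMPLES.get(biosample, "unknown sample")


if __name__ == "__main__":
    print(filereport_url(PROJECT))
    print(sample_label("SAMEA7465217"))
```

Keeping the accession-to-sample mapping in one dictionary makes it straightforward to relabel downloaded runs with the sample names used in the publication.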
European Nucleotide Archive (ENA)
dknet.org
rrid.site
Updated Jan 29, 2022
Cite
(2022). European Nucleotide Archive (ENA) [Dataset]. http://identifiers.org/RRID:SCR_006515
Unique identifier
https://identifiers.org/RRID:SCR_006515
Dataset updated
Jan 29, 2022
Description
Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement. The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with its partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use.
All ENA data are available through the ENA Browser. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.
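
The three-part data model described above (input information, output machine data, interpreted information) can be pictured as nested records. This is an illustrative sketch only: the class and field names are hypothetical and do not reflect ENA's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputInfo:
    """Input information: what went into the sequencing run."""
    sample: str
    experimental_setup: str
    machine_configuration: str


@dataclass
class MachineOutput:
    """Output machine data: what the sequencer produced."""
    reads: List[str] = field(default_factory=list)
    quality_scores: List[str] = field(default_factory=list)


@dataclass
class InterpretedInfo:
    """Interpreted information: results of downstream analysis."""
    assembly: Optional[str] = None
    mapping: Optional[str] = None
    functional_annotation: Optional[str] = None


@dataclass
class SequencingRecord:
    """One workflow, grouping the three layers (hypothetical container)."""
    input: InputInfo
    output: MachineOutput
    interpreted: InterpretedInfo = field(default_factory=InterpretedInfo)


if __name__ == "__main__":
    record = SequencingRecord(
        input=InputInfo("example sample", "shotgun metagenomics", "example instrument"),
        output=MachineOutput(reads=["ACGT"], quality_scores=["IIII"]),
    )
    print(record.interpreted.assembly)  # None until an analysis adds one
```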
GenBank
rrid.site
dknet.org
Updated Jul 27, 2025
Cite
(2025). GenBank [Dataset]. http://identifiers.org/RRID:SCR_002760
Unique identifier
https://identifiers.org/RRID:SCR_002760
Dataset updated
Jul 27, 2025
Description
NIH genetic sequence database that provides an annotated collection of all publicly available DNA sequences for almost 280,000 formally described species (Jan 2014). These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. GenBank is part of the International Nucleotide Sequence Database Collaboration, and daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
The OHEJP BeONE Project – Salmonella enterica genome assembly dataset
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Jul 24, 2023
Cite
Verónica Mixão; Miguel Pinto; João Paulo Gomes; Daniel Sobral; Holger Brendebach; Carlus Deneke; Simon Tausch; Adriano Di Pasquale; Claudia Swart-Coipan; Ewelina Iwan; Jörg Linde; Karin Lagesen; Liljana Petrovska; Mohammed Umaer Naseer; Rolf Sommer Kaas; Sandra Simon; Katrine Joensen; Kristoffer Kiil; Sofie Nielsen; Vítor Borges; INSA; APHA; BfR; DTU; FLI; IZSAM; NIPH; NVI; PIWET; RIVM; RKI; SSI (2023). The OHEJP BeONE Project – Salmonella enterica genome assembly dataset [Dataset]. http://doi.org/10.5281/zenodo.7802723
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.7802723
Dataset updated
Jul 24, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Verónica Mixão; Miguel Pinto; João Paulo Gomes; Daniel Sobral; Holger Brendebach; Carlus Deneke; Simon Tausch; Adriano Di Pasquale; Claudia Swart-Coipan; Ewelina Iwan; Jörg Linde; Karin Lagesen; Liljana Petrovska; Mohammed Umaer Naseer; Rolf Sommer Kaas; Sandra Simon; Katrine Joensen; Kristoffer Kiil; Sofie Nielsen; Vítor Borges; INSA; APHA; BfR; DTU; FLI; IZSAM; NIPH; NVI; PIWET; RIVM; RKI; SSI
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset comprises the genome assemblies of 1,540 Salmonella enterica samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7119735), comprising genome assemblies of 1,434 S. enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).

File “BeONE_Se_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers, in-silico Multi Locus Sequence Type and Serotype, and information regarding year of sampling, country and source.

The archive “BeONE_Se_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

Dataset selection and curation

This anonymized dataset of S. enterica genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57179. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,540 isolates passed the dataset curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019).
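
The curation rule described above can be written as a small decision function. This is a simplified sketch under stated assumptions: the record layout and function name are hypothetical, and the original authors recovered assemblies after manual inspection of a random subset, which this automated rule does not reproduce.

```python
from dataclasses import dataclass
from typing import FrozenSet

RESCUE_PARAM = "NumContamSNVs"   # the only QC flag eligible for recovery
MIN_PCT_CORRECT_SPECIES = 98.0   # recovery requires strictly more than this


@dataclass(frozen=True)
class AssemblyQC:
    """Hypothetical per-isolate QC summary (not the Aquamis output format)."""
    isolate: str
    failed_checks: FrozenSet[str]        # empty set means the assembly passed QC
    pct_reads_correct_species: float


def curation_status(qc: AssemblyQC) -> str:
    """Classify an assembly as 'pass', 'recovered', or 'excluded'."""
    if not qc.failed_checks:
        return "pass"
    only_contam_flag = qc.failed_checks == {RESCUE_PARAM}
    if only_contam_flag and qc.pct_reads_correct_species > MIN_PCT_CORRECT_SPECIES:
        return "recovered"
    return "excluded"


if __name__ == "__main__":
    examples = [
        AssemblyQC("iso1", frozenset(), 99.5),
        AssemblyQC("iso2", frozenset({"NumContamSNVs"}), 98.7),
        AssemblyQC("iso3", frozenset({"NumContamSNVs"}), 97.0),
        AssemblyQC("iso4", frozenset({"NumContamSNVs", "N50"}), 99.9),
    ]
    for qc in examples:
        print(qc.isolate, curation_status(qc))
```

Isolates classified as "recovered" correspond to the samples that the description says are labeled in the metadata file.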

Funding

This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.

Acknowledgements

We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
Table_1_The Viscum album Gene Space database.xlsx
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jul 6, 2023
Cite
Rugen, Nils; Senkler, Michael; Küster, Helge; Schröder, Lucie; Hohnjec, Natalija; Braun, Hans-Peter; Rupp, Oliver; Goesmann, Alexander (2023). Table_1_The Viscum album Gene Space database.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001049440
Dataset updated
Jul 6, 2023
Authors
Rugen, Nils; Senkler, Michael; Küster, Helge; Schröder, Lucie; Hohnjec, Natalija; Braun, Hans-Peter; Rupp, Oliver; Goesmann, Alexander
Description
The hemiparasitic flowering plant Viscum album (European mistletoe) is known for its very special life cycle, extraordinary biochemical properties, and extremely large genome. The size of its genome is estimated to be 30 times larger than the human genome and 600 times larger than the genome of the model plant Arabidopsis thaliana. To achieve insights into the Gene Space of the genome, which is defined as the space including and surrounding protein-coding regions, a transcriptome project based on PacBio sequencing has recently been conducted. A database resulting from this project contains sequences of 39,092 different open reading frames encoding 32,064 distinct proteins. Based on ‘Benchmarking Universal Single-Copy Orthologs’ (BUSCO) analysis, the completeness of the database was estimated to be in the range of 78%. To further develop this database, we performed a transcriptome project of V. album organs harvested in summer and winter based on Illumina sequencing. Data from both sequencing strategies were combined. The new V. album Gene Space database II (VaGs II) contains 90,039 sequences and has a completeness of 93% as revealed by BUSCO analysis. Sequences from other organisms, particularly fungi, which are known to colonize mistletoe leaves, have been removed. To evaluate the quality of the new database, proteome data of a mitochondrial fraction of V. album were re-analyzed. Compared to the original evaluation published five years ago, nearly 1000 additional proteins could be identified in the mitochondrial fraction, providing new insights into the Oxidative Phosphorylation System of V. album. The VaGs II database is available at https://viscumalbum.pflanzenproteomik.de/. Furthermore, all V. album sequences have been uploaded at the European Nucleotide Archive (ENA).
The OHEJP BeONE Project – Escherichia coli genome assembly dataset
data.niaid.nih.gov
data.europa.eu
Updated Jul 24, 2023
Cite
INSA (2023). The OHEJP BeONE Project – Escherichia coli genome assembly dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7267844
Dataset updated
Jul 24, 2023
Dataset provided by
IZSAM
FLI
Swart-Coipan, Claudia
Borges, Vítor
APHA
BfR
RKI
Pinto, Miguel
Kiil, Kristoffer
Mixão, Verónica
RIVM
Nielsen, Sofie
Iwan, Ewelina
Simon, Sandra
Tausch, Simon
SSI
Sommer Kaas, Rolf
Linde, Jörg
NVI
Sobral, Daniel
DTU
Joensen, Katrine
Gomes, João Paulo
Lagesen, Karin
INSA
Di Pasquale, Adriano
Brendebach, Holger
PIWET
Deneke, Carlus
NIPH
Petrovska, Liljana
Umaer Naseer, Mohammed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset comprises the genome assemblies of 308 Escherichia coli samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7120057), comprising genome assemblies of 1,999 E. coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).

File “BeONE_Ec_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers, in-silico Multi Locus Sequence Type and Serotype, and information regarding year of sampling, country and source.

The archive “BeONE_Ec_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

Dataset selection and curation

This anonymized dataset of E. coli genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57098. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 308 isolates passed the dataset curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2.

Funding

This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.

Acknowledgements

We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
EukRef: Phylogenetic curation of ribosomal RNA to enhance understanding of...
figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated May 31, 2023
Cite
Javier del Campo; Martin Kolisko; Vittorio Boscaro; Luciana F. Santoferrara; Serafim Nenarokov; Ramon Massana; Laure Guillou; Alastair Simpson; Cedric Berney; Colomban de Vargas; Matthew W. Brown; Patrick J. Keeling; Laura Wegener Parfrey (2023). EukRef: Phylogenetic curation of ribosomal RNA to enhance understanding of eukaryotic diversity and distribution [Dataset]. http://doi.org/10.1371/journal.pbio.2005849
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pbio.2005849
Dataset updated
May 31, 2023
Dataset provided by
PLOS Biology
Authors
Javier del Campo; Martin Kolisko; Vittorio Boscaro; Luciana F. Santoferrara; Serafim Nenarokov; Ramon Massana; Laure Guillou; Alastair Simpson; Cedric Berney; Colomban de Vargas; Matthew W. Brown; Patrick J. Keeling; Laura Wegener Parfrey
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Environmental sequencing has greatly expanded our knowledge of micro-eukaryotic diversity and ecology by revealing previously unknown lineages and their distribution. However, the value of these data is critically dependent on the quality of the reference databases used to assign an identity to environmental sequences. Existing databases contain errors and struggle to keep pace with rapidly changing eukaryotic taxonomy, the influx of novel diversity, and computational challenges related to assembling the high-quality alignments and trees needed for accurate characterization of lineage diversity. EukRef (eukref.org) is an ongoing community-driven initiative that addresses these challenges by bringing together taxonomists with expertise spanning the eukaryotic tree of life and microbial ecologists, who use environmental sequence data to develop reliable reference databases across the diversity of microbial eukaryotes. EukRef organizes and facilitates rigorous mining and annotation of sequence data by providing protocols, guidelines, and tools. The EukRef pipeline and tools allow users interested in a particular group of microbial eukaryotes to retrieve all sequences belonging to that group from International Nucleotide Sequence Database Collaboration (INSDC) (GenBank, the European Nucleotide Archive [ENA], or the DNA DataBank of Japan [DDBJ]), to place those sequences in a phylogenetic tree, and to curate taxonomic and environmental information for the group. We provide guidelines to facilitate the process and to standardize taxonomic annotations. The final outputs of this process are (1) a reference tree and alignment, (2) a reference sequence database, including taxonomic and environmental information, and (3) a list of putative chimeras and other artifactual sequences. These products will be useful for the broad community as they become publicly available (at eukref.org) and are shared with existing reference databases.
The OHEJP BeONE Project – Listeria monocytogenes genome assembly dataset
data.niaid.nih.gov
openagrar.de
Updated Jul 24, 2023
Cite
Simon, Sandra (2023). The OHEJP BeONE Project – Listeria monocytogenes genome assembly dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7267486
Dataset updated
Jul 24, 2023
Dataset provided by
IZSAM
FLI
Swart-Coipan, Claudia
Borges, Vítor
APHA
BfR
RKI
Pinto, Miguel
Kiil, Kristoffer
Mixão, Verónica
RIVM
Nielsen, Sofie
Iwan, Ewelina
Simon, Sandra
Tausch, Simon
SSI
Sommer Kaas, Rolf
Linde, Jörg
NVI
Sobral, Daniel
DTU
Joensen, Katrine
Gomes, João Paulo
Lagesen, Karin
INSA
Di Pasquale, Adriano
Brendebach, Holger
PIWET
Deneke, Carlus
NIPH
Petrovska, Liljana
Umaer Naseer, Mohammed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset comprises the genome assemblies of 1,426 Listeria monocytogenes samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7116878), comprising genome assemblies of 1,874 L. monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).

File “BeONE_Lm_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers and in-silico Multi Locus Sequence Type, and information regarding year of sampling, country and source.

The archive “BeONE_Lm_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

Dataset selection and curation

This anonymized dataset of L. monocytogenes genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57166. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,426 isolates passed the dataset curation step and were included in the final dataset.

Funding

This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.

Acknowledgements

We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
The OHEJP BeONE Project – Campylobacter jejuni genome assembly dataset
data.europa.eu
data.niaid.nih.gov
unknown
Updated May 16, 2024
Cite
Zenodo (2024). The OHEJP BeONE Project – Campylobacter jejuni genome assembly dataset [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7802717?locale=ga
Available download formats: unknown (74818)
Dataset updated
May 16, 2024
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset comprises the genome assemblies of 610 Campylobacter jejuni samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7120166), comprising genome assemblies of 3,076 C. jejuni samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).

File “BeONE_Cj_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers and in-silico Multi Locus Sequence Type, and information regarding year of sampling, country and source.

The archive “BeONE_Cj_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

Dataset selection and curation

This anonymized dataset of C. jejuni genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57119. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 610 isolates passed the dataset curation step and were included in the final dataset.

Funding

This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.

Acknowledgements

We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
EMBL2checklists: A Python package to facilitate the user-friendly submission...
plos.figshare.com
pdf
Updated Jun 1, 2023
Cite
Michael Gruenstaeudl; Yannick Hartmaring (2023). EMBL2checklists: A Python package to facilitate the user-friendly submission of plant and fungal DNA barcoding sequences to ENA [Dataset]. http://doi.org/10.1371/journal.pone.0210347
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0210347
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Michael Gruenstaeudl; Yannick Hartmaring
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background

The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, the range of available software tools that facilitate the preparation of sequence data for database submissions is low, especially for sequences generated via plant and fungal DNA barcoding. Current submission procedures can be complex and prohibitively time expensive for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant and fungal DNA barcoding.

Methods

A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called ‘checklists’) for a subsequent upload to the annotated sequence section of the European Nucleotide Archive (ENA). The software tool, titled ‘EMBL2checklists’, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates files that can be uploaded via the interactive Webin submission system of ENA.

Results

EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common DNA barcoding sequences into easily editable spreadsheets that require no further processing but their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical as well as an efficient command-line interface for its operation. The utility of the software is illustrated by its application in four recent investigations, including plant phylogenetic and fungal metagenomic studies.

Discussion

EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant and fungal biologists without bioinformatics expertise to generate submission-ready checklists from common DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.
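
To make the idea of a tab-delimited checklist concrete, the sketch below writes already-parsed sequence records to a tab-separated table using only the Python standard library. It does not use or imitate the EMBL2checklists API: the column names, the function name, and the example record are all hypothetical, and real ENA checklists define their own marker-specific columns.

```python
import csv
import io

# Hypothetical column set; real ENA checklists are marker-specific.
COLUMNS = ["entry_id", "organism", "marker", "sequence"]


def records_to_checklist(records):
    """Serialize dict records to a tab-delimited table with a header row.

    Keys not listed in COLUMNS are ignored; missing keys become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = [{"entry_id": "1", "organism": "Example species",
                "marker": "ITS", "sequence": "ACGT", "note": "ignored"}]
    print(records_to_checklist(example), end="")
```

Because the output is plain tab-separated text, it can be opened and edited in any spreadsheet program before upload, which is the property the abstract emphasizes.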
Nucleotide (DNA / RNA) and Protein sequences from the Australian dwelling...
researchdata.edu.au
Updated Jul 20, 2012
Cite
QFAB Bioinformatics (2012). Nucleotide (DNA / RNA) and Protein sequences from the Australian dwelling species Egernia margaretae [Dataset]. https://researchdata.edu.au/nucleotide-dna-rna-egernia-margaretae/53836
Dataset updated
Jul 20, 2012
Dataset provided by
QFAB
Authors
QFAB Bioinformatics
Area covered
Australia
Description
This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from the Australian dwelling organism Egernia margaretae.

The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the International Research Community.

The identification of the species Egernia margaretae as an Australian dwelling organism has been achieved by accessing the Australian Plant Census (APC) or Australian Faunal Directory (AFD) through the Atlas of Living Australia.
Genome assemblies and respective wg/cgMLST profiles of a diverse dataset...
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 24, 2023
Cite
Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2023). Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 1,999 Escherichia coli isolates [Dataset]. http://doi.org/10.5281/zenodo.7120058
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7120058
Dataset updated
Jul 24, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.

File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.

The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

The file “profiles/Ec_profiles_wgMLST.tsv” is a tab-separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” contain the 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.

Dataset selection and curation

With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
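The site-inclusion idea described above (keep only loci called in at least a given fraction of samples) can be illustrated with a short Python sketch. This is not ReporTree's implementation; the function name, the in-memory profile layout, and the use of "0" as the missing-call marker are assumptions made for illustration only.

```python
def filter_loci_by_call_rate(profiles, loci, threshold, missing="0"):
    """Return the loci called in at least `threshold` (0..1) of the samples.

    profiles: dict mapping sample name -> dict mapping locus -> allele call.
    missing:  marker for a locus that was not called (assumed to be "0").
    """
    n_samples = len(profiles)
    if n_samples == 0:
        return []
    kept = []
    for locus in loci:
        # Count samples in which this locus has a real allele call.
        called = sum(1 for calls in profiles.values() if calls[locus] != missing)
        if called / n_samples >= threshold:
            kept.append(locus)
    return kept


# Toy example with hypothetical samples and loci.
toy_profiles = {
    "s1": {"locA": "1", "locB": "2", "locC": "0"},
    "s2": {"locA": "1", "locB": "0", "locC": "0"},
    "s3": {"locA": "3", "locB": "2", "locC": "5"},
    "s4": {"locA": "1", "locB": "4", "locC": "5"},
}
toy_loci = ["locA", "locB", "locC"]
core_100 = filter_loci_by_call_rate(toy_profiles, toy_loci, 1.0)
core_75 = filter_loci_by_call_rate(toy_profiles, toy_loci, 0.75)
```

Raising the threshold can only shrink the retained set, which is consistent with the 100% threshold yielding the smallest schema (465 loci) in the dataset described above.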
Genome assemblies and respective cgMLST profiles of a diverse dataset...
data.europa.eu
data.niaid.nih.gov
unknown
Updated Jul 3, 2025
+ more versions
Cite
Zenodo (2025). Genome assemblies and respective cgMLST profiles of a diverse dataset comprising 1,874 Listeria monocytogenes isolates [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7230003?locale=bg
Explore at:
Available download formats: unknown (183550)
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset comprises the genome assemblies and respective 1,748-loci core-genome (cg) Multiple Locus Sequence Type (MLST) profiles [Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,874 Listeria monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 204 different STs are represented in this dataset, with ST121, ST6, ST9, ST1 and ST155 being in the top 5 and, together, corresponding to 37.9% of the dataset.

File “Lm_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.

The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

The file “profiles/Lm_profile.tsv” is a tab-separated file with the 1,748-loci cgMLST profile of each isolate presented in the metadata file. These profiles were determined as explained below.

Dataset selection and curation

With the objective of creating a diverse dataset of L. monocytogenes genome assemblies, we collected information about the genetic diversity (STs) of the isolates available in the BIGSdb-Lm database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,957 samples associated with three previous studies (Moura et al. 2016; Maury et al. 2017; Painset et al. 2019). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,874 isolates passed the dataset curation step and were included in the final dataset. cgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 1,748-loci Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022) and downloaded on June 23rd, 2022.

Acknowledgements

We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
Nucleotide (DNA / RNA) and Protein sequences from the Australian dwelling...
researchdata.edu.au
Updated Jul 20, 2012
Cite
QFAB Bioinformatics (2012). Nucleotide (DNA / RNA) and Protein sequences from the Australian dwelling species Egernia luctuosa [Dataset]. https://researchdata.edu.au/nucleotide-dna-rna-egernia-luctuosa/52503
Explore at:
Dataset updated
Jul 20, 2012
Dataset provided by
QFAB
Authors
QFAB Bioinformatics
Area covered
Australia
Description
This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from the Australian dwelling organism Egernia luctuosa.

The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the International Research Community.

The identification of the species Egernia luctuosa as an Australian dwelling organism has been achieved by accessing the Australian Plant Census (APC) or Australian Faunal Directory (AFD) through the Atlas of Living Australia.
Additional material for article "Exploring bacterial diversity via a curated...
figshare.com
zip
Updated Sep 17, 2021
+ more versions
Cite
Grace Blackwell; Martin Hunt; Kerri Malone; Leandro Lima; Gal Horesh; Blaise T. F. Alako; Nicholas R. Thomson; Zamin Iqbal (2021). Additional material for article "Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences" [Dataset]. http://doi.org/10.6084/m9.figshare.16437939.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.16437939.v1
Dataset updated
Sep 17, 2021
Dataset provided by
Figshare (http://figshare.com/)
Authors
Grace Blackwell; Martin Hunt; Kerri Malone; Leandro Lima; Gal Horesh; Blaise T. F. Alako; Nicholas R. Thomson; Zamin Iqbal
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function, and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality-checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COBS index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g. gene, mutation or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The over-represented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
Screening of AMR-related genes in the genomes of Vibrio parahaemolyticus...
zenodo.org
bin, csv, pdf
Updated Jul 23, 2024
Cite
Jaime Martinez-Urtaza; Jordi Manuel Cabrera-Gumbau (2024). Screening of AMR-related genes in the genomes of Vibrio parahaemolyticus strains isolated in Europe from clinical, environmental and other sources [Dataset]. http://doi.org/10.5281/zenodo.12514500
Explore at:
Available download formats: bin, csv, pdf
Unique identifier
https://doi.org/10.5281/zenodo.12514500
Dataset updated
Jul 23, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jaime Martinez-Urtaza; Jordi Manuel Cabrera-Gumbau
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The distribution of antimicrobial resistance (AMR) genes for the EU and European Free Trade Association (EFTA) countries was obtained from a global collection of nearly 10,000 Vibrio parahaemolyticus genomes. Some of the strains are from the collection of Prof. Jaime Martinez-Urtaza (Department of Genetics and Microbiology, Universitat Autònoma de Barcelona) or are part of ongoing studies to expand the genome collection; other genomes were retrieved from the European Nucleotide Archive (ENA at https://www.ebi.ac.uk/ena/browser/home) and the National Center for Biotechnology Information (NCBI) [GenBank at https://www.ncbi.nlm.nih.gov/genbank/; RefSeq at https://www.ncbi.nlm.nih.gov/refseq/; SRA at https://www.ncbi.nlm.nih.gov/sra]. AMR genes were detected with a resistance-gene detection pipeline based on one of the standard databases (CARD database at https://card.mcmaster.ca/). The phylogenetic tree includes the reference genome from Japan ("Osaka"). The RIMD 2210633 strain was added as the global reference strain, which has historically been used for all phylogenetic analyses of V. parahaemolyticus. The metadata includes the source of the strain, i.e., country, origin (clinical, environmental or unclear), date of isolation, and subtype. The antibiotic-resistance genes are shown as present, absent or not applicable. To build the European V. parahaemolyticus ARG tree, the Parsnp tool, a fast core-genome multi-aligner and SNP detector from the Harvest suite, was used (Treangen et al., 2014). Parsnp calculates the MUMi distances between the reference genome (RIMD_2210633) and each of the 152 genomes used in this study. The resulting Newick-formatted core-genome SNP tree was then uploaded to the web tool iTOL (Letunic and Bork, 2021), midpoint rooted, and annotated with the sample metadata.

The accession IDs for the genomes included in the metadata are accessible in the following databases according to the first characters:
* GCA: GenBank (https://www.ncbi.nlm.nih.gov/genbank/)
* GCF: RefSeq (https://www.ncbi.nlm.nih.gov/refseq/)
* ERR: ENA (https://www.ebi.ac.uk/ena/browser/home)
* SRR: SRA (https://www.ncbi.nlm.nih.gov/sra)
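The prefix-to-database mapping listed above can be expressed as a small lookup. The following Python sketch is illustrative only: the function and dictionary names are hypothetical, and it covers just the four prefixes named in this description, not every accession type these archives issue.

```python
# Prefix -> (database name, landing page), taken from the list above.
ACCESSION_PREFIXES = {
    "GCA": ("GenBank", "https://www.ncbi.nlm.nih.gov/genbank/"),
    "GCF": ("RefSeq", "https://www.ncbi.nlm.nih.gov/refseq/"),
    "ERR": ("ENA", "https://www.ebi.ac.uk/ena/browser/home"),
    "SRR": ("SRA", "https://www.ncbi.nlm.nih.gov/sra"),
}


def database_for_accession(accession):
    """Return (database, url) for an accession ID, or None if the prefix is unknown."""
    prefix = accession.strip().upper()[:3]
    return ACCESSION_PREFIXES.get(prefix)


# "ERR0000001" and "XYZ123" are made-up IDs used only to exercise the lookup.
ena_hit = database_for_accession("ERR0000001")
unknown_hit = database_for_accession("XYZ123")
```

Returning None for unrecognised prefixes keeps the caller responsible for deciding how to handle IDs outside the four listed families.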

References

Letunic I and Bork P, 2021. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res, 49:W293-W296. doi: 10.1093/nar/gkab301

Treangen TJ, Ondov BD, Koren S and Phillippy AM, 2014. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol, 15:524. doi: 10.1186/s13059-014-0524-x
Data from: AmelHap pilot: raw data
data.niaid.nih.gov
ekoizpen-zientifikoa.ehu.eus
Updated Jun 22, 2022
+ more versions
Cite
Barnett, Mark (2022). AmelHap pilot: raw data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6563236
Explore at:
Dataset updated
Jun 22, 2022
Dataset provided by
Talenti, Andrea
Barnett, Mark
Wragg, David
Parejo, Melanie
Vignal, Alain
Richardson, Matthew
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Honey bee Apis mellifera drones are typically haploid, developing from an unfertilized egg, inheriting only their queen’s alleles and none from the many drones she mated with. Being haploid, the ordered combination or ‘phase’ of alleles is known, making drones a valuable haplotype resource. We collated whole genome sequence data for 688 drones, including 45 newly sequenced Scottish drones, which collectively represent 13 countries, 7 subspecies and various hybrid strains. After alignment to the reference assembly Amel_HAv3.1, and haploid variant calling, we identified 18.9M variants.

Whole-genome sequencing data underpinning the dataset is available from the European Nucleotide Archive (ENA), https://www.ebi.ac.uk/ena, with the project accession codes: PRJEB16533, PRJNA311274, PRJNA363032, PRJNA516678, PRJNA544324, and PRJEB39369.

Sequencing reads were aligned to the Amel_HAv3.1 reference genome using BWA-MEM v0.7.17. Reads were sorted with SAMtools v1.9 and duplicates marked (MarkDuplicates) with GATK v4.0.11.0. Variants for each sample were called using GATK’s HaplotypeCaller with the following non-default parameters: --ERC GVCF, --sample-ploidy 1 and -A AlleleFraction. Joint variant calling was performed across all samples using GATK’s GenomicsDBImport and GenotypeGVCFs with --sample-ploidy 1 and a window size of 2.5 Mb.
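As a rough illustration of the per-sample calling step, the sketch below assembles a HaplotypeCaller command line in Python using the non-default parameters named above. It is not the authors' script: the function name and file paths are hypothetical, and the ERC option is written in its long form, --emit-ref-confidence.

```python
def haplotypecaller_args(reference, bam, out_gvcf):
    """Build the argument list for a haploid, GVCF-mode HaplotypeCaller run.

    All three paths are placeholders supplied by the caller.
    """
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,                   # reference FASTA (Amel_HAv3.1 in the text)
        "-I", bam,                         # sorted, duplicate-marked alignments
        "-O", out_gvcf,                    # per-sample GVCF output
        "--emit-ref-confidence", "GVCF",   # long form of the ERC option
        "--sample-ploidy", "1",            # drones are haploid
        "-A", "AlleleFraction",            # extra annotation requested in the text
    ]


# Hypothetical file names; the list could be handed to subprocess.run(..., check=True).
example_args = haplotypecaller_args("Amel_HAv3.1.fa", "drone01.bam", "drone01.g.vcf.gz")
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later executed with the subprocess module.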

This dataset is unfiltered, and contains all variants regardless of quality or call rate.
Supplemental data from the genome assembly and annotation of the Clouded...
data.europa.eu
researchdata.se
unknown
Updated Jun 25, 2024
Cite
Uppsala universitet (2024). Supplemental data from the genome assembly and annotation of the Clouded Apollo Butterfly (Parnassius mnemosyne) [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-17044-scilifelab-25908748?locale=pl
Explore at:
Available download formats: unknown
Dataset updated
Jun 25, 2024
Dataset authored and provided by
Uppsala universitet
Description
This dataset contains supplementary data from the genome sequencing of the Clouded Apollo Butterfly (Parnassius mnemosyne), published in:

Höglund, J., Dias, G., Olsen, R. A., Soares, A., Bunikis, I., Talla, V., & Backström, N. (2024). A Chromosome-Level Genome Assembly and Annotation for the Clouded Apollo Butterfly (Parnassius mnemosyne): A Species of Global Conservation Concern. Genome Biology and Evolution, 16(2), evae031. https://doi.org/10.1093/gbe/evae031

Previous data from the project have been deposited at the European Nucleotide Archive (ENA) in the umbrella project PRJEB76269 (https://www.ebi.ac.uk/ena/browser/view/PRJEB76269).

The data contained in this archive at SciLifeLab Data Repository describe the genome assembly (ENA accession: GCA_963668995.1 (https://www.ebi.ac.uk/ena/browser/view/GCA_963668995.1)), and the mitochondrial genome assembly (ENA accession: OZ075093.1 (https://www.ebi.ac.uk/ena/browser/view/OZ075093.1)).

Below follows a brief description of each file. The information on the methods used to generate the files was adapted from Höglund et al. 2024.

pmne_functional_edit1.gff.gz contains the functional annotation (protein coding genes) of the primary genome assembly (GCA_963668995.1 (https://www.ebi.ac.uk/ena/browser/view/GCA_963668995.1)). This is the original file that was submitted to ENA. A derived version of the file is available from NCBI; the NCBI version was generated from the EMBL records of each annotated gene and differs in that it, for instance, uses a different naming scheme for the seqid column and the locus tags. The NCBI version is available at this link (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/963/668/995/GCA_963668995.1_Parnassius_mnemosyne_n_2023_11/GCA_963668995.1_Parnassius_mnemosyne_n_2023_11_genomic.gff.gz).

The genes were predicted using BRAKER (v3.03), GALBA (v1.0.6), and GeneMarkS-T (v5.1). The resulting gene models were combined and filtered using TSEBRA (version: long_reads branch commit 1f2614). The combined gene model was functionally annotated by the NBIS nextflow pipeline v2.0.0 (https://github.com/NBISweden).

pmne_Illumina_RNAseq_StringTie_sorted-transcripts_match.gff.gz contains a transcript assembly of the Illumina RNAseq reads (ENA accession: ERX11559451 (https://www.ebi.ac.uk/ena/browser/view/ERX11559451)). The reads were aligned to the genome with HiSat2 (v2.1.0) and then assembled with StringTie (v2.2.1).

pmne_mtdna.gff.gz contains the functional annotation of the mitochondrial genome assembly (ENA accession: OZ075093.1 (https://www.ebi.ac.uk/ena/browser/view/OZ075093.1)). This is the original file that was submitted to ENA. The annotation was generated using MitoFinder (v1.4.1).

pmne_ncRNAs.gff.gz contains the annotation of putative non-coding RNA (ncRNA) genes. The prediction was done with Infernal (v1.1.4) and the Rfam (v14.1) covariance models.

pmne_tRNAs_and_pseudogenes.gff.gz contains the annotation of putative tRNA genes and pseudogenes. The prediction was done with tRNAscan-SE (v2.0.12).

pmne_PacBio_isoseq.sorted.bam contains the PacBio IsoSeq transcripts (ENA accession: ERX11559436 (https://www.ebi.ac.uk/ena/browser/view/ERX11559436)) aligned to the primary genome assembly.

pmne_repeat_library.fa.gz contains the nucleotide sequences of the predicted repeats in fasta format. The prediction was done with RepeatModeler2 (v2.0.2a).

Available variables

For a description of the column headers of the files, please see the following links to the documentation of the different file formats.

The GFF3 format (.gff) is described here: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

The BAM format (.bam) is a compressed version of the SAM format, both of which are described here: https://samtools.github.io/hts-specs/SAMv1.pdf

The fasta (.fa) format is described here: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/

Contact

For questions about this dataset, please contact: jacob.hoglund@ebc.uu.se, niclas.backstrom@ebc.uu.se
Nucleotide (DNA / RNA) and Protein sequences from the Australian research...
researchdata.edu.au
Updated Jul 20, 2012
+ more versions
Cite
QFAB Bioinformatics (2012). Nucleotide (DNA / RNA) and Protein sequences from the Australian research institution Western Health [Dataset]. https://researchdata.edu.au/nucleotide-dna-rna-western-health/53125
Explore at:
Dataset updated
Jul 20, 2012
Dataset provided by
QFAB
Authors
QFAB Bioinformatics
Description
This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from the Australian research institution Western Health. The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the International Research Community.
Nucleotide (DNA / RNA) and Protein sequences from the Australian research...
researchdata.edu.au
Updated Jul 20, 2012
Cite
QFAB Bioinformatics (2012). Nucleotide (DNA / RNA) and Protein sequences from the Australian research institution University of Notre Dame [Dataset]. https://researchdata.edu.au/nucleotide-dna-rna-notre-dame/56616
Explore at:
Dataset updated
Jul 20, 2012
Dataset provided by
QFAB
Authors
QFAB Bioinformatics
Description
This data collection contains all currently published nucleotide (DNA/RNA) and protein sequences from the Australian research institution University of Notre Dame. The nucleotide (DNA/RNA) and protein sequences have been sourced through the European Nucleotide Archive (ENA) and Universal Protein Resource (UniProt), databases that contain comprehensive sets of nucleotide (DNA/RNA) and protein sequences from all organisms that have been published by the International Research Community.
