Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement. The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with its partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use.
All ENA data are available through the ENA Browser. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.
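Since the ENA Browser is addressed through URLs, a record link can be assembled from its accession. The following is a minimal sketch in Python, assuming only the `https://www.ebi.ac.uk/ena/browser/view/<accession>` pattern that appears in the dataset descriptions on this page. The helper name `ena_view_url` is hypothetical, and the code does not contact ENA or assume any other endpoint.

```python
from urllib.parse import quote

# URL prefix as it appears in the dataset descriptions on this page.
ENA_BROWSER_VIEW = "https://www.ebi.ac.uk/ena/browser/view/"


def ena_view_url(accession: str) -> str:
    """Return the ENA Browser view URL for an accession (hypothetical helper)."""
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    # Percent-encode anything that is not URL-safe; typical accessions
    # (letters, digits, '_' and '.') pass through unchanged.
    return ENA_BROWSER_VIEW + quote(accession, safe="")


if __name__ == "__main__":
    for acc in ("PRJEB76269", "GCA_963668995.1", "OZ075093.1"):
        print(ena_view_url(acc))
```

The function only builds strings, so it can be used to generate links for a list of accessions without any network access.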
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains supplementary data from the genome sequencing of the Clouded Apollo Butterfly (Parnassius mnemosyne), published in:
Höglund, J., Dias, G., Olsen, R. A., Soares, A., Bunikis, I., Talla, V., & Backström, N. (2024). A Chromosome-Level Genome Assembly and Annotation for the Clouded Apollo Butterfly (Parnassius mnemosyne): A Species of Global Conservation Concern. Genome Biology and Evolution, 16(2), evae031. https://doi.org/10.1093/gbe/evae031
Previous data from the project has been deposited at the European Nucleotide Archive (ENA) in the umbrella project PRJEB76269 (https://www.ebi.ac.uk/ena/browser/view/PRJEB76269).
The data contained in this archive at SciLifeLab Data Repository describe the genome assembly (ENA accession: GCA_963668995.1, https://www.ebi.ac.uk/ena/browser/view/GCA_963668995.1) and the mitochondrial genome assembly (ENA accession: OZ075093.1, https://www.ebi.ac.uk/ena/browser/view/OZ075093.1).
Below follows a brief description of each file. The information on the methods used to generate the files was adapted from Höglund et al. 2024.
The genes were predicted using BRAKER (v3.03), GALBA (v1.0.6), and GeneMarkS-T (v5.1). The resulting gene models were combined and filtered using TSEBRA (version: long_reads branch commit 1f2614). The combined gene model was functionally annotated by the NBIS nextflow pipeline v2.0.0 (https://github.com/NBISweden).
pmne_Illumina_RNAseq_StringTie_sorted-transcripts_match.gff.gz contains a transcript assembly of the Illumina RNAseq reads (ENA accession: ERX11559451, https://www.ebi.ac.uk/ena/browser/view/ERX11559451). The reads were aligned to the genome with HiSat2 (v2.1.0) and then assembled with StringTie (v2.2.1).
pmne_mtdna.gff.gz contains the functional annotation of the mitochondrial genome assembly (ENA accession: OZ075093.1, https://www.ebi.ac.uk/ena/browser/view/OZ075093.1). This is the original file that was submitted to ENA. The annotation was generated using MitoFinder (v1.4.1).
pmne_ncRNAs.gff.gz contains the annotation of putative non-coding RNA (ncRNA) genes. The prediction was done with Infernal (v1.1.4) and the Rfam (v14.1) covariance models.
pmne_tRNAs_and_pseudogenes.gff.gz contains the annotation of putative tRNA genes and pseudogenes. The prediction was done with tRNAscan-SE (v2.0.12).
pmne_PacBio_isoseq.sorted.bam contains the PacBio IsoSeq transcripts (ENA accession: ERX11559436, https://www.ebi.ac.uk/ena/browser/view/ERX11559436) aligned to the primary genome assembly.
pmne_repeat_library.fa.gz contains the nucleotide sequences of the predicted repeats in fasta format. The prediction was done with RepeatModeler2 (v2.0.2a).
Available variables
For a description of the column headers of the files, please see the following links to the documentation of the different file formats.
The GFF3 format (.gff) is described here: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
The BAM format (.bam) is a compressed version of the SAM format, both of which are described here: https://samtools.github.io/hts-specs/SAMv1.pdf
The fasta (.fa) format is described here: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/
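As an illustration of the GFF3 column layout referenced above, the sketch below reads a gzipped GFF3 file into simple records. It assumes the nine tab-separated columns defined by the GFF3 specification; the class and function names are hypothetical, the example line is invented for demonstration, and the parser skips directives and stops at an embedded `##FASTA` section rather than handling every feature of the format.

```python
import gzip
from typing import Dict, Iterator, NamedTuple
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    """One GFF3 feature line (the nine columns of the specification)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse a single tab-separated GFF3 feature line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # Column 9 uses URL-style percent-encoding for reserved characters.
            attributes[unquote(key)] = unquote(value)
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)


def read_gff3(path: str) -> Iterator[Gff3Feature]:
    """Yield features from a gzipped GFF3 file, skipping comments and directives."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # an embedded sequence section follows; no more features
            if not line.strip() or line.startswith("#"):
                continue
            yield parse_gff3_line(line)


# Invented example line, not taken from the dataset files.
example = parse_gff3_line("chr1\tBRAKER\tgene\t100\t900\t.\t+\t.\tID=g1;Name=example%20gene")
print(example.type, example.start, example.end, example.attributes["Name"])
```

A file such as pmne_ncRNAs.gff.gz could then be iterated with `read_gff3("pmne_ncRNAs.gff.gz")`, assuming it follows the specification linked above.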
Contact
For questions about this dataset, please contact: jacob.hoglund@ebc.uu.se niclas.backstrom@ebc.uu.se
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present data files are the source files of the annotation output from the whole-genome sequencing of a rare actinobacterium, Barrientosiimonas humi gen. nov., sp. nov. 39T, from Antarctica.
The whole-genome sequence dataset of B. humi has been deposited in the European Nucleotide Archive (ENA) under accession number PRJEB44986 / ERP129097. Direct URL to data: https://www.ebi.ac.uk/ena/browser/view/PRJEB44986
https://ega-archive.org/dacs/EGAC00001001105
This dataset is currently hosted by the European Nucleotide Archive. To access the data contained within the dataset, please follow the link below: https://www.ebi.ac.uk/ena/browser/view/PRJEB39323 The dataset consists of 20 snRNA-seq BAM files from 10X v2: 5 samples from postmortem white matter tissue from non-neurological controls and 15 samples from different MS lesions from the white matter tissue of 4 postmortem progressive MS patients.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains organelle genome sequences of globally collected Brassica accessions. The chloroplast genomes comprise 1,327 natural and 31 synthetic B. napus, 90 B. rapa and 107 B. oleracea accessions, and the mitochondrial genomes comprise 1,457 natural and 31 synthetic B. napus, 183 B. rapa and 104 B. oleracea accessions. The genome sequencing data of natural rapeseed accessions were obtained from the NCBI database under SRP155312, PRJNA430009 and PRJNA358784. The raw sequencing data of 20 synthetic B. napus accessions can be found in the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/home) under PRJEB5974 and PRJEB6069. The raw sequences of B. rapa and B. oleracea can be found in the NCBI database under BioProject accession PRJNA312457. After quality checking, we first mapped reads to the published chloroplast (cp) and mitochondrial (mt) genomes of six Brassica species. The mapped paired-end reads were then extracted and de novo assembled into cp and mt genomes with NOVOPlasty and ARC (http://ibest.github.io/ARC/), respectively.
The EBI genomes pages give access to a large number of complete genomes including bacteria, archaea, viruses, phages, plasmids, viroids and eukaryotes. Methods using whole genome shotgun data are used to gain a large amount of genome coverage for an organism. WGS data for a growing number of organisms are being submitted to DDBJ/EMBL/GenBank. Genome entries have been listed in their appropriate category which may be browsed using the website navigation tool bar on the left. While organelles are all listed in a separate category, any from Eukaryota with chromosome entries are also listed in the Eukaryota page. Within each page, entries are grouped and sorted at the species level with links to the taxonomy page for that species separating each group. Within each species, entries whose source organism has been categorized further are grouped and numbered accordingly. Links are made to:
* taxonomy
* complete EMBL flatfile
* CON files
* lists of CON segments
* Project
* Proteomes pages
* FASTA file of Proteins
* list of Proteins
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset
This dataset comprises the genome assemblies of 1,540 Salmonella enterica samples collected by the BeONE Consortium on behalf of the One Health European Joint Programme “BeONE: Building Integrative Tools for One Health Surveillance” (https://onehealthejp.eu/jrp-beone/). Additionally, a complementary dataset is also made available (https://zenodo.org/record/7119735), comprising genome assemblies of 1,434 S. enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).
File “BeONE_Se_metadata.xlsx” contains the genome assembly statistics for each isolate, including European Nucleotide Archive accession numbers, in-silico Multi Locus Sequence Type and Serotype, and information regarding year of sampling, country and source.
The archive “BeONE_Se_assemblies.zip” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.
Dataset selection and curation
This anonymized dataset of S. enterica genome assemblies was generated using Next Generation Sequencing data collected within the BeONE Consortium available at the European Nucleotide Archive under BioProject Accession Number PRJEB57179. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,540 isolates passed the dataset curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019).
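The inclusion rule described above (pass QC, or fail only on "NumContamSNVs" with more than 98% of reads assigned to the correct species) can be written down as a small predicate. This is only a sketch of that rule as stated in the text: the record layout and field names are hypothetical and do not reflect the actual Aquamis output, and the manual inspection step mentioned in the description is not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AssemblyQC:
    """Hypothetical per-assembly QC summary (not the Aquamis output format)."""
    isolate: str
    failed_checks: FrozenSet[str]        # names of QC parameters that failed
    pct_reads_correct_species: float     # percentage of reads from the expected species


def include_in_dataset(qc: AssemblyQC, min_pct: float = 98.0) -> bool:
    """Return True if the assembly belongs in the final dataset under the stated rule."""
    if not qc.failed_checks:
        return True  # passed all QC checks
    # Recover assemblies flagged exclusively because of NumContamSNVs,
    # provided the share of reads from the correct species exceeds the threshold.
    only_contam_snvs = qc.failed_checks == frozenset({"NumContamSNVs"})
    return only_contam_snvs and qc.pct_reads_correct_species > min_pct


# Invented records for illustration only.
records = [
    AssemblyQC("isolate_A", frozenset(), 99.7),
    AssemblyQC("isolate_B", frozenset({"NumContamSNVs"}), 99.1),
    AssemblyQC("isolate_C", frozenset({"NumContamSNVs"}), 97.5),
    AssemblyQC("isolate_D", frozenset({"NumContamSNVs", "N50"}), 99.9),
]
kept = [r.isolate for r in records if include_in_dataset(r)]
print(kept)
```

Keeping the threshold as a parameter makes it easy to see how many assemblies would be recovered under a stricter or looser cut-off.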
Funding
This work was supported by funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 773830: One Health European Joint Programme.
Acknowledgements
We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
JRP24-FBZSH9-BEONE WP1 deliverable 1.2.
WP Leader: Vítor Borges (INSA)
Other contributors: Verónica Mixão (INSA), Miguel Pinto (INSA), Holger Brendebach (BfR), Simon Tausch (BfR), Carlus Deneke (BfR), Karin Lagesen (NVI)
To contribute to the specific objectives of the BeOne project, WP1-T2 compiled an anonymized dataset (including sequencing reads and respective metadata) aiming to capture the genomic diversity within the populations of Listeria monocytogenes, Salmonella enterica, Escherichia coli (STEC) and Campylobacter jejuni. This dataset includes data shared by the BeOne partners and comprises a total of 3,884 isolates, for which the anonymized sequencing reads were released in the European Nucleotide Archive (ENA) and the anonymized genome assemblies in the Zenodo repository [1,426 L. monocytogenes (accession: PRJEB57166 and 10.5281/zenodo.7267486); 1,540 S. enterica (accession: PRJEB57179 and 10.5281/zenodo.7267785); 308 E. coli (accession: PRJEB57098 and 10.5281/zenodo.7267844); 610 C. jejuni (accession: PRJEB57119 and 10.5281/zenodo.7267879)].
As a complement to the BeOne dataset, additional samples were carefully selected among the WGS data publicly available at the beginning of the analysis (November 2021) in ENA or the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), in order to ensure the representativeness of the genomic diversity within public databases (assessed in terms of sequence type or serotype, depending on the species). In the end, a so-called “public dataset” with the 8,383 samples that passed the curation step was released in Zenodo repository [1,874 L. monocytogenes (accession: 10.5281/zenodo.7116878); 1,434 S. enterica (accession: 10.5281/zenodo.7119735), 1,999 E. coli (accession: 10.5281/zenodo.7120057); 3,076 C. jejuni (accession: 10.5281/zenodo.7120166)].
Ecological restoration and plant re-introductions aim to create plant populations that are genetically similar to natural populations to preserve the regional gene pool, yet genetically diverse to allow adaptation to a changing environment. For this purpose, seeds for restoration are increasingly sourced from multiple populations in the target region. However, it has only rarely been tested whether using regional seed indeed leads to genetically diverse restored populations which are genetically similar to natural populations. We used single nucleotide polymorphism (SNP) markers to investigate genetic diversity within and differentiation among populations of Centaurea jacea and Betonica officinalis on restored and natural meadows in the White Carpathians, Czech Republic. The restoration took place 20 years ago using regional seeds propagated from a mix of multiple regional source populations. We included original regional seeds in our analysis to compare the restored populations with th...
Please refer to the methods section and supplementary information of: Höfner, J., Klein-Raufhake, T., Lampei, C., Mudrak, O., Bucharova, A. and Durka, A. (2021) ‘Populations restored using regional seed are genetically diverse and similar to natural populations in the region’, accepted in Journal of Applied Ecology.
These .vcf files represent the stage after filtering with 'vcftools' and tools from the 'vcflib' library and before import into R. They are derived from the raw sequencing data available on EMBL's European Nucleotide Archive (ENA) under accession number PRJEB45358 (https://www.ebi.ac.uk/ena/browser/view/PRJEB45358) and are intended to facilitate work with this data set.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed RNAseq data used for ICB_Riaz data object.
Publication: https://pubmed.ncbi.nlm.nih.gov/29033130/.
Raw data obtained from https://www.ebi.ac.uk/ena/browser/view/PRJNA356761?show=reads.
Processed with https://github.com/LupienLab/kallisto_snakemake/tree/main/Run_Kallisto
Dataset details: https://predictio.ca/dataset/12.
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).
Resources in this dataset:
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format.
File Name: PBMC7_AllCells.zip
Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows:
* matrix of gene counts (matrix.mtx.gx)
* gene names (features.tsv.gz)
* cell IDs (barcodes.tsv.gz)
Note on the matrix of gene counts: the ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata.
File Name: PBMC7_AllCells_meta.csv
Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:
* nCount_RNA = the number of transcripts detected in a cell
* nFeature_RNA = the number of genes detected in a cell
* Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells
* prcntMito = percent mitochondrial reads in a cell
* Scrublet = doublet probability score assigned to a cell
* seurat_clusters = cluster ID assigned to a cell
* PaperIDs = sample ID for a cell
* celltypes = cell type ID assigned to a cell
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates.
File Name: PBMC7_AllCells_PCAcoord.csv
Resource Description: .csv file containing first 100 PCA coordinates for cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates.
File Name: PBMC7_AllCells_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates.
File Name: PBMC7_AllCells_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates.
File Name: PBMC7_CD4only_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates.
File Name: PBMC7_CD4only_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates.
File Name: PBMC7_GDonly_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates.
File Name: PBMC7_GDonly_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information.
File Name: UnfilteredGeneInfo.txt
Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat.
File Name: PBMC7.tar
Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Benchmark of 5S, 16S, 23S rRNA secondary structures taken from the CRW database https://crw-site.chemistry.gatech.edu/
Each molecule is available in bpseq, ct and dot-bracket-letter (db) format. For each format a version without header/additional information/comments is available in the corresponding bpseq-nH, ct-nH, db-nH folders.
In the files Archaea.xlsx, Bacteria.xlsx and Eukaryota.xslx the molecules in the benchmark are listed together with their Organism Name, ID and Phylogenetic classification (up to Order) according to the European Nucleotide Archive (ENA) taxonomy https://www.ebi.ac.uk/ena/browser/home
The accession number is available from the headers of the bpseq and ct formats.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generally, Rattus norvegicus' miRNA repertoire falls short compared to the other rodent model organism, Mus musculus.
To extend the miRNA catalogue in Rattus norvegicus, we utilized Infernal v1.1 (Nawrocki and Eddy, 2013) to derive potential rat miRNA candidates starting from all available mammalian miRNA families in miRBase. We utilized MIRfix (Yazbeck et al., 2019) to curate the extended miRNA datasets automatically. Subsequent manual inspection and curation of miRNA alignments resulted in a reliable and comprehensive update to the rat miRNA annotation.
Key facts of the extended miRNA repertoire
342 miRNA families (40 novel families)
549 miRNA sequences (56 novel miRNAs)
11 corrected annotated miRNAs
European Nucleotide Archive
The 56 novel sequences not previously listed in miRBase have been submitted to the European Nucleotide Archive at EMBL-EBI. They are accessible with the accession numbers OZ078105 - OZ078160. The sequences will be permanently available from the ENA browser at http://www.ebi.ac.uk/ena/data/view/.
An overview of all sequences is given here: http://www.ebi.ac.uk/ena/data/view/OZ078105-OZ078160.
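To work with the individual records in the range OZ078105 - OZ078160, the range can be expanded into a list of accessions. The sketch below assumes that accessions in the range share a letter prefix followed by a fixed-width number, which holds for the two endpoints given here; the function name `expand_accession_range` is hypothetical.

```python
import re
from typing import List

_ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accession_range(first: str, last: str) -> List[str]:
    """Expand an inclusive accession range such as OZ078105..OZ078160."""
    m_first = _ACCESSION.fullmatch(first)
    m_last = _ACCESSION.fullmatch(last)
    if not m_first or not m_last:
        raise ValueError("accessions must be letters followed by digits")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same prefix")
    if len(m_first.group(2)) != len(m_last.group(2)):
        raise ValueError("accessions must have the same number of digits")
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if start > end:
        raise ValueError("first accession must not come after the last")
    width = len(m_first.group(2))  # preserve leading zeros
    prefix = m_first.group(1)
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


accessions = expand_accession_range("OZ078105", "OZ078160")
print(len(accessions), accessions[0], accessions[-1])
```

The resulting list has 56 entries, matching the number of novel sequences stated above.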
https://spdx.org/licenses/etalab-2.0.html
Assembly of Merlot PacBio HiFi reads using hifiasm 0.13 with the trio binning option. The reads are stored at ENA here: https://www.ebi.ac.uk/ena/browser/view/PRJEB59893
* Run ERR10930361: PacBio HiFi reads from the Merlot mother, Magdeleine noire des Charentes
* Run ERR10930362: PacBio HiFi reads from the Merlot father, Cabernet franc
* Run ERR10930363: PacBio HiFi reads from Merlot leaves
* Run ERR10930364: PacBio HiFi reads from Merlot roots
https://choosealicense.com/licenses/openrail/
PRJDB9111 https://www.ebi.ac.uk/ena/browser/view/PRJDB9111 To generate RNA aptamers against human integrin alphaV beta3, we performed high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX). Of the six rounds performed, rounds 3 to 6 were sequenced.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This entry includes the IMGT/HighV-QUEST output files that were used as input to the TRIP tool for:
1. The scalability experiments (IDs are BC23-OSR052411, BC23-OSR052411-OSR081811, OSR052311-OSR081811 and OSR052411-OSR052311-OSR081811). The corresponding raw FASTQ files are available here (https://www.ebi.ac.uk/ena/browser/view/PRJEB29674).
2. The comparison experiments (IDs are T3304, T3396 and T3397). Raw TR sequence data can be found under accession number SRR3737053 in the GenBank sequence database (www.ncbi.nlm.nih.gov/genbank/).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example dataset input for the Immunoglobulin Intraclonal Diversification Analysis (IgIDivA) tool. (Publication of IgIDivA under revision)
The data was retrieved from ENA (https://www.ebi.ac.uk/ena/browser/view/PRJEB36589?show=reads) under the accession number PRJEB36589, and subsequently processed with IMGT/HighV-QUEST (https://www.imgt.org/HighV-QUEST/home.action) and tripr (https://bioconductor.org/packages/release/bioc/html/tripr.html).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset underlies the scientific publication titled "Enhanced Susceptibility to Tomato Chlorosis Virus (ToCV) in Hsp90- and Sgt1-Silenced Plants: Insights from Gene Expression Dynamics", published in the journal Viruses. The dataset includes a time-course transcriptome analysis using RNA-seq of naïve (no whitefly and no virus), mock (non-viruliferous whiteflies) and ToCV (ToCV-viruliferous whiteflies)-treated tomato samples at 2, 7, and 14 days post-infection (dpi), and viral small RNAs derived from tomato plants infected with ToCV at 14 dpi. The dataset provided here has been deposited in full by the authors in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB67704 (https://www.ebi.ac.uk/ena/browser/view/PRJEB67704). The information in the dataset, and the results derived from it, are discussed and interpreted in detail in the scientific publication. This research was conducted within the VIRTIGATION project, which is part of the EU Open Research Data pilot. This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 101000570.
https://choosealicense.com/licenses/openrail/
PRJEB3289 https://www.ebi.ac.uk/ena/browser/view/PRJEB3289 Data generated by HT-SELEX experiments (see Jolma et al. 2010, PMID: 20378718, for a description of the method) that have now been used to generate transcription factor binding specificity models for most of the high-confidence human transcription factors. Sequence data are composed of reads generated with Illumina Genome Analyzer IIX and HiSeq2000 instruments. Samples are composed of single-read sequencing of synthetic DNA fragments with a fixed-length randomized region, or of samples derived from such an initial library by selection with a sequence-specific DNA binding protein. Originally, multiple samples with different "barcode" tag sequences were run on the same Illumina sequencing lane, but the released files have already been de-multiplexed, and the constant regions and "barcodes" of each sequence have been cut out of the sequencing reads to facilitate use of the data. Some of the files are composed of reads from multiple different sequencing lanes; for this reason, the name of each individual read has been edited to show the flowcell and lane that was used to generate it. Barcodes and oligonucleotide designs are indicated in the names of individual entries. Depending on the selection ligand design, the sequences in each of these fastq files are either 14, 20, 30 or 40 bases long and had different flanking regions on both sides of the sequence. Each run entry is named in one of the following two ways:
Example 1) "BCL6B_DBD_AC_TGCGGG20NGA_1", where the name is composed of the following fields: ProteinName_CloneType_Batch_BarcodeDesign_SelectionCycle. This experiment used barcode ligand TGCGGG20NGA, where both of the variable flanking constant regions are indicated as they were on the original sequence reads. This ligand has been selected for one round of HT-SELEX using recombinant protein that contained the DNA binding domain of human transcription factor BCL6B. The name also indicates that the experiment was performed in the batch of experiments named "AC".
Example 2) "0_TGCGGG20NGA_0", where the name is composed of (zero)_BarcodeDesign_(zero). These sequences have been generated from sequencing of the initial non-selected pool. The same initial pools have been used in multiple experiments in different batches; for example, this background sequence pool is the shared background for all of the following samples: BCL6B_DBD_AC_TGCGGG20NGA_1, ZNF784_full_AE_TGCGGG20NGA_3, DLX6_DBD_Y_TGCGGG20NGA_4 and MSX2_DBD_W_TGCGGG20NGA_2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
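The two run-naming schemes described above can be split into their fields programmatically. The following Python sketch follows the field order given in the text (ProteinName_CloneType_Batch_BarcodeDesign_SelectionCycle, or zero_BarcodeDesign_zero for the non-selected pool). The type and function names are hypothetical, and reading the digits before "N" in the barcode design as the length of the randomized region is an assumption based on the example names, not something the description states explicitly.

```python
import re
from typing import NamedTuple, Optional


class RunName(NamedTuple):
    protein: Optional[str]        # None for the initial non-selected pool
    clone_type: Optional[str]
    batch: Optional[str]
    barcode_design: str
    selection_cycle: int          # 0 for the initial non-selected pool
    random_length: Optional[int]  # assumed: digits before 'N' in the design


# Assumed layout of a barcode design: flank, length, 'N', flank (e.g. TGCGGG20NGA).
_DESIGN = re.compile(r"[ACGT]*(\d+)N[ACGT]*")


def parse_run_name(name: str) -> RunName:
    """Split an HT-SELEX run name into its documented fields."""
    parts = name.split("_")
    if len(parts) == 3 and parts[0] == "0" and parts[2] == "0":
        protein = clone_type = batch = None
        design, cycle = parts[1], 0
    elif len(parts) >= 5:
        # Take the last four fields from the right so that a protein name
        # containing an underscore would still be kept whole.
        protein = "_".join(parts[:-4])
        clone_type, batch, design = parts[-4], parts[-3], parts[-2]
        cycle = int(parts[-1])
    else:
        raise ValueError(f"unrecognised run name: {name!r}")
    match = _DESIGN.fullmatch(design)
    random_length = int(match.group(1)) if match else None
    return RunName(protein, clone_type, batch, design, cycle, random_length)


selected = parse_run_name("BCL6B_DBD_AC_TGCGGG20NGA_1")
background = parse_run_name("0_TGCGGG20NGA_0")
print(selected)
print(background)
```

Grouping parsed names by `barcode_design` is one way to pair each selected sample with its shared background pool.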
License information was derived automatically
The distribution of antimicrobial resistance (AMR) genes for the EU and European Free Trade Association (EFTA) countries was obtained from a global collection of nearly 10,000 Vibrio parahaemolyticus genomes. Some of the strains are from the collection of Prof. Jaime Martinez-Urtaza (Department of Genetics and Microbiology, Universitat Autònoma de Barcelona) or are part of ongoing studies to expand the genome collection; other genomes were retrieved from the European Nucleotide Archive (ENA at https://www.ebi.ac.uk/ena/browser/home) and the National Center for Biotechnology Information (NCBI) [GenBank at https://www.ncbi.nlm.nih.gov/genbank/; RefSeq at https://www.ncbi.nlm.nih.gov/refseq/; SRA at https://www.ncbi.nlm.nih.gov/sra]. AMR genes were detected with a resistance gene detection pipeline based on one of the standard databases (CARD database at https://card.mcmaster.ca/). The phylogenetic tree includes the reference genome from Japan ("Osaka") as reference. The RIMD 2210633 strain has been added as the global reference strain, which has historically been used for all phylogenetic analyses of V. parahaemolyticus. The metadata include the source of the strain, i.e., country, origin (clinical, environmental or unclear), date of isolation, and subtype. The antibiotic resistance genes (ARGs) are shown as present, absent or not applicable. To build the ARGs European V. parahaemolyticus tree, the Parsnp tool, a fast core-genome multi-aligner and SNP detector from the Harvest suite, was used (Treangen et al., 2014). Parsnp calculates the MUMi distances between the reference genome (RIMD_2210633) and each one of the 152 genomes used in this study. The resulting Newick-formatted core genome SNP tree was then uploaded to the web tool iTOL (Letunic and Bork, 2021) and midpoint rooted, and the metadata of the samples were incorporated.
The accession IDs for the genomes included in the metadata are accessible in the following databases according to the first characters:
* GCA: GenBank (https://www.ncbi.nlm.nih.gov/genbank/)
* GCF: RefSeq (https://www.ncbi.nlm.nih.gov/refseq/)
* ERR: ENA (https://www.ebi.ac.uk/ena/browser/home)
* SRR: SRA (https://www.ncbi.nlm.nih.gov/sra)
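The prefix-to-database list above translates directly into a lookup table. The sketch below uses only the four prefixes and URLs listed in this description; the function name `database_for` is hypothetical, and accessions with other prefixes are reported as unknown rather than guessed.

```python
from typing import Dict, Tuple

# Prefix -> (database name, URL), exactly as listed in the description above.
ACCESSION_DATABASES: Dict[str, Tuple[str, str]] = {
    "GCA": ("GenBank", "https://www.ncbi.nlm.nih.gov/genbank/"),
    "GCF": ("RefSeq", "https://www.ncbi.nlm.nih.gov/refseq/"),
    "ERR": ("ENA", "https://www.ebi.ac.uk/ena/browser/home"),
    "SRR": ("SRA", "https://www.ncbi.nlm.nih.gov/sra"),
}


def database_for(accession: str) -> Tuple[str, str]:
    """Return (database name, URL) for an accession based on its first three characters."""
    prefix = accession.strip()[:3].upper()
    try:
        return ACCESSION_DATABASES[prefix]
    except KeyError:
        raise ValueError(f"no database listed for prefix {prefix!r}") from None


for acc in ("GCA_963668995.1", "ERR10930361"):
    name, url = database_for(acc)
    print(acc, "->", name, url)
```

The two example accessions are taken from other entries on this page and are used here only to exercise the lookup.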
References
Letunic I and Bork P, 2021. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res, 49:W293-W296. doi: 10.1093/nar/gkab301
Treangen TJ, Ondov BD, Koren S and Phillippy AM, 2014. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol, 15:524. doi: 10.1186/s13059-014-0524-x