98 datasets found
  1. Sequence Read Archive (SRA)

    • catalog.data.gov
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Sequence Read Archive (SRA) [Dataset]. https://catalog.data.gov/dataset/sequence-read-archive-sra-54e4a
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

  2. Species included in the analysis, including environment (freshwater [FW] or...

    • datasetcatalog.nlm.nih.gov
    Updated Dec 5, 2024
    Cite
    Barts, Nick; Wilson, Elizabeth J.; Tobler, Michael; Greenway, Ryan; Coffin, John L.; Johnson, James B.; Kelley, Joanna L.; Peña, Carlos M. Rodríguez (2024). Species included in the analysis, including environment (freshwater [FW] or saltwater [SW]), collection location, sample size (N), NCBI Sequence Read Archive (SRA) accession numbers, and study reference. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001362851
    Dataset updated
    Dec 5, 2024
    Authors
    Barts, Nick; Wilson, Elizabeth J.; Tobler, Michael; Greenway, Ryan; Coffin, John L.; Johnson, James B.; Kelley, Joanna L.; Peña, Carlos M. Rodríguez
    Description

    Species included in the analysis, including environment (freshwater [FW] or saltwater [SW]), collection location, sample size (N), NCBI Sequence Read Archive (SRA) accession numbers, and study reference.

  3. Top 50 conserved aging predictive genes.

    • plos.figshare.com
    xlsx
    Updated Jun 9, 2023
    Cite
    Joe L. Webb; Simon M. Moe; Andrew K. Bolstad; Elizabeth M. McNeill (2023). Top 50 conserved aging predictive genes. [Dataset]. http://doi.org/10.1371/journal.pone.0255085.s004
    Available download formats: xlsx
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Joe L. Webb; Simon M. Moe; Andrew K. Bolstad; Elizabeth M. McNeill
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This table describes whether previous reports exist linking these genes to aging or neurodegeneration phenotypes in humans or another model organism. (XLSX)

  4. Genome assemblies and respective wg/cgMLST profiles of a diverse dataset...

    • zenodo.org
    xlsx, zip
    Updated Sep 28, 2022
    Cite
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2022). Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 1,434 Salmonella enterica isolates [Dataset]. http://doi.org/10.5281/zenodo.7230091
    Available download formats: zip, xlsx
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal
    Department Biological Safety, German Federal Institute for Risk Assessment, Berlin, Germany
    Authors
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 8,558-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,434 Salmonella enterica samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 125 different serotypes are represented in this dataset, with Typhimurium (including monophasic), Enteritidis and Infantis being the most represented ones and, together, corresponding to 56.2% of the dataset.

    File “Se_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Se_profiles_wgMLST.tsv” is a tab-separated file with the 8,558-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Se_profiles_cgMLST_95.tsv”, “profiles/Se_profiles_cgMLST_98.tsv” and “profiles/Se_profiles_cgMLST_100.tsv” contain the 3,261-loci, 3,179-loci and 874-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.

    Dataset selection and curation

    With the objective of creating a diverse dataset of S. enterica genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,779 samples associated with four BioProjects (PRJEB16326, PRJEB20997, PRJEB30335 and PRJEB39988). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,434 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with SeqSero2 v1.2.1 (Zhang et al. 2019). wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 8,558-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 8,558-loci wgMLST profiles of the 1,434 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 3,261-loci, 3,179-loci and 874-loci allelic matrices, respectively).
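
    The “--site-inclusion” thresholds described above can be illustrated with a minimal Python sketch. This is not ReporTree's implementation: the function name `filter_loci`, the toy matrix, and the use of "0" to mark an uncalled allele are all assumptions made for illustration.

```python
from typing import Dict, List

MISSING = "0"  # assumption: an uncalled allele is coded as "0" in the matrix


def filter_loci(profiles: Dict[str, Dict[str, str]], site_inclusion: float) -> List[str]:
    """Return the loci called in at least `site_inclusion` (a fraction) of samples."""
    samples = list(profiles)
    loci = list(next(iter(profiles.values())))
    kept = []
    for locus in loci:
        # Count the samples in which this locus has a called allele.
        called = sum(1 for s in samples if profiles[s][locus] != MISSING)
        if called / len(samples) >= site_inclusion:
            kept.append(locus)
    return kept


# Toy allelic matrix: 4 samples x 3 loci (hypothetical values).
toy = {
    "s1": {"locA": "1", "locB": "2", "locC": "0"},
    "s2": {"locA": "1", "locB": "0", "locC": "0"},
    "s3": {"locA": "3", "locB": "2", "locC": "5"},
    "s4": {"locA": "1", "locB": "4", "locC": "0"},
}

core_100 = filter_loci(toy, 1.0)   # loci called in every sample
core_75 = filter_loci(toy, 0.75)   # loci called in at least 75% of samples
print(core_100, core_75)
```

    Raising the threshold can only shrink the retained set of loci, which is consistent with the locus counts reported above decreasing from the 0.95 to the 1.0 threshold.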

  5. Sequencing Data Set of Sediment Layers

    • catalog.data.gov
    Updated May 17, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Sequencing Data Set of Sediment Layers [Dataset]. https://catalog.data.gov/dataset/sequencing-data-set-of-sediment-layers
    Dataset updated
    May 17, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    A table (DP_SRA.xlsx) lists one sample per row, with columns giving the biosample accession number (NCBI), collection (date), library strategy, target (source), and sequencing (technology) for each individual sample. The zip file (Genome_Set01.zip) contains nine (9) FASTA files (DP_bin_02.fasta, DP_bin_04.fasta, DP_bin_09.fasta, DP_bin_10.fasta, DP_bin_14.fasta, DP_bin_15.fasta, DP_bin_16a.fasta, DP_bin_20.fasta, DP_bin_23.fasta) with the contig sequences (i.e. binning) for each metagenome-assembled genome (MAG). These data are available from the NCBI Sequence Read Archive (SRA) under the BioProject (https://www.ncbi.nlm.nih.gov/bioproject) with accession number PRJNA646252 and the following BioSample numbers: SAMN15536103 to SAMN15536108. This dataset is associated with the following publication: Gomez-Alvarez, V., H. Liu, J. Pressman, and D. Wahman. Metagenomic Profile of Microbial Communities in a Drinking Water Storage Tank Sediment after Sequential Exposure to Monochloramine, Free Chlorine, and Monochloramine. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 1(5): 1283-1294, (2021).
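
    The BioSample range quoted above can be expanded into individual accessions with a short Python sketch. It assumes the range is contiguous and inclusive, as the wording "SAMN15536103 to SAMN15536108" suggests; the helper name `expand_accessions` is hypothetical.

```python
import re
from typing import List


def expand_accessions(first: str, last: str) -> List[str]:
    """Expand an inclusive accession range that shares one alphabetic prefix."""
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("accessions must share the same prefix")
    prefix, width = m1.group(1), len(m1.group(2))
    start, end = int(m1.group(2)), int(m2.group(2))
    # Zero-pad to the original digit width so accessions keep their length.
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


biosamples = expand_accessions("SAMN15536103", "SAMN15536108")
print(biosamples)
```

    Under that assumption the range covers six BioSample accessions.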

  6. Genome assemblies and respective wg/cgMLST profiles of a diverse dataset...

    • zenodo.org
    bin, zip
    Updated Jul 24, 2023
    Cite
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2023). Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 1,999 Escherichia coli isolates [Dataset]. http://doi.org/10.5281/zenodo.7230102
    Available download formats: zip, bin
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 7,601-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,999 Escherichia coli samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of serotype). In total, 411 different serotypes are represented in this dataset, with O157:H7 being the most represented one, corresponding to 37.1% of the dataset.

    File “Ec_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST and serotype.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Ec_profiles_wgMLST.tsv” is a tab-separated file with the 7,601-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Ec_profiles_cgMLST_95.tsv”, “profiles/Ec_profiles_cgMLST_98.tsv” and “profiles/Ec_profiles_cgMLST_100.tsv” contain the 2,826-loci, 2,704-loci and 465-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.

    Dataset selection and curation

    With the objective of creating a diverse dataset of E. coli genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the Enterobase database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 2,688 samples associated with three BioProjects (PRJNA230969, PRJEB27020 and PRJNA248042). Their WGS data was downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 1,999 isolates passed this curation step and were included in the final dataset. In-silico serotyping was performed with seq_typing v2.2. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 7,601-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 7,601-loci wgMLST profiles of the 1,999 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 2,826-loci, 2,704-loci and 465-loci allelic matrices, respectively).
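
    The curation rule described above (keep QC passes, and recover assemblies that failed only on “NumContamSNVs” when more than 98% of reads match the expected species) can be sketched in Python. This is a simplification under stated assumptions: the authors also inspected a random subset manually, and the `Assembly` class, its field names, and the "N50" check label are hypothetical rather than taken from the Aquamis output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assembly:
    name: str
    failed_checks: List[str] = field(default_factory=list)  # empty list means QC pass
    pct_reads_correct_species: float = 0.0  # hypothetical field name


def keep_assembly(a: Assembly, min_pct: float = 98.0) -> bool:
    """Keep QC passes, plus failures caused only by NumContamSNVs with > min_pct reads."""
    if not a.failed_checks:
        return True
    only_contam_snvs = set(a.failed_checks) == {"NumContamSNVs"}
    return only_contam_snvs and a.pct_reads_correct_species > min_pct


candidates = [
    Assembly("iso1", [], 99.5),                         # QC pass
    Assembly("iso2", ["NumContamSNVs"], 99.1),          # recovered
    Assembly("iso3", ["NumContamSNVs"], 97.0),          # too few on-target reads
    Assembly("iso4", ["NumContamSNVs", "N50"], 99.9),   # failed another check too
]
final = [a.name for a in candidates if keep_assembly(a)]
print(final)
```

    Only the first two toy isolates survive: one passes QC outright and the other is recovered by the single-parameter exception.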

    Acknowledgements

    We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.

  7. Repository for Single Cell RNA Sequencing Analysis of The EMT6 Dataset

    • data.niaid.nih.gov
    Updated Nov 20, 2023
    Cite
    Hsu, Jonathan; Stoop, Allart (2023). Repository for Single Cell RNA Sequencing Analysis of The EMT6 Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10011621
    Dataset updated
    Nov 20, 2023
    Authors
    Hsu, Jonathan; Stoop, Allart
    Description

    Table of Contents

    Main Description
    File Descriptions
    Linked Files
    Installation and Instructions

    1. Main Description

    This is the Zenodo repository for the manuscript titled "A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity.". The code included in the file titled marengo_code_for_paper_jan_2023.R was used to generate the figures from the single-cell RNA sequencing data. The following libraries are required for script execution:

    Seurat, scRepertoire, ggplot2, stringr, dplyr, ggridges, ggrepel, ComplexHeatmap

    File Descriptions

    The code can be downloaded and opened in RStudio. The "marengo_code_for_paper_jan_2023.R" file contains all the code needed to reproduce the figures in the paper. The "Marengo_newID_March242023.rds" file is available at the following address: https://zenodo.org/badge/DOI/10.5281/zenodo.7566113.svg (Zenodo DOI: 10.5281/zenodo.7566113). The "all_res_deg_for_heat_updated_march2023.txt" file contains the unfiltered results from the DGE analysis, also used to create the heatmap with DGE and volcano plots. The "genes_for_heatmap_fig5F.xlsx" file contains the genes included in the heatmap in figure 5F.

    Linked Files

    This repository contains code for the analysis of a single cell RNA-seq dataset. The dataset contains raw FASTQ files, as well as the aligned files that were deposited in GEO. The "Rdata" or "Rds" file was deposited in Zenodo. Provided below are descriptions of the linked datasets:

    Gene Expression Omnibus (GEO) ID: GSE223311(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223311)

    Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment. Description: This submission contains the "matrix.mtx", "barcodes.tsv", and "genes.tsv" files for each replicate and condition, corresponding to the aligned files for single cell sequencing data. Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).

    Sequence read archive (SRA) repository ID: SRX19088718 and SRX19088719

    Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment. Description: This submission contains the raw sequencing (.fastq.gz) files, which are gzip-compressed FASTQ text files. Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).

    Zenodo DOI: 10.5281/zenodo.7566113(https://zenodo.org/record/7566113#.ZCcmvC2cbrJ)

    Title: A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity. Description: This submission contains the "Rdata" or ".Rds" file, which is an R object file. This is a necessary file to use the code. Submission type: Restricted Access. In order to gain access to the repository, you must contact the author.

    Installation and Instructions

    The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:

    Ensure you have R version 4.1.2 or higher for compatibility.

    Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.

    1. Download the "Rdata" or ".Rds" file from Zenodo (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ) (Zenodo DOI: 10.5281/zenodo.7566113).
    2. Open RStudio (https://www.rstudio.com/tags/rstudio-ide/) or a similar integrated development environment (IDE) for R.
    3. Set your working directory to where the following files are located:

    marengo_code_for_paper_jan_2023.R
    Install_Packages.R
    Marengo_newID_March242023.rds
    genes_for_heatmap_fig5F.xlsx
    all_res_deg_for_heat_updated_march2023.txt

    You can use the following code to set the working directory in R:

    setwd(directory)

    4. Open the file titled "Install_Packages.R" and execute it in your R IDE. This script will attempt to install all the necessary packages and their dependencies in order to set up an environment where the code in "marengo_code_for_paper_jan_2023.R" can be executed.
    5. Once the "Install_Packages.R" script has been successfully executed, restart RStudio or your IDE of choice.
    6. Open the file "marengo_code_for_paper_jan_2023.R" in RStudio or your IDE of choice.
    7. Execute the commands in "marengo_code_for_paper_jan_2023.R" in RStudio or your IDE of choice to generate the plots.

  8. Bioinformatics Services Market Grows from USD 2.9 Billion to 10.7 Billion by...

    • media.market.us
    Updated Oct 8, 2025
    Cite
    Market.us Media (2025). Bioinformatics Services Market Grows from USD 2.9 Billion to 10.7 Billion by 2033 [Dataset]. https://media.market.us/bioinformatics-services-market-news-2025/
    Dataset updated
    Oct 8, 2025
    Dataset authored and provided by
    Market.us Media
    License

    https://media.market.us/privacy-policy

    Time period covered
    2022 - 2032
    Description

    Overview

    The Global Bioinformatics Services Market is projected to reach USD 10.7 billion by 2033, growing from USD 2.9 billion in 2023 at a CAGR of 13.9%. Growth is being driven by the rapid expansion of genomic and health data generation across research institutions, healthcare systems, and public-health agencies. The World Health Organization’s Global Genomic Surveillance Strategy has positioned bioinformatics as a core element in detecting and responding to health threats. This policy direction is reinforcing global demand for scalable analytical platforms, secure data sharing, and sustainable workflow solutions.
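
    As a quick arithmetic check on the growth figures quoted above, the compound annual growth rate can be recomputed from the two endpoint values. This is a minimal sketch that assumes a ten-year span (2023 to 2033) and annual compounding; the function name `cagr` is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# USD 2.9 billion in 2023 growing to USD 10.7 billion by 2033.
rate = cagr(2.9, 10.7, 2033 - 2023)
print(f"{rate:.1%}")
```

    The result is roughly 13.9%, in line with the stated CAGR.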

    A fundamental growth catalyst is the declining cost of sequencing. According to the U.S. National Human Genome Research Institute, the cost per genome has decreased sharply since the late 2000s. As sequencing becomes more affordable, the number of samples increases, driving demand for downstream data storage, processing, and interpretation. Consequently, outsourcing bioinformatics tasks to specialized service providers has become more common and cost-effective.

    Another major factor supporting market expansion is the rise in publicly available genomic data. The NIH Sequence Read Archive (SRA) surpassed 50 petabases of data by early 2024, requiring large-scale indexing, quality control, and reanalysis. This massive data load necessitates professional expertise and infrastructure, which are primarily offered by bioinformatics service companies.

    The integration of genomics into healthcare systems is further strengthening market growth. The NHS Genomic Medicine Service in England is expanding clinical genomics applications in oncology and rare disease management. This transition creates sustained demand for validated bioinformatics pipelines, variant curation, and clinical reporting services. Healthcare institutions increasingly depend on external service providers for secure, clinical-grade analysis pipelines and data governance compliance, ensuring both accuracy and confidentiality in genomic interpretation.

    Emerging Opportunities and Regional Investments

    Public health initiatives and global investments are enhancing the bioinformatics services landscape. Programs like the U.S. CDC’s Advanced Molecular Detection and ECDC’s sequencing integration are driving large-scale genomic surveillance. These initiatives require ongoing analysis, pipeline standardization, and data-platform management, which are largely delivered through external service providers. As countries institutionalize sequencing, recurring demand for bioinformatics workflows and analytic services is expected to persist.

    In low- and middle-income countries, international investment is expanding market opportunities. The World Bank’s genomic capacity-building programs in Africa are fostering sequencing and analytics infrastructure. These efforts include bioinformatics training and workflow design, ensuring long-term sustainability. Such projects significantly widen the global serviceable market for bioinformatics expertise. Similarly, large-scale national genomic initiatives like the NIH All of Us program generate billions of variants that require harmonization, annotation, and interpretation, sustaining demand for cloud-based data management and analytic platforms.

    The growing focus on antimicrobial resistance (AMR) is also fueling bioinformatics adoption. Under WHO’s GLASS platform, countries are integrating whole-genome sequencing into AMR surveillance. This expansion is creating consistent demand for quality assurance, centralized analysis hubs, and workflow optimization. Furthermore, data governance reforms by the OECD and other regulatory bodies are facilitating secure secondary use of genomic data, promoting trust in data sharing and collaboration.

    Strategic public funding further strengthens the market outlook. Horizon Europe’s Health Work Programme (2025) and NHGRI’s technology initiatives continue to fund large-scale, data-driven research, ensuring a steady flow of contracts for bioinformatics firms. Workforce development is also improving, with national systems such as NHS England expanding bioinformatics training. This capacity building not only supports in-house analytics but also increases outsourcing to handle peak workloads and specialized computational tasks.

    In conclusion, the bioinformatics services market is benefiting from multiple converging factors—technological affordability, global health investments, regulatory clarity, and expanding data ecosystems. These structural developments are shaping a resilient, long-term demand environment for scalable, compliant, and high-quality bioinformatics services worldwide.

    [Figure: Bioinformatics Services Market Size Forecast (https://market.us/wp-content/uploads/2022/06/Bioinformatics-Services-Market-Size-Forecast-2.jpg)]

  9. Catalog of NCBI sequence read archive (SRA) data for salamanders at the...

    • portal.edirepository.org
    csv
    Updated Apr 9, 2024
    Cite
    Brett Addis; Madaline Cochrane; Winsor Lowe (2024). Catalog of NCBI sequence read archive (SRA) data for salamanders at the Hubbard Brook Experimental Forest 2012-2021 [Dataset]. http://doi.org/10.6073/pasta/6df7199d751ec81315395a042cbd8083
    Available download formats: csv (312227 bytes), csv (220695 bytes), csv (282251 bytes)
    Dataset updated
    Apr 9, 2024
    Dataset provided by
    EDI
    Authors
    Brett Addis; Madaline Cochrane; Winsor Lowe
    Time period covered
    2012 - 2021
    Variables measured
    strain, ecotype, isolate, lat_lon, cultivar, organism, Accession, BioProject, env_medium, sample_URL, and 8 more
    Description

    This project was designed to describe fine-scale population genetic differentiation of the stream salamander Gyrinophilus porphyriticus among five study streams in the Hubbard Brook Experimental Forest. The data are paired with intensive capture-recapture data to assess direct fitness effects of individual genetic diversity, including effects of individual multilocus heterozygosity on stage-specific survival probabilities.

    This dataset publishes a manifest of the genomic sequence reads submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). These samples are published at NCBI under the BioProject ID 1090913 (https://www.ncbi.nlm.nih.gov/bioproject/1090913). The tables here include sample metadata and the NCBI URLs to each sample.

    These data were gathered as part of the Hubbard Brook Ecosystem Study (HBES). The HBES is a collaborative effort at the Hubbard Brook Experimental Forest, which is operated and maintained by the USDA Forest Service, Northern Research Station.
    
  10. Chromosome assembly and preliminary gene and repeat annotations for Myzomela...

    • datadryad.org
    zip
    Updated Jul 27, 2024
    Cite
    Elsie Shogren; Jason Sardell; Christina Muirhead; Emiliano Martí; Elizabeth Cooper; Robert Moyle; Daven Presgraves; Albert J. Uy (2024). Chromosome assembly and preliminary gene and repeat annotations for Myzomela tristrami reference genome [Dataset]. http://doi.org/10.5061/dryad.612jm64c9
    Available download formats: zip
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    Dryad
    Authors
    Elsie Shogren; Jason Sardell; Christina Muirhead; Emiliano Martí; Elizabeth Cooper; Robert Moyle; Daven Presgraves; Albert J. Uy
    Time period covered
    Jul 15, 2024
    Description

    Chromosome assembly and preliminary gene and repeat annotations for Myzomela tristrami reference genome

    I. Files

    (GENOME)
    Mt_v1.0_MAIN.fa.gz: Primary genome, (largely) scaffolded to chromosome-level, plus other primary assembled contigs
    Mt_v1.0_MAIN.gff.gz: Simple gene annotations for primary genome, annotated using GeMoMa v1.8 and a zebra finch (bTaeGut1.4.pri) annotation reference
    Mt_v1.0_extra.fa.gz: Additional contigs, not for use in most analyses but some may be of interest. This set is a combination of hand-identified haplotigs of the main genome, and assembler-identified "alternate" (haplotig) contigs

    (ORIGINAL_ASSEMBLY_CONTIGS)
    Mt_hifi.asm.p.fa.gz: "primary" assembly contigs, output from hifiasm (v0.13-r308)
    Mt_hifi.asm.a.fa.gz: "alternate" assembly contigs, output from hifiasm (v0.13-r308)

    (REPEAT_MASKING)
    TElib_Myzo_preliminary.fa.gz: Preliminary Myzomela-tuned TE/repeat library, generated using RepeatModeler (v.2)
    Mt_v1.0_MAIN_RM_sites_to_filter.txt: List of sites masked by RepeatM...

  11. Repository for the single cell RNA sequencing data analysis for the human...

    • explore.openaire.eu
    Updated Aug 26, 2023
    Cite
    Jonathan; Andrew; Pierre; Allart; Adrian (2023). Repository for the single cell RNA sequencing data analysis for the human manuscript. [Dataset]. http://doi.org/10.5281/zenodo.8286134
    Dataset updated
    Aug 26, 2023
    Authors
    Jonathan; Andrew; Pierre; Allart; Adrian
    Description

    This is the GitHub repository for the single cell RNA sequencing data analysis for the human manuscript. The following essential libraries are required for script execution:

    Seurat, scRepertoire, ggplot2, dplyr, ggridges, ggrepel, ComplexHeatmap

    Linked Files

    This repository contains code for the analysis of a single cell RNA-seq dataset. The dataset contains raw FASTQ files, as well as the aligned files that were deposited in GEO. Provided below are descriptions of the linked datasets:

    1. Gene Expression Omnibus (GEO) ID: GSE229626

    Title: Gene expression profile at single cell level of human T cells stimulated via antibodies against the T Cell Receptor (TCR). Description: This submission contains the matrix.mtx, barcodes.tsv, and genes.tsv files for each replicate and condition, corresponding to the aligned files for single cell sequencing data. Submission type: Private. In order to gain access to the repository, you must use a "reviewer token" (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).

    2. Sequence read archive (SRA) repository

    Title: Gene expression profile at single cell level of human T cells stimulated via antibodies against the T Cell Receptor (TCR). Description: This submission contains the raw sequencing (.fastq.gz) files, which are gzip-compressed FASTQ text files. Submission type: Private. In order to gain access to the repository, you must use a "reviewer token" (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).

    Please note that since the GSE submission is private, the raw data deposited at SRA may not be accessible until the embargo on GSE229626 has been lifted.

    Installation and Instructions

    The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:

    Ensure you have R version 4.1.2 or higher for compatibility.

    Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.

    The following code can be used to set the working directory in R:

    setwd(directory)

    Steps:

    1. Download the "Human_code_April2023.R" and "Install_Packages.R" R scripts, and the processed data from GSE229626.
    2. Open RStudio (https://www.rstudio.com/tags/rstudio-ide/) or a similar integrated development environment (IDE) for R.
    3. Set your working directory to where the following files are located: Human_code_April2023.R and Install_Packages.R.
    4. Open the file titled Install_Packages.R and execute it in your R IDE. This script will attempt to install all the necessary packages and their dependencies.
    5. Open the Human_code_April2023.R script and execute commands as necessary.

  12. Data relating to RNA sequence accessions at NCBI from Ross Sea...

    • bco-dmo.org
    csv
    Updated May 17, 2018
    Cite
    Rebecca J. Gast (2018). Data relating to RNA sequence accessions at NCBI from Ross Sea Dinoflagellates, Phaeocystis antarctica, Pyramimons tychotreta, and Micromonas polaris (CCMP 2099) (Kleptoplasty project) [Dataset]. https://www.bco-dmo.org/dataset/728427
    Explore at:
    Available download formats: csv (16.59 KB)
    Dataset updated
    May 17, 2018
    Dataset provided by
    Biological and Chemical Data Management Office
    Authors
    Rebecca J. Gast
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 1997 - Apr 7, 1998
    Area covered
    Variables measured
    lat, lon, temp, depth, isolate, Organism, BioSample, SRA_Study, replicate, Assay_Type, and 13 more
    Measurement technique
    Automated DNA Sequencer
    Description

    This dataset contains data related to RNA sequence genetic accessions at the National Center for Biotechnology Information (NCBI) including information about the host organism, collection location, and collection date.

    The accessions are the unprocessed Illumina MiSeq reads for the Ross Sea Dinoflagellate RNA-Seq experiments, Phaeocystis antarctica RNA-Seq experiments, and Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy experiments.

    Pyramimonas tychotreta & Micromonas polaris (CCMP 2099) mixotrophy RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP090401 (BioProject PRJNA342459).

    Ross Sea Dinoflagellate RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP132912 (BioProject PRJNA428208).

    Phaeocystis antarctica RNA sequences are available through the NCBI Sequence Read Archive (SRA) under the accession number SRP133243 (BioProject PRJNA434497).
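
The three accession pairs above can be turned into NCBI landing-page links. The following is a minimal Python sketch, not part of the dataset. It assumes the URL patterns `https://www.ncbi.nlm.nih.gov/sra/<accession>` and `https://www.ncbi.nlm.nih.gov/bioproject/<accession>`, which match the links quoted in other entries on this page. The names `ACCESSIONS`, `NCBI_BASE`, and `ncbi_links` are hypothetical.

```python
# SRA study accession -> BioProject accession, copied from the description above.
ACCESSIONS = {
    "SRP090401": "PRJNA342459",  # Pyramimonas tychotreta & Micromonas polaris mixotrophy
    "SRP132912": "PRJNA428208",  # Ross Sea Dinoflagellate
    "SRP133243": "PRJNA434497",  # Phaeocystis antarctica
}

NCBI_BASE = "https://www.ncbi.nlm.nih.gov"  # assumed base URL


def ncbi_links(sra_study: str, bioproject: str) -> dict:
    """Build the assumed landing-page URLs for one study/BioProject pair."""
    return {
        "sra": f"{NCBI_BASE}/sra/{sra_study}",
        "bioproject": f"{NCBI_BASE}/bioproject/{bioproject}",
    }


if __name__ == "__main__":
    # Print one line per study with both links.
    for study, project in ACCESSIONS.items():
        links = ncbi_links(study, project)
        print(study, links["sra"], links["bioproject"])
```

The sketch only formats strings; it does not check that the pages exist or download any reads.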

  13. Whole genome sequencing of three North American large-bodied birds

    • gimi9.com
    Updated Oct 26, 2023
    Cite
    (2023). Whole genome sequencing of three North American large-bodied birds [Dataset]. https://gimi9.com/dataset/data-gov_whole-genome-sequencing-of-three-north-american-large-bodied-birds/
    Explore at:
    Dataset updated
    Oct 26, 2023
    Area covered
    United States
    Description

    The data release details the samples, methods, and raw data used to generate high-quality genome assemblies for greater sage-grouse (Centrocercus urophasianus), white-tailed ptarmigan (Lagopus leucura), and trumpeter swan (Cygnus buccinator). The raw data have been deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI), the authoritative repository for public biological sequence data, and are not included in this data release. Instead, the accessions that link to those data via the NCBI portal (www.ncbi.nlm.nih.gov) are provided herein. The release consists of a single file, sample.metadata.txt, which maps NCBI accessions to the samples sequenced and the different types of sequencing performed to generate the assemblies and annotate their gene features.

  14. Pseudomonas sp. HOU2 predicted gene sequences

    • figshare.com
    txt
    Updated Jul 22, 2024
    Cite
    Van Hong Thi Dao; Son Truong Dinh (2024). Pseudomonas sp. HOU2 predicted gene sequences [Dataset]. http://doi.org/10.6084/m9.figshare.26325310.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Van Hong Thi Dao; Son Truong Dinh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The whole genome of Pseudomonas sp. HOU2 was analyzed by RAST (Rapid Annotation using Subsystem Technology) (https://rast.nmpdr.org/) on 18 July 2024 with the following selected options to obtain the predicted HOU2 gene sequences:

    - Genetic code: 11
    - Annotation scheme: RASTtk
    - Preserve gene calls: no
    - Automatically fix errors: yes
    - Fix frameshifts: yes
    - Backfill gaps: yes

    The NCBI Sequence Read Archive accession for Pseudomonas sp. HOU2 is SRR29666724 (https://www.ncbi.nlm.nih.gov/sra/SRR29666724).
    The NCBI complete genome of Pseudomonas sp. HOU2 is CP160398.1 (https://www.ncbi.nlm.nih.gov/nuccore/CP160398).

  15. Data from: Metagenomic and near full-length 16S rRNA sequence data in...

    • agdatacommons.nal.usda.gov
    bin
    Updated Feb 9, 2024
    Cite
    Phillip R. Myer; MinSeok Kim; Harvey C. Freetly; Timothy P.L. Smith (2024). Data from: Metagenomic and near full-length 16S rRNA sequence data in support of the phylogenetic analysis of the rumen bacterial community in steers [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Data_from_Metagenomic_and_near_full-length_16S_rRNA_sequence_data_in_support_of_the_phylogenetic_analysis_of_the_rumen_bacterial_community_in_steers/24852534
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Data in Brief
    Authors
    Phillip R. Myer; MinSeok Kim; Harvey C. Freetly; Timothy P.L. Smith
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Amplicon sequencing utilizing next-generation platforms has significantly transformed how research is conducted, specifically in microbial ecology. However, primer and sequencing platform biases can confound or change the way scientists interpret these data. The Pacific Biosciences RSII instrument may also preferentially load smaller fragments, which may also be a function of PCR product exhaustion during sequencing. To further examine these biases, data are provided from 16S rRNA rumen community analyses. Specifically, data from the relative phylum-level abundances for the ruminal bacterial community are provided to determine between-sample variability. Direct sequencing of metagenomic DNA was conducted to circumvent primer-associated biases in 16S rRNA reads, and rarefaction curves were generated to demonstrate adequate coverage of each amplicon. PCR products were also subjected to reduced amplification and pooling to reduce the likelihood of PCR product exhaustion during sequencing on the Pacific Biosciences platform. The taxonomic profiles for the relative phylum-level and genus-level abundance of rumen microbiota as a function of PCR pooling for sequencing on the Pacific Biosciences RSII platform are provided. Data are within this article, and raw ruminal MiSeq sequence data are available from the NCBI Sequence Read Archive (SRA Accession SRP047292). Additional descriptive information is associated with NCBI BioProject PRJNA261425 (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA261425/).

    Resources in this dataset:

    Resource Title: NCBI Sequence Read Archive (SRA Accession SRP047292). File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/sra/SRX704260 1 ILLUMINA (Illumina MiSeq) run: 978,195 spots, 532.9M bases, 311.6Mb downloads.

  16. Replication Data for: Changes in DNA Methylation During Anoxia and...

    • dataverse.azure.uit.no
    • search.dataone.org
    txt
    Updated Aug 5, 2025
    Cite
    Magdalena Winklhofer; Øivind Andersen; Sjannie Lefevre (2025). Replication Data for: Changes in DNA Methylation During Anoxia and Reoxygenation in Crucian Carp Brain [Dataset]. http://doi.org/10.18710/GSHJEB
    Explore at:
    Available download formats: txt (29643726), txt (29646762), txt (29645800), txt (29639614), txt (29648691), txt (29642203), txt (29644166), txt (242971), txt (2026), txt (14315), txt (17150), txt (91177), txt (5837), txt (29645960), txt (29648493), txt (29651345), txt (282436), txt (227822), txt (115430), txt (3415), txt (29648375), txt (300150), txt (29647183), txt (13029)
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    DataverseNO
    Authors
    Magdalena Winklhofer; Øivind Andersen; Sjannie Lefevre
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Research Council of Norway
    Sigma2
    Description

    This analysis comprised the identification of DNA methylation sites in the context of CpG islands and differentially methylated regions (DMRs) with MethylScore. Further, the mRNA sequencing data were analyzed to identify differentially expressed genes. Differentially expressed genes and identified DMRs were correlated. Finally, DMRs and their expression changes were characterized in their genomic context. All raw sequencing data used as input for the analyses that produced the data in this repository are deposited in the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA1163668 (http://www.ncbi.nlm.nih.gov/bioproject/1163668). The genome assembly and annotation data used here were obtained from DataverseNO (https://doi.org/10.18710/GXMSUH). This genome assembly is based on the raw sequencing data deposited under BioProject ID PRJNA1119394 (http://www.ncbi.nlm.nih.gov/bioproject/1119394). Together, these data were used to identify CpG sites genome-wide and to identify differentially methylated regions. The corresponding mRNA was used to identify transcriptional changes and to enable a comparison of differentially methylated genes with differentially expressed genes. Scripts are available in the GitHub repository WholeGenomeBisulphiteSequencing (https://github.com/MagdalenaWinklhofer/WholeGenomeBisulphiteSequencing.git).

  17. Aging correlated genes.

    • plos.figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Joe L. Webb; Simon M. Moe; Andrew K. Bolstad; Elizabeth M. McNeill (2023). Aging correlated genes. [Dataset]. http://doi.org/10.1371/journal.pone.0255085.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Joe L. Webb; Simon M. Moe; Andrew K. Bolstad; Elizabeth M. McNeill
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This table depicts the aging correlated genes for humans and flies sorted according to their correlation coefficient. (XLSX)

  18. List of whole genome resequenced datasets available on Sequence Read...

    • datasetcatalog.nlm.nih.gov
    Updated Jan 19, 2023
    Cite
    Forcina, Giovanni; Sadanandan, Keren R.; Wu, Meng Yue; Low, Gabriel Weijie; Baldwin, Maude W.; Wu, Shaoyuan; Rheindt, Frank E.; van Grouw, Hein; Edwards, Scott V.; Gwee, Chyi Yin (2023). List of whole genome resequenced datasets available on Sequence Read Archive, European Nucleotide Archive or chickenSD for Gallus gallus and used in this study. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001056782
    Explore at:
    Dataset updated
    Jan 19, 2023
    Authors
    Forcina, Giovanni; Sadanandan, Keren R.; Wu, Meng Yue; Low, Gabriel Weijie; Baldwin, Maude W.; Wu, Shaoyuan; Rheindt, Frank E.; van Grouw, Hein; Edwards, Scott V.; Gwee, Chyi Yin
    Description

    Local chickens that do not conform to a breed are labelled as “village”. (XLSX)

  19. Genome assemblies and respective wg/cgMLST profiles of a diverse dataset...

    • zenodo.org
    bin, zip
    Updated Jul 24, 2023
    + more versions
    Cite
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2023). Genome assemblies and respective wg/cgMLST profiles of a diverse dataset comprising 3,076 Campylobacter jejuni isolates [Dataset]. http://doi.org/10.5281/zenodo.7230105
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 2,794-loci whole-genome (wg) Multiple Locus Sequence Type (MLST) profiles [INNUENDO schema (Llarena et al. 2018) available in chewie-NS (Mamede et al. 2022)] of a final set of 3,076 Campylobacter jejuni samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 476 different STs are represented in this dataset, with ST21, ST50, ST48, ST45 and ST257 being the most represented ones and, together, corresponding to 29.1% of the dataset.

    File “Cj_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Cj_profiles_wgMLST.tsv” is a tab-separated file with the 2,794-loci wgMLST profiles of each isolate presented in the metadata file. The files “profiles/Cj_profiles_cgMLST_95.tsv”, “profiles/Cj_profiles_cgMLST_98.tsv” and “profiles/Cj_profiles_cgMLST_100.tsv” contain the 1,012-loci, 987-loci and 29-loci cgMLST profiles of each isolate presented in the metadata file, respectively. These profiles were determined as explained below.

    Dataset selection and curation

    With the objective of creating a diverse dataset of C. jejuni genome assemblies, we collected information about the genetic diversity (serotype) of the isolates available in the PubMLST database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 3,539 samples. The majority of them are associated with the INNUENDO project (Llarena et al. 2018). The remaining ones are associated with five BioProjects (PRJEB31119, PRJEB38253, PRJEB40238, PRJEB4165 and PRJNA350537). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, as well as MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the others, we noticed that a considerable proportion of assemblies was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated in the final dataset (those samples are labeled in the Metadata file). In total, 3,076 isolates passed this curation step and were included in the final dataset. wgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 2,794-loci INNUENDO schema available in chewie-NS (Llarena et al. 2018; Mamede et al. 2022) and downloaded on May 31st, 2022. Three cgMLST schemas were obtained with ReporTree v1.0.0 (Mixão et al. 2022) using the 2,794-loci wgMLST profiles of the 3,076 isolates as input and setting distinct “--site-inclusion” thresholds: 0.95, 0.98 and 1.0 (i.e., keep schema loci called in at least 95%, 98% and 100% of the samples, resulting in 1,012-loci, 987-loci and 29-loci allelic matrices, respectively).
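
The site-inclusion rule described above (keep a locus only if it is called in at least a given fraction of samples) can be illustrated with a small Python sketch. This is not ReporTree's implementation. The function name `loci_passing_threshold`, the in-memory profile layout, and the use of "0" as the placeholder for an uncalled locus are all assumptions made for illustration.

```python
from typing import Dict, List

MISSING = "0"  # assumed placeholder for an uncalled locus (hypothetical)


def loci_passing_threshold(
    profiles: Dict[str, List[str]],
    loci: List[str],
    site_inclusion: float,
) -> List[str]:
    """Return the loci called in at least `site_inclusion` of the samples.

    `profiles` maps a sample name to its allele calls, ordered like `loci`.
    """
    n_samples = len(profiles)
    kept = []
    for j, locus in enumerate(loci):
        # Count the samples in which this locus has an allele call.
        called = sum(1 for alleles in profiles.values() if alleles[j] != MISSING)
        if n_samples and called / n_samples >= site_inclusion:
            kept.append(locus)
    return kept


if __name__ == "__main__":
    loci = ["locusA", "locusB", "locusC"]
    profiles = {
        "s1": ["1", "2", "3"],
        "s2": ["1", "2", "0"],
        "s3": ["4", "0", "0"],
        "s4": ["1", "5", "7"],
    }
    # Stricter thresholds keep fewer loci, as in the 0.95 / 0.98 / 1.0 schemas.
    for threshold in (0.5, 0.75, 1.0):
        print(threshold, loci_passing_threshold(profiles, loci, threshold))
```

In this toy input, locusA is called in all four samples, locusB in three, and locusC in two, so raising the threshold shrinks the retained schema in the same way the three cgMLST schemas shrink from 1,012 to 29 loci.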

    Acknowledgements

    We thank the National Distributed Computing Infrastructure of Portugal (INCD) for providing the necessary resources to run the genome assemblies. INCD was funded by FCT and FEDER under the project 22153-01/SAICT/2016.

  20. Improving the efficiency of single cell genome sequencing based on...

    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2022
    Cite
    Jing Tu; Zengyan Yang; Na Lu; Zuhong Lu (2022). Improving the efficiency of single cell genome sequencing based on overlapping pooling strategy [Dataset]. http://doi.org/10.5061/dryad.v6wwpzgwr
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    Southeast University
    Authors
    Jing Tu; Zengyan Yang; Na Lu; Zuhong Lu
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Single cell genome sequencing has become a useful tool in medicine and biology studies. However, an independent library is required for each cell in single cell genome sequencing, so the cost grows in step with the number of cells. In this study, we report efficient single-cell copy number variation (CNV) analysis based on an overlapping pooling strategy together with a branch and bound (B&B) algorithm. Single cells are pooled with overlap before sequencing, and are later sorted into specific types by estimating their CNV patterns with the B&B algorithm. Instead of constructing a library for each cell, a library is required only for each pool. As long as the number of pools is smaller than the number of cells, fewer libraries are needed and the cost is lower. Through computer simulations, we pooled 80 cells with overlap into 40 and 27 pools and classified them into cell types based on CNV pattern. The results showed that 84% of cells in 40 pools and 76.5% of cells in 27 pools were correctly classified on average, while only half or one-third of the sequencing libraries were required. Combined with traditional approaches, our method is expected to significantly improve the efficiency of single cell genome sequencing.

    Methods

    The dataset contains the statistics of the sequencing data and the copy number profiles of the single cells.

    The single-cell sequencing data of 80 single cells from 7 tumor patients with Triple-Negative Breast Cancer (TNBC) were downloaded in FASTQ format from the National Center for Biotechnology Information (NCBI) [15] under Sequence Read Archive (SRA) accession SRP064210.

    Basic statistics were computed on BAM files mapped to the human genome hg19. We followed the protocol put forward by Baslan et al. to obtain the copy number profile of single cells.
