35 datasets found
  1. Gap filling of real genomics data based on k-means clustering of populations...

    • catalog.eoxhub.fairicube.eu
    bin, data
    Updated May 14, 2025
    Cite
    (2025). Gap filling of real genomics data based on k-means clustering of populations [Dataset]. https://catalog.eoxhub.fairicube.eu/collections/ML%20collection/items/LFT4VLQJOC
    Available download formats
    data, bin
    Dataset updated
    May 14, 2025
    License

    https://spdx.org/licenses/MIT.html

    Time period covered
    May 14, 2025
    Area covered
    Earth
    Description

    Gap filling of real genomics data based on k-means clustering of populations

  2. Apply elbow method and k-means clustering to gap fill genomics data

    • catalog.eoxhub.fairicube.eu
    bin, data
    Updated May 14, 2025
    Cite
    (2025). Apply elbow method and k-means clustering to gap fill genomics data [Dataset]. https://catalog.eoxhub.fairicube.eu/collections/ML%20collection/items/RPNGH2B5U0
    Available download formats
    bin, data
    Dataset updated
    May 14, 2025
    License

    https://spdx.org/licenses/MIT.html

    Time period covered
    May 14, 2025
    Area covered
    Earth
    Description

    Apply elbow method and k-means clustering to gap fill genomics data
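The entry gives no implementation details, so the following is only a minimal sketch of what "elbow method plus k-means gap filling" could look like, not the dataset authors' code. It assumes missing values are marked as `None`, that gaps are provisionally filled with column means before clustering, and that the elbow is taken at the largest second difference of the inertia curve. The function names `kmeans`, `elbow_k`, and `gap_fill` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(rows, centroids):
    """Index of the nearest centroid for every row."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(r, centroids[j]))
            for r in rows]


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on a list of float lists; returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        labels = assign(rows, centroids)
        updated = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            # Keep the previous centroid if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(rows, centroids)
    inertia = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, inertia


def elbow_k(inertias):
    """Elbow heuristic (an assumption): k with the largest second difference.

    inertias[i] is the inertia for k = i + 1; at least three values are needed.
    """
    second = [inertias[i] - 2 * inertias[i + 1] + inertias[i + 2]
              for i in range(len(inertias) - 2)]
    return second.index(max(second)) + 2


def gap_fill(rows, k_max=5, seed=0):
    """Replace None entries with the observed mean of the row's cluster."""
    n_cols = len(rows[0])

    def observed_mean(subset, c, fallback):
        vals = [r[c] for r in subset if r[c] is not None]
        return sum(vals) / len(vals) if vals else fallback

    col_means = [observed_mean(rows, c, 0.0) for c in range(n_cols)]
    # Provisional fill so that every row can take part in the clustering.
    provisional = [[col_means[c] if v is None else v for c, v in enumerate(r)]
                   for r in rows]
    runs = [kmeans(provisional, k, seed=seed) for k in range(1, k_max + 1)]
    k = elbow_k([inertia for _, inertia in runs])
    labels = runs[k - 1][0]
    filled = []
    for r, lab in zip(rows, labels):
        peers = [p for p, pl in zip(rows, labels) if pl == lab]
        filled.append([observed_mean(peers, c, col_means[c]) if v is None else v
                       for c, v in enumerate(r)])
    return filled, k
```

Calling `gap_fill(rows)` returns the filled table together with the chosen number of clusters. Filling from the cluster mean rather than the global column mean is the design choice being illustrated: a gap in a sample is estimated from the samples most similar to it.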

  3. Data from: RSeqFlow-OlivePollen10 - OE-kmeans.zip: Enrichment results for...

    • figshare.com
    zip
    Updated Aug 4, 2023
    Cite
    M. Gonzalo Claros (2023). RSeqFlow-OlivePollen10 - OE-kmeans.zip: Enrichment results for k-means clusters. [Dataset]. http://doi.org/10.6084/m9.figshare.23587269.v1
    Available download formats
    zip
    Dataset updated
    Aug 4, 2023
    Dataset provided by
    figshare
    Authors
    M. Gonzalo Claros
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A compressed file containing the biological processes and KEGG pathways for each of the three clusters obtained using k-means, as tab-separated lists. Columns in each file are the same as in RSeqFlow-OlivePollen8 for biological processes (files starting with BP-) and RSeqFlow-OlivePollen9 for KEGG pathways (files starting with KEGG-).

  4. Data from: A universal probe set for targeted sequencing of 353 nuclear...

    • explore.openaire.eu
    • search.dataone.org
    • +2 more
    Updated Dec 4, 2018
    Cite
    Matthew G. Johnson; Lisa Pokorny; Steven Dodsworth; Laura R. Botigue; Robyn S. Cowan; Alison Devault; Wolf L. Eiserhardt; Niroshini Epitawalage; Félix Forest; Jan T. Kim; James Leebens-Mack; Ilia J. Leitch; Olivier Maurin; Doug Soltis; Pamela S. Soltis; Gane Ka-Shu Wong; William J. Baker; Norman Wickett (2018). Data from: A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering [Dataset]. http://doi.org/10.5061/dryad.s3h9r6j
    Dataset updated
    Dec 4, 2018
    Authors
    Matthew G. Johnson; Lisa Pokorny; Steven Dodsworth; Laura R. Botigue; Robyn S. Cowan; Alison Devault; Wolf L. Eiserhardt; Niroshini Epitawalage; Félix Forest; Jan T. Kim; James Leebens-Mack; Ilia J. Leitch; Olivier Maurin; Doug Soltis; Pamela S. Soltis; Gane Ka-Shu Wong; William J. Baker; Norman Wickett
    Description

    Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, five to 15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order groups of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set. 
Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order groups, including the entire angiosperm clade itself.

    • Supplemental Table 1: Variable Sites in Selected Genera. Variable sites in Angiosperms353 genes within the genera Oenothera, Linum, Neurachne, and Portulaca. VariableSites.xlsx
    • Voucher_table. Voucher table for samples used to test Angiosperm353 dataset.
    • onekp_only_angios_degapped. Multiple sequence alignments for genes used in Angiosperms353 probe design (FASTA format). Original alignments contained up to 1400 sequences, which have been reduced to remove non-angiosperms. All gap-only sites have been removed.
    • onekp_only_angios_pdistance. Pairwise matrices of sequence dissimilarity for angiosperm sequences used to design the Angiosperms353 probe set.
    • Angiosperms353_extracted_seqeuences.tar. Sequences extracted from Angiosperms353 via HybPiper for 42 species of angiosperms. Includes: coding sequences ("exon"), non-coding ("intron") sequences and concatenated coding/non-coding ("supercontig") sequences.
    • SuppFig1_RecoveryVSGenomeSize. Supplemental figure showing the relationship between genome size and gene recovery rate using the Angiosperms353 target capture set on 42 angiosperm species.
    • Angiosperms353 Gene Annotations. All genes in the Angiosperms353 target capture set annotated using Arabidopsis orthologs. Supplementary_File_1.tar.gz
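The description says k-medoids clustering was used to find the minimum number of sequences needed to represent each gene. As a rough illustration only, the sketch below runs a simple alternating k-medoids on a precomputed pairwise distance matrix (such as the p-distance matrices listed above) and increases k until every sequence lies within a chosen distance of its medoid. The stopping rule, the `max_dist` threshold, and the function names are assumptions for illustration, not the authors' published procedure, and the heuristic returns the first k that succeeds for a given random start, which is not guaranteed to be the true minimum.

```python
import random


def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating k-medoids on a square distance matrix (list of lists).

    Returns (medoids, labels): medoid row indices and, for every item,
    the position in `medoids` of its nearest medoid.
    """
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)

    def nearest(i):
        return min(range(k), key=lambda j: dist[i][medoids[j]])

    for _ in range(n_iter):
        labels = [nearest(i) for i in range(n)]
        updated = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                # Keep the old medoid if its cluster is empty.
                updated.append(medoids[j])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members,
                               key=lambda c: sum(dist[c][m] for m in members)))
        if updated == medoids:
            break
        medoids = updated
    labels = [nearest(i) for i in range(n)]
    return medoids, labels


def min_representatives(dist, max_dist, seed=0):
    """Smallest k tried for which every item is within max_dist of its medoid."""
    n = len(dist)
    for k in range(1, n + 1):
        medoids, labels = k_medoids(dist, k, seed=seed)
        worst = max(dist[i][medoids[labels[i]]] for i in range(n))
        if worst <= max_dist:
            return medoids, labels
    raise ValueError("max_dist must be non-negative")
```

Because medoids are actual members of the data set, each one corresponds to a real sequence that could be used as a representative, which is what distinguishes this approach from k-means and its averaged centroids.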

  5. Additional file 2 of Predicting clinical outcomes in neuroblastoma with...

    • springernature.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Ilyes Baali; D Alp Emre Acar; Tunde W. Aderinwale; Saber HafezQorani; Hilal Kazan (2023). Additional file 2 of Predicting clinical outcomes in neuroblastoma with genomic data integration [Dataset]. http://doi.org/10.6084/m9.figshare.7145084.v1
    Available download formats
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ilyes Baali; D Alp Emre Acar; Tunde W. Aderinwale; Saber HafezQorani; Hilal Kazan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Differential Expression Results. This spreadsheet contains the results of limma analysis on RNA-seq-MAV data. The data for the top 500 genes with smallest adjusted p-values are included. The columns indicate gene ID, log fold change, average expression of the gene, t-statistic, p-value, adjusted p-value, B-statistic and a column that indicates whether this gene has been found to be associated with neuroblastoma in literature. (XLS 110 kb)

  6. Additional file 11: of The molecular landscape of premenopausal breast...

    • springernature.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Serena Liao; Ryan Hartmaier; Kandace McGuire; Shannon Puhalla; Soumya Luthra; Uma Chandran; Tianzhou Ma; Rohit Bhargava; Francesmary Modugno; Nancy Davidson; Steve Benz; Adrian Lee; George Tseng; Steffi Oesterreich (2023). Additional file 11: of The molecular landscape of premenopausal breast cancer [Dataset]. http://doi.org/10.6084/m9.figshare.c.3644942_D1.v1
    Available download formats
    xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Serena Liao; Ryan Hartmaier; Kandace McGuire; Shannon Puhalla; Soumya Luthra; Uma Chandran; Tianzhou Ma; Rohit Bhargava; Francesmary Modugno; Nancy Davidson; Steve Benz; Adrian Lee; George Tseng; Steffi Oesterreich
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gene list for clustering premenopausal (preM) estrogen receptor-positive (ER+) tumors. Table S1. Gene list selected by sparse k-means algorithm in The Cancer Genome Atlas (TCGA) data. Table S2. Genes selected based on TCGA data that are also in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) data for validation. Table S3. Fixed number of genes (n = 21), gene list being selected from sparse k-means in TCGA. Table S4. Gene list selected by semi-supervised algorithm in METABRIC. Table S5. Fixed number of genes (n = 21), gene list being selected by semi-supervised algorithm in TCGA. Table S6. Genes (n = 28) in the LumA cluster that are significantly different between clusters 1 and 3. (XLSX 37 kb)

  7. Data from: Condition‐dependent co‐regulation of genomic clusters of...

    • agdatacommons.nal.usda.gov
    bin
    Updated Feb 13, 2024
    Cite
    Mélanie Massonnet; Abraham Morales‐Cruz; Rosa Figueroa‐Balderas; Daniel P. Lawrence; Kendra Baumgartner; Dario Cantu (2024). Data from: Condition‐dependent co‐regulation of genomic clusters of virulence factors in the grapevine trunk pathogen Neofusicoccum parvum [Dataset]. http://doi.org/10.1111/mpp.12491
    Available download formats
    bin
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    Molecular Plant Pathology
    Authors
    Mélanie Massonnet; Abraham Morales‐Cruz; Rosa Figueroa‐Balderas; Daniel P. Lawrence; Kendra Baumgartner; Dario Cantu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ascomycete Neofusicoccum parvum, one of the causal agents of Botryosphaeria dieback, is a destructive wood‐infecting fungus and a serious threat to grape production worldwide. The capability to colonize woody tissue, combined with the secretion of phytotoxic compounds, is thought to underlie its pathogenicity and virulence. Here, we describe the repertoire of virulence factors and their transcriptional dynamics as the fungus feeds on different substrates and colonizes the woody stem. We assembled and annotated a highly contiguous genome using single‐molecule real‐time DNA sequencing. Transcriptome profiling by RNA sequencing determined the genome‐wide patterns of expression of virulence factors both in vitro (potato dextrose agar or medium amended with grape wood as substrate) and in planta. Pairwise statistical testing of differential expression, followed by co‐expression network analysis, revealed that physically clustered genes coding for putative virulence functions were induced depending on the substrate or stage of plant infection. Co‐expressed gene clusters were significantly enriched not only in genes associated with secondary metabolism, but also in those associated with cell wall degradation, suggesting that dynamic co‐regulation of transcriptional networks contributes to multiple aspects of N. parvum virulence. In most of the co‐expressed clusters, all genes shared at least a common motif in their promoter region, indicative of co‐regulation by the same transcription factor. Co‐expression analysis also identified chromatin regulators with correlated expression with inducible clusters of virulence factors, suggesting a complex, multi‐layered regulation of the virulence repertoire of N. parvum. Resources in this dataset:Resource Title: Link to Supporting Information. File Name: Web Page, url: https://bsppjournals.onlinelibrary.wiley.com/doi/10.1111/mpp.12491#support-information-section Link to Supporting Information at Molecular Plant Pathology. 
Files are: Appendix S1 Supplementary tables and figures - Download

    • Table S1: Statistics and SRA accession numbers of PacBio and Illumina genome sequences of N. parvum UCD646So.
    • Table S2: Comparison of repeat content between assemblies generated with PacBio (N. parvum isolate UCD646So) and Illumina reads (N. parvum isolate UCR-NP2; Blanco-Ulate et al., 2013).
    • Table S3: Comparison of the predicted proteomes in N. parvum isolate UCD646So and N. parvum isolate UCR-NP2 (Blanco-Ulate et al., 2013).
    • Table S4: Gene space completeness estimations using CEGMA (Parra et al., 2009) and BUSCO (Simão et al., 2015).
    • Table S5: N. parvum CAZymes families involved in plant cell wall degradation.
    • Table S6: Summary of the major putative virulence categories of differentially expressed genes.
    • Table S7: Summary of RNA-seq data and mapping metrics.
    • Fig. S1: (A) Contig length distribution (log10 scale) over the N. parvum genome in the assemblies generated using PacBio reads and Illumina reads. (B) Dot plot showing the nucmer alignments between the contigs of the N. parvum UCD646So and N. parvum UCR-NP2 genomes.
    • Fig. S2: Graphical representation of telomere sequences found at the ends of the N. parvum contigs. Figure was prepared using WebLogo (Crooks et al., 2004).
    • Fig. S3: Number of reads mapped onto N. parvum UCD646So transcriptome per sample in the in planta (A) and in vitro (B) experiments.
    • Fig. S4: Hierarchical clustering analysis of the 78 DE genes during N. parvum infections of grapevine woody stems, using Pearson's correlation distance (MeV; Saeed et al., 2003).
    • Fig. S5: Identification of putatively constitutively expressed genes during N. parvum stem infections using Pearson correlation coefficient (R) and coefficient of variation (CV) cutoffs.
    • Fig. S6: Estimation of the most appropriate number of clusters for k-means clustering. Line plot shows figure of merit (FOM; y-axis) values as a function of the number of clusters (1-20 clusters, 100 iterations) (MeV v.4.9; Saeed et al., 2003).

    Appendix S2 Genome assemblies and protein‐coding gene coordinates - Download Appendix S3 Functional annotations - Download Excel (.xlsx) file. Appendix S4 Normalized RNA‐sequencing counts - Download Normalized RNA‐sequencing counts in the in vitro (A) and in planta (B) experiments, list of genes up‐regulated in the presence of wood (C) and exclusively expressed in planta (D), and groups of co‐expressed genes during Neofusicoccum parvum colonization obtained by both K‐means and hierarchical clustering analysis (E). Gene co‐expression modules obtained from Weighted Gene Co‐expression Network Analysis (WGCNA) and the corresponding degree of connectivity in the unweighted network (F), genomic clusters identified among the gene co‐expression modules (G), network properties of the gene co‐expression modules (H) and transcription factor‐coding genes and PHD finger domain‐containing protein genes identified among the most highly connected genes (5%) (I). Appendix S5 Shared motifs showing similarity to yeast motifs - Download Shared motifs showing similarity to yeast motifs (MacIsaac_v1 database) and Saccharomyces cerevisiae motifs and motif‐associated proteins (ScAPs) (SCPD database) (E 

  8. The heritability of size in a wild annual plant population with hierarchical...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Jul 19, 2024
    Cite
    Daniel Schoen (2024). The heritability of size in a wild annual plant population with hierarchical size structure [Dataset]. http://doi.org/10.5061/dryad.xwdbrv1jg
    Available download formats
    zip
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    McGill University
    Authors
    Daniel Schoen
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The relative magnitude of additive genetic versus residual variation for fitness traits is important in models for predicting the rate of evolution and population persistence in response to changes in the environment. In many annual plants, lifetime reproductive fitness is correlated with end-of-season plant biomass, which can vary significantly from plant to plant in the same population. We measured end-of-season plant biomasses and obtained SNP genotypes of plants in a dense, natural population of the annual plant species Impatiens capensis with hierarchical size structure. These data were used to estimate the amount of heritable variation for position in the size hierarchy and for plant biomass. Additive genetic variance for position in the size hierarchy and plant biomass were both significantly different from zero. These results are discussed in relationship to theory for the heritability of fitness in natural populations and ecological factors that potentially influence heritable variation for fitness in this species.

    Methods

    Study population

    The study population of Impatiens capensis is in Glen Sutton, Quebec, Canada (45° 02′ 37″ N, 72° 32′ 57″ W). The plants occur in damp soil within an irregularly shaped area of ca. 150 m², beneath a canopy of a mixed, mature deciduous-evergreen (Acer saccharum-Tsuga canadensis) forest. I. capensis plants in this population form nearly pure stands that emerge as a near continuous carpet of seedlings on the forest floor. The density of individuals remains high (ca. 200-250 per m²) at the end of the season. Early season seedling density was at least twice as high. In I. capensis, the Pearson correlation between chasmogamous seed production and end of season biomass is r = 0.95, and between overall seed production and biomass is r = 0.92 (Waller, 1979).
Small plants have been shown to produce no chasmogamous flowers or fruit at all (Waller, 1979), thus making position in the size hierarchy an interesting fitness component for study.

    Impatiens capensis reference genome

    From a single plant collected in the study population, 10 g of young leaves were harvested and frozen in liquid nitrogen. Genomic DNA was extracted from this tissue and sequenced by Dovetail Genomics/Cantata Bio LLC (Scotts Valley, California). For each Dovetail Omni-C library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DNAse I, chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends (Jordan Zhang, Dovetail Genomics, pers. comm.). After proximity ligation, crosslinks were reversed, and the DNA purified. Purified DNA was treated to remove biotin that was not internal to ligated fragments, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters (Jordan Zhang, Dovetail Genomics, pers. comm.). Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The library was sequenced on an Illumina HiSeqX platform to produce approximately 30x sequence coverage. The input de novo assembly and Dovetail OmniC library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al., 2016). Dovetail OmniC library sequences were aligned to the draft input assembly using bwa (https://github.com/lh3/bwa). The separations of Dovetail OmniC read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative mis-joins, to score prospective joins, and make joins (Jordan Zhang, Dovetail Genomics, pers. comm.).

    Plant biomass distribution in a random sample

    At the end of the 2020 growing season for Impatiens capensis (15 September), we recorded plant height and numbers of leaves from 97 plants collected at random in three haphazardly placed 1 m² quadrats within the study population. All 97 plants were weighed fresh to the nearest 0.01 g, and linear regression was used to establish a predictive relationship for plant biomass based on plant height and leaf number per plant (r² = 0.72, P < 0.001). Size (biomass) inequality was examined by ranking plants from lightest weight to heaviest and graphing cumulative biomass against rank order and calculating the Gini coefficient, a measure of inequality of resource distribution (Dorfman, 1979). K-means clustering (Pedregosa et al., 2011) of the biomasses of 97 randomly sampled plants was used to determine the level of support for distinct plant size clusters, as well as the biomass cutoff that separates the clusters.

    Genotyping-by-sequencing and population genetics analyses

    Leaf tissue from the small and large sampled plants was collected and preserved for DNA extraction using silica gel. DNA from samples of the preserved leaves was extracted and processed for genotyping-by-sequencing (GBS) (Elshire et al., 2011) at the University of Wisconsin Biotechnology Center. Genomic DNA was digested with the restriction enzyme ApeK1 and ligated to adapters and barcodes to create the GBS libraries. A NovaSeq6000 sequencer was used to obtain paired-end (150 bp) sequence reads from the libraries. The depth of coverage was approximately 180x. Raw reads were demultiplexed and filtered for Illumina adapter sequences and PCR duplicates with process_radtags and clone_filter (STACKS version 2.60; Rochette, Rivera-Colon and Catchen, 2019).
The demultiplexed and filtered reads were then aligned to the reference genome using the Burrows-Wheeler Alignment (Li and Durbin, 2009) tool as implemented in bwakit version 0.7.12 and converted to bam files using SAMtools version 1.13 (Lin et al., 2009). SNPs with quality scores > 30 were identified with gstacks (STACKS version 2.60) and further processed with populations (STACKS version 2.60), which was used to filter out loci not found in at least 95% of individuals in each of the five quadrats sampled and where the minimum allele frequency for the less common SNP allele was < 0.01. The STACKS populations program was used to calculate population genetics statistics (nucleotide diversity and Wright’s Fst), for reporting the results of SNP filtering, and to produce a vcf file of SNP genotypes for each sampled individual. Prior to the estimation of the genomic relationship matrix (GRM; see below) we applied vcfR (version 1.12; Knaus & Grunwald, 2001) and SNPfiltR (version 1.01; DeRaad, 2022) in R version 4.1.1 (R Core Team, 2021) to filter out SNPs with low read depths (< 7). To avoid spurious associations that might inflate relationship estimates of the GRM, SNPs with r² (squared correlation coefficient between the alleles at two loci) > 0.05 and within 1000 kb windows were filtered out with LDAK (version 5.2; https://dougspeed.com/ldak; Speed et al., 2020), which used bed genotype files as input, created from the original vcf file using Plink (version 1.9, Chang et al., 2015).

    Analyses of genetic variance and heritable variation for position in the size hierarchy and plant biomass

    We treated size either as a threshold trait and analyzed it on a liability scale (small plants versus large plants) as is done in some agricultural genetic studies for traits such as secondary compound content and disease resistance (Merrick et al., 2023), or we directly assayed biomass. We used four different methods to estimate heritability of plant size.
These are implemented in LDAK. The first two methods, Phenotype Correlation–Genotype Correlation (PCGC) and TetraHer, use a liability threshold model, such that the binary outcome (small versus large) indicates whether the unobserved liability is above or below a threshold. Both PCGC and TetraHer estimate heritability on the liability scale by measuring the extent that the estimated relatedness between pairs of individuals correlates with their estimated liabilities. PCGC considers all pairs of individuals, and measures pairwise relatedness based on allelic correlations (inferred from the GRM). By contrast, TetraHer considered only the 3,606 pairs of individuals identified as having at least 17.5% identity by descent (IBD) using the kinship analysis software KING (Manichaikul et al., 2010). Both methods adjust for ascertainment (i.e., the fact that our sample was enriched for large plants, relative to the natural population). Prevalence of large plants was determined as the proportion of large plants observed in the random sample of 97 plants, as classified by K-means clustering (see above). We additionally obtained estimates of heritability on the observed scale (a continuous variable). For this we used restricted maximum likelihood (REML) and QuantHer, which are similar to PCGC and TetraHer. We note that sampling weights from the smallest and largest plants for REML and PCGC constitutes “extreme sampling”, but this has been shown previously to have a minimal biasing effect on heritability estimation (Golan et al., 2014). Nevertheless, we also conducted our own simulation analyses to gauge the possible effects of sampling protocol on our own results (see Supplementary Methods for details).

    References

    Anderson, J. T. (2016). Plant fitness in a rapidly changing world. New Phytologist 210:81-87. Bérénos, C., Ellis, P. A., Pilkington, J. G., & Pemberton, J. M. (2014). Estimating quantitative genetic parameters in wild populations: A comparison of pedigree and genomic approaches. Molecular Ecology, 23(14), 3434-3451. Bontemps, A., Lefèvre, F., Davi, H., & Oddou‐Muratorio, S. (2016). In situ marker‐based assessment of leaf trait evolutionary potential in a marginal European beech population. Journal of Evolutionary Biology, 29(3), 514-527. Burt, A. (1995). The evolution of fitness. Evolution, 49(1), 1-8. Castellanos, M. C., González‐Martínez, S. C., & Pausas, J. G.
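The methods above describe two size-structure calculations: a Gini coefficient of plant biomass and a K-means split of biomasses into small and large plants with a cutoff between them. The sketch below illustrates both in plain Python under stated assumptions. The description cites scikit-learn (Pedregosa et al., 2011) for K-means, whereas this sketch substitutes a hand-written one-dimensional two-cluster version to stay dependency-free, and the function names `gini` and `two_cluster_cutoff` are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfect equality).

    Uses the rank-weighted form G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_i are the values sorted ascending and i runs from 1 to n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def two_cluster_cutoff(values, n_iter=100):
    """One-dimensional k-means with k = 2.

    Returns (cutoff, small_mean, large_mean); the cutoff is the midpoint
    between the two cluster means.
    """
    xs = sorted(values)
    lo, hi = xs[0], xs[-1]  # start the two centroids at the extremes
    for _ in range(n_iter):
        cut = (lo + hi) / 2.0
        small = [x for x in xs if x <= cut]
        large = [x for x in xs if x > cut]
        if not large:  # all values identical, nothing to split
            break
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0, lo, hi


def prevalence_above(values, cutoff):
    """Proportion of values above the cutoff (e.g. prevalence of large plants)."""
    return sum(1 for x in values if x > cutoff) / len(values)
```

In this sketch the prevalence of large plants, which the liability-scale methods take as an input, is simply the share of the random sample that falls above the K-means cutoff.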

  9. Data for: The Precision Oncology Approach to Molecular Cancer Therapeutics...

    • search.dataone.org
    • data.mendeley.com
    Updated Sep 24, 2024
    + more versions
    Cite
    Kumar, Manish (2024). Data for: The Precision Oncology Approach to Molecular Cancer Therapeutics Targeting Oncogenic Signaling Pathways is a Means to an End [Dataset]. http://doi.org/10.7910/DVN/4MLCUG
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kumar, Manish
    Description

    Cancer is a deadly genetic disease with diverse aspects of complexity, including cancer immune evasion, treatment resistance, and recurrence, requiring optimized treatment to be cured. Molecular studies have revealed that tumors are profoundly heterogeneous in nature, leading to the complexity of cancer progression that is ultimately linked to its genetic machinery. It is important to note that patients with the same types of cancer respond differently to cancer treatments, indicating the need for patient-specific treatment options. This requires an in-depth genomic study of the patient's tumors to fully understand the driving factors of cancer for effective targeted therapy. Precision oncology has evolved as a form of cancer therapy focused on genetic profiling of tumors to identify molecular alterations involved in cancer development for tailored individualized treatment of the disease. Whole genome sequencing, tumor and cell-free DNA profiling, transcriptomics, proteomics and exploration of the cancer immune system form the basis of this field of cancer research and treatment. This article aims to briefly explain the foundations and frontiers of precision oncology in the context of ongoing technological advancements in related fields of study in order to assess its scope and importance in achieving an effective cure for cancer.

  10. Transcription profiling of mouse erythroid progenitor (EPC) derived from...

    • datamed.org
    Updated Jun 10, 2011
    Cite
    Diya Zhang (2011). Transcription profiling of mouse erythroid progenitor (EPC) derived from splenocytes at stage I, II and III [Dataset]. https://datamed.org/display-item.php?repository=0006&id=5913c5345152c62a9fc27c1d&query=UTP14A
    Dataset updated
    Jun 10, 2011
    Authors
    Diya Zhang
    Description

    Expression profiles were analyzed (via Affymetrix MG-U74Av2 arrays) at three distinct stages of late erythroid progenitor cell development. Stages correspond to Kit(+)/EpoR-high/CD71(+)/Ter119(-) proerythroblasts [stage I]; Kit-low/EpoR(+)/CD71-high/Ter119(+) polychromatophilic erythroblasts [stage II]; and Kit(-)/EpoR-low/CD71-low/Ter119-high/lowFALS normoblasts [stage III]. Primary splenic (pro)erythroblasts were prepared at 80, 100 and 120 hours post thiamphenicol dosing, and were purified by MACS using a transgenic erythroid-restricted “EE” tag. For each stage, duplicate hybridizations were performed. GSM3371 and GSM3372 are results from stage I hybridizations; GSM3373 and GSM3374 are results for stage II; and GSM3375 and GSM3376 are results for stage III. The order of listed genes is organized based first on average p-value among all six hybridizations in ascending order, and second on fold-increases in expression levels from stage I to stage II (descending order).

    B] METHOD DETAILS (as a supplement to Zhang et al., submitted) B1-B6 below:

    B.1/ ERYTHROID PROGENITOR CELL PREPARATIONS: Thiamphenicol (TAP) (Sigma, St. Louis, MO) was administered on day 1 to 8-week old Gata1-“EE” mice 1 as a subcutaneous implant in Spectra Por 2 tubing (Spectrum, Houston, TX) [2]. On days 2 through 4, mice were phlebotomized (80 mL blood per day), and TAP implants were removed on day 6. Splenocytes were prepared at 80, 100, or 120 hours post TAP withdrawal by disruption of spleens in Dulbecco’s Modified Eagle’s Medium (DMEM) containing 2% fetal bovine serum (FBS) and 0.1 U/mL Epo. Cells were passed through a 70 µm strainer (Falcon, Franklin Lakes, NJ), collected at 300 x g for 10 minutes, resuspended in 1 mL of phosphate-buffered saline (PBS, 140 mM NaCl, 2.7 mM KCl, 8.1 mM Na2HPO4, 1.2 mM KH2PO4, pH 7.4), and exposed for 2 minutes to 9 mL of 0.8% NH4Cl, 0.01 mM Na2EDTA buffered at pH 7.5 with KHCO3. Cells then were collected through 50% FBS in PBS, and washed in DMEM.
B.2/ MACS: Splenocytes (3x10^8 cells per 3 spleens per preparation) in 3 mL of 5mM Na2EDTA, 0.5 % BSA in PBS, pH 7.2 (PBE) were incubated for 5 minutes at 15°C with murine IgG Fc fragment (5 μg per mL) (Pierce, Rockford, IL), and pre-absorbed for 20 minutes at 15°C with anti-mouse IgG2a+2b magnetic microbeads (100 μL per mL) (Miltenyi, Bergisch Gladbach, Germany). Samples were passed through a magnetized MACS-LS column (non-specific adsorption) and unbound cells were collected, washed and resuspended (at 1 x 10^8 cells per mL) in 3 mL of PBE buffer. Cells then were incubated at 15°C for 15 minutes with monoclonal antibody EGFR.1 (5 μg per mL) (PharMingen, San Diego, CA), washed in 20 mL of 2°C PBE buffer, resuspended in PBE at 1 x 10^8 cells per mL, and incubated for 20 minutes at 15°C with anti-mouse IgG2a+2b magnetic microbeads (200 μL per mL). Bead-tagged (pro)erythroblasts were washed in 2°C PBE buffer, resuspended in 3 mL PBE, and applied to a MACS-LS column. Columns were washed six times with 3 mL of PBE, and EE-positive cells were recovered in 5 mL of PBE buffer. Overall, 6 x 10^6 stage I, II and III (pro)erythroblasts were purified from 12, 4, and 3 mice, respectively. B.3/ RNA ISOLATION AND ANALYSIS: Purified (pro)erythroblasts (6 x 10^6 cells) were lysed in 1 mL Trizol reagent (Life Technologies, Gaithersburg, MD). Chloroform (0.2 mL) was added and samples were vortexed and microcentrifuged at 7,500 rpm for 15 minutes at 4°C. The aqueous phase (approximately 0.5 mL) was recovered, and extracted using 0.5 mL of Trizol LS reagent plus 0.1 mL of chloroform. From the recovered epi-phase (750 μL), RNA was precipitated using isopropanol (750 μL), collected (10 minutes at 4°C, 12,000 rpm), washed with 80% ethanol, air-dried, and dissolved in 35 μl of 70°C DEPC-treated water. 
In Northern blotting, RNA samples were electrophoresed in 6% formaldehyde 1.2% agarose gels, blotted to Nytran membranes (Schleicher and Schuell, Keene, NH), and fixed (312 nm radiation for 3 minutes; 2 hours at 80°C under vacuum). In hybridizations, Redivue deoxyadenosine 5'-(α-32P)-triphosphate labeled probes were prepared by random priming (Prime-a-Gene System, Promega, Madison, WI) using 25 ng of the following cDNA fragments: murine Epo receptor, 880 bp Bgl II to Xba I fragment of pUC19ER-DpI3K [3]; murine beta-major globin, 1,100 bp Bgl II to Xba I fragment of pGEM7-betamaj-globin [4]; murine Kit, 1,500 bp BamH I to Nhe I fragment of pRep4DEB-cKit [5]; murine Gata-1, Xho I fragment of pBluescript-GATA-1 [6]; 32P-labeled probes were isolated using Sephadex G-50 microcolumns (Pharmacia Biotech, Piscataway, NJ). Hybridizations were for 4 hours at 68°C using 2 × 10^6 cpm of probe per mL in QuickHyb solution (Stratagene, La Jolla, CA). Membranes were exposed to X-OMAT AR film (Kodak, Rochester, NY) and were assayed quantitatively by phosphor imaging (Storm 860 system, Molecular Dynamics, Sunnyvale, CA). B.4/ BIOTIN-cRNA PREPARATION, AND ARRAY HYBRIDIZATIONS: First-strand cDNA was synthesized with a T7-oligo (dT) 24 primer (5'-GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG AGG CGG-(dT) 24-3') (Geneset, La Jolla CA) and Superscript II reverse transcriptase (Invitrogen, Carsbad, CA) (1 hour, 42°C). DNA polymerase I (cat# 18010-017, Invitrogen) and RNase H (cat# 18021-041, Invitrogen) were used in second strand syntheses (2 hours, 16°C). Products were extracted in Tris-phenol: chloroform: isoamyl (25:24:1) using a phase-lock gel (cat# 175850, Eppendorf 5-prime, Boulder, CO). Precipitation was with 0.5 volumes of 7.5M NH4Ac, 5 μg of glycogen and 2.5 volumes of -20°C absolute ethanol. Double stranded -cDNA (1 μg in 10 µl DEPC-treated water) was used per T7 RNA pol transcription (Enzo BioArray High Yield biotin-UTP and -CTP RNA Labeling Kit) (cat# 42655, Farmingdale, NY). 
Products were purified using QIAGEN RNeasy columns (cat# 72104) (Valencia, CA). For each sample (16 μg total RNA) approximately 60 μg of purified biotin-labeled cRNA was recovered. Labeled cRNA was fragmented for 35 min at 94°C in 500 mM potassium acetate (C2H3KO2), 150 mM magnesium acetate (C4H6MgO4), 200 mM Tris-acetate (C4H11NO3 • CH3COOH), pH 8.1. Each sample (diluted 1:10 in hybridization buffer) was first hybridized to a Test2 Genechip to confirm full-length transcript representation. Replicate hybridizations were to Affymetrix murine U74Av2 arrays, and data were analyzed using Affymetrix Genechip 5.1 software. B.5/ ANALYSES OF PROFILING OUTCOMES: Transcripts expressed (detected) at each stage were identified based on a detection p-value of <0.05 in one replicate, and <0.1 in the other. 3'/5' ratios of housekeeping genes confirmed uniform biotin cRNA syntheses. Pair-wise comparisons among stages included 4659 transcripts detected in at least one stage, and employed averages of log expression values from duplicate hybridizations for each gene. In these comparisons, modulation is defined by vertical distances from a regression (no-modulation) trendline. For the above 4659 detected transcripts, mean expression values were normalized to account for chip/sample variability by subtracting the stage mean, and dividing by the stage standard deviation. The three normalized values of each gene (one for each stage) then were standardized by subtracting the gene mean and dividing by the gene standard deviation. Non-modulated genes (less than two-fold change) then were removed from further consideration. Clustering of the resulting data was performed by K-means (S+ package), a partitioning algorithm that pursues a minimum of the within-clusters sums of squares (WSS, which measures square Euclidean distances between cluster points and the cluster average within each cluster) [7]. Principal component analysis (PCA) was used to visualize K-means clustering results. 
PCA is a common method for reduction of dimensionality.
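The two-step normalization and the WSS objective described in B.5 can be illustrated with a minimal Python sketch. This is not the authors' S+ code: the function names, the toy matrix, the cluster labels, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from statistics import mean, stdev


def zscore(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def normalize(expr):
    """expr: one row per gene, one mean log-expression value per stage."""
    # Step 1: per-stage (column) normalization for chip/sample variability.
    columns = [zscore(list(col)) for col in zip(*expr)]
    by_gene = [list(row) for row in zip(*columns)]
    # Step 2: per-gene (row) standardization across the stages.
    return [zscore(row) for row in by_gene]


def wss(points, labels):
    """Within-cluster sum of squared Euclidean distances to cluster means."""
    total = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        centroid = [mean(dim) for dim in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total


# Hypothetical toy data: 4 genes x 3 stages, with made-up cluster labels.
toy = [[1.0, 2.0, 4.0], [2.0, 2.5, 1.0], [3.0, 1.0, 2.0], [5.0, 4.0, 0.5]]
normalized = normalize(toy)
example_wss = wss(normalized, [0, 1, 1, 0])
```

K-means searches for the label assignment that makes `wss` as small as possible for a chosen number of clusters; the sketch only evaluates the objective for a given labelling.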

  11. Savanna monkey (Chlorocebus spp.) population genetics/genomics pipeline

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 14, 2022
    Cite
    Christopher Schmitt; Christian Gagnon; Hannes Svardal; Anna Jasinska; Jennifer Danzy Cramer; Nelson Freimer; Paul Grobler; Trudy Turner (2022). Savanna monkey (Chlorocebus spp.) population genetics/genomics pipeline [Dataset]. http://doi.org/10.5061/dryad.k3j9kd59z
    Available download formats: zip
    Dataset updated
    Sep 14, 2022
    Dataset provided by
    University of California, Los Angeles
    University of Wisconsin–Milwaukee
    Boston University
    Naturalis Biodiversity Center
    ROC USA
    University of the Free State
    University of Pittsburgh
    Authors
    Christopher Schmitt; Christian Gagnon; Hannes Svardal; Anna Jasinska; Jennifer Danzy Cramer; Nelson Freimer; Paul Grobler; Trudy Turner
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    In the last 300 thousand years, the genus Chlorocebus expanded from equatorial Africa into the southernmost latitudes of the continent, where colder climate was a likely driver of natural selection. We investigated population-level genetic variation in the mitochondrial uncoupling protein 1 (UCP1) gene region, which is implicated in non-shivering thermogenesis (NST), in 73 wild savanna monkeys from three taxa representing this southern expansion (Chlorocebus pygerythrus hilgerti, Chlorocebus cynosuros and Chlorocebus pygerythrus pygerythrus) ranging from Kenya to South Africa. We found 17 SNPs with extended haplotype homozygosity consistent with positive selective sweeps, 10 of which show no significant linkage disequilibrium with each other. Phylogenetic generalized least-squares modelling with ecological covariates suggests that most derived allele frequencies are significantly associated with solar irradiance and winter precipitation, rather than overall low temperatures. This selection and association with irradiance are demonstrated by a relatively isolated population in the southern coastal belt of South Africa. We suggest that sunbathing behaviours common to savanna monkeys, in combination with the strength of solar irradiance, may mediate adaptations to thermal stress via NST among savanna monkeys. The variants we discovered all lie in non-coding regions, some with previously documented regulatory functions, calling for further validation and research.

    Methods

    Study Populations

    Data for this project were generated by the International Vervet Research Consortium from a sample of 163 savanna monkeys (Chlorocebus spp.) from 6 closely related taxa, captured as part of a larger sampling effort in 11 countries across Africa and the Caribbean, with particularly extensive sampling in South Africa (Jasinska et al., 2013; Svardal et al., 2017; Turner et al., 2019). We focused on 73 individuals from 3 vervet taxa: Chlorocebus pygerythrus hilgerti, Chlorocebus cynosuros, and Chlorocebus pygerythrus pygerythrus. In this paper, we adhere roughly to taxonomic distinctions made by Groves (2001), although recent nuclear and mitochondrial genomic evidence both indicate a relatively deep division, suggesting a species-level distinction between C. p. hilgerti and C. p. pygerythrus (Svardal et al., 2017; Dolotovskaya et al., 2017), while evidence of gene flow between these two taxa and C. cynosuros suggests that the latter is nested within the larger C. pygerythrus clade (as first proposed by Dandelot, 1959). A taxonomic revision may be in order for the genus (Svardal et al., 2017), but is not within the scope of this paper. Each taxon was further subdivided into distinct populations, when possible, based on inferences from the whole-genome phylogeny (see below).

    Sequence Data

    Whole genome sequences were previously generated by collaborators at the McDonnell Genome Institute at Washington University in St. Louis (Warren et al., 2015; Svardal et al., 2017), and analyzed in this study using a publicly available variant call format (VCF) file generated at the Gregor Mendel Institute (Svardal et al., 2017). We used tabix (Li, 2011) in HTSlib v1.10.2 and vcftools (Danecek et al., 2011) to isolate a ~28 kb gene region around UCP1 (ChlSab1.1 positions 7:87,492,195-87,502,665), including 10 kb upstream and downstream of the coding region itself to capture potential cis-regulatory regions.

    Whole-Genome Phylogeny

    We used SNPRelate v. 1.24.0 (Zheng et al., 2012) to generate FASTA alignments from the VCF file, and phangorn (v2.5.5; Schliep, 2011), phytools (v0.6-99; Revell, 2012), and geiger (v2.0.6.4; Pennell et al., 2014) to construct a neighbor-joining tree with a Jukes-Cantor mutation model representing whole-genome phylogenetic relationships for all 163 individuals in our original sample. We pruned this phylogeny to drop tips not represented in the study population sample.

    Population Structure

    We used discriminant analyses of principal components (DAPCs) in adegenet v.2.1.2 (Jombart, 2008; Jombart & Ahmed, 2011) and calculated the Fixation Index (FST) using hierfstat v.0.04-22 (Goudet, 2005) to model population structure. To statistically validate the patterns noted, we ran analyses of molecular variance (AMOVA) in poppr v. 2.8.6 (Kamvar et al., 2014). In both DAPC and FST analyses we used sample population as a grouping variable, while in AMOVA we used population nested within taxon. In the FST analysis, we used the Nei87 setting, based on Nei’s (1987) method, to assess genetic distance between populations. We also ran principal component and admixture analysis using LEA v. 3.0.0 (Frichot & François, 2015), using the snmf function with standard settings to visualize entropy criterion values to define K levels of population differentiation. To model isolation by distance, we assessed the correlation between genetic and geographic distance matrices in our sample using Mantel tests implemented in vegan v. 2.5-6, using the Pearson method (Oksanen et al., 2019). We constructed our genetic distance matrix in poppr, also using Nei’s method (Nei, 1978), and our geographic distance matrix from GPS points associated with trapping locations using raster v. 3.4-5 (Hijmans & van Etten, 2012).

    Assessing Selection

    We calculated Hardy-Weinberg Equilibrium (HWE) values with pegas (v. 0.12; Paradis, 2010) to identify candidate loci potentially experiencing selective forces, both within the whole southern expansion and in local populations. We assessed linkage disequilibrium (LD) with gpart v.1.6.0 (Kim and Yoo, 2020), using standard BigLD settings and the r2 method (to avoid the “ceiling effect” common to using D’ in small samples; Marroni et al., 2011). SNPs considered to be in LD with each other (r2 > 0.85) were subsumed into a single representative locus for downstream analyses. We calculated site-frequency spectrum statistics for the whole gene region including Tajima’s D, and Fu and Li’s D* and F* (PopGenome v.2.7.5; Pfeifer et al., 2014, pegas v.0.12; Paradis, 2010), as well as a sliding window Tajima’s D in vcftools with a window size of 500 bp. We compared UCP1 regional values of Tajima’s D and Fu and Li’s D* and F* to those from a sample of 1000 random, non-overlapping regions of equivalent size from vervet chromosome 7 to assess relative significance. To calculate integrated haplotype scores (iHS) and assess extended haplotype homozygosity (EHH) both across the whole gene region and for selected UCP1 loci, respectively, we inferred the ancestral allele sequence for our sample population using the program Est-sfs v.2.03 using Kimura’s mutation model (Keightley & Jackson, 2018). We chose the rhesus macaque reference (Macaca mulatta; BCM Mmul_8.0.1/rheMac8) as our outgroup, which we downloaded using biomaRt v. 2.44.4 (Durinck et al., 2009; Durinck et al., 2005). We aligned it to the vervet reference genome (Chlorocebus sabaeus; Chlorocebus_sabeus 1.1/chlSab2) using rMSA v. 0.99.0 (Hahsler & Manguy, 2020) and MAFFT v. 7.467 (Katoh & Standley, 2013) using standard settings and trimmed the macaque reference to the vervet extent visually in JalView v. 2.11.1.3 (Waterhouse et al., 2009). Ancestral alleles were assigned to the major vervet allele when the probability assigned by Est-sfs was above 0.70, and to the minor allele if below 0.30. When the probability was between these benchmarks, we assigned the ancestral allele to the allele shared by the two outgroups if it matched the population allele; when there was no concordance between the two outgroups (n = 5), we chose the Chlorocebus reference allele. We used rehh v.3.0.1 (Gautier et al., 2017) to estimate iHS and EHH using standard settings, but with a frequency bin of 0.15 to calculate iHS. We used a significance threshold of 2 for absolute iHS values (Voight et al., 2006) with a window size of 3000 bp and an overlap of 300 bp. We used a significance threshold value of 1.3 for considering individual SNPs for inclusion in further analyses.

    Ecological Covariates

    We used GPS coordinates recorded at each trapping location to download altitude, annual mean temperature, mean temperature of the coldest month, winter precipitation levels, and mean temperature of the wet season, among other ecological covariates, for each population for the 10-year period from 2005-2010 from the WorldClim2 online database (Fick & Hijmans, 2017). We also collected data on mean annual solar irradiance (measured in MJ/m2/day) for these points from the 10-year period from 2005-2010, originally generated by the NASA Langley Research Center (LaRC) POWER Project funded through the NASA Earth Science/Applied Science Program, using nasapower 3.0.1 (Sparks, 2018). We standardized all covariates using z-scores, and then used the package PerformanceAnalytics v. 2.0.4 (Peterson et al., 2020) to reduce strongly correlated covariates.

    Modeling Allele Frequency by Ecological Covariates

    We used phylogenetic generalized least squares (PGLS) regression, implemented in the package nlme v. 3.1-150 (Pinheiro et al., 2021), to model variation in derived allele frequency of each target locus by geoclimatic variables including latitude, elevation, insolation/irradiation, annual mean temperature, mean temperature of the coldest month, and mean winter temperature. We incorporated our phylogenomic tree into our models using a Brownian correlation structure to account for average genetic distance across populations. We then used an information theoretic approach (Burnham & Anderson, 2002) to assess ecological covariate inclusion for each locus putatively experiencing selection, using the lowest Akaike Information Criterion modified for small sample sizes (AICc) to select the appropriate model.
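As a rough illustration of the AICc-based model selection described above, the following Python sketch applies the standard small-sample correction, AICc = AIC + 2k(k+1)/(n-k-1) with AIC = 2k - 2 ln L. The candidate model names, log-likelihoods, parameter counts, and number of populations are hypothetical and are not taken from the study.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Akaike Information Criterion with the small-sample correction."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc is undefined when n_obs <= n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


# Hypothetical candidates for one locus: name -> (log-likelihood, n parameters).
candidates = {
    "irradiance": (-12.4, 3),
    "irradiance + winter_precip": (-10.1, 4),
    "annual_mean_temp": (-14.8, 3),
}
n_populations = 12  # hypothetical number of observations

scores = {name: aicc(ll, k, n_populations) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # the lowest AICc is preferred
```

With these made-up numbers the uncorrected AIC would favour the two-covariate model, while the small-sample penalty tips the choice to the simpler one, which is the reason AICc is used when the number of populations is small.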

  12. Data from: Migration trajectories of the diamondback moth Plutella...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 18, 2023
    Cite
    Ming‐Zhu Chen; Li‐Jun Cao; Bing‐Yan Li; Jin‐Cui Chen; Ya‐Jun Gong; Ary Anthony Hoffmann; Shu‐Jun Wei (2023). Migration trajectories of the diamondback moth Plutella xylostella in China inferred from population genomic variation [Dataset]. http://doi.org/10.5061/dryad.79cnp5htc
    Available download formats: zip
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    The University of Melbourne
    Beijing Academy of Agricultural and Forestry Sciences
    Authors
    Ming‐Zhu Chen; Li‐Jun Cao; Bing‐Yan Li; Jin‐Cui Chen; Ya‐Jun Gong; Ary Anthony Hoffmann; Shu‐Jun Wei
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    China
    Description

    BACKGROUND:

    The diamondback moth (DBM), Plutella xylostella (Lepidoptera: Plutellidae), is a notorious pest of cruciferous plants. In temperate areas, annual populations of DBM originate from adult migrants. However, the source populations and migration trajectories of immigrants remain unclear. Here, we investigated migration trajectories of DBM in China with genome-wide single nucleotide polymorphisms (SNPs) genotyped using double-digest RAD (ddRAD) sequencing. We first analyzed patterns of spatial and temporal genetic structure among southern source and northern recipient populations, then inferred migration trajectories into northern regions using discriminant analysis of principal components (DAPC), assignment tests and spatial kinship patterns.

    RESULTS:

    Temporal genetic differentiation among populations was low, indicating sources of recipient populations and migration trajectories are stable. Spatial genetic structure indicated three genetic clusters in the southern source populations. Assignment tests linked northern populations to the Sichuan cluster, and central-eastern populations to the South and Yunnan clusters, indicating that Sichuan populations are sources of northern immigrants and South and Yunnan populations are sources of central-eastern populations. First-order (full-sib) and second-order (half-sib) kin pairs were always found within populations, but about 35-40% of third-order (cousin) pairs were found in different populations. Closely related individuals in different populations were in about 35-40% of cases found at distances of 900 to 1500 km, while some were separated by over 2000 km.

    CONCLUSION:

    This study unravels seasonal migration patterns in the DBM. We demonstrate how careful sampling and population genomic analyses can be combined to help understand cryptic migration patterns in insects.

    Methods

    Specimen collection and DNA extraction

    DBM were sampled from potential source population locations in the annual breeding area of southern China. DBM were collected from cabbage and oilseed rape fields, and all sampling was completed before the first observations of DBM in northern China between March and May [1, 2]. In order to reduce the likelihood of sampling siblings within populations, third- and fourth-instar larvae of DBM were collected from about 20 sites at each sampling location, each at least 10 m apart. Putative immigrant male adults were collected in northern China by sex pheromone trapping before the presence of first-generation larvae. Trapping of male DBM was conducted in unplanted fields with no greenhouses within 500 m, to reduce the likelihood of trapping individuals overwintering in protected conditions. The distance between traps was at least 50 m. The development of one generation of DBM takes about 30 days in early spring [3]. This strategy therefore restricted sampling of genetically related individuals to within three generations between source and recipient populations, and reduced the influence of genomic admixture between immigrants from different sources. This sampling was conducted in 2017 and again in 2018, to examine annual variation in migratory trajectories and temporal variation in population genetic structure. In total, samples were collected from 16 locations in 2017 and 17 locations in 2018, and in 2018 four locations were sampled across multiple months (Fig. 1, Table 1). Twenty individuals from each population (specimens collected at different times from the same location were considered as different populations) were used for genotyping. Genomic DNA for library preparation was extracted from individual specimens using DNeasy Blood and Tissue Kit (Qiagen, Germany).

    SNP genotyping

    The ddRAD libraries were prepared following a published protocol [4] for identifying SNPs. Briefly, 120 ng of extracted genomic DNA from each sample was digested by the restriction enzymes NlaIII and AciI (New England Biolabs, USA) [5]. The 50 μL digestion reaction was run for 3 hours at 37 °C, followed by DNA cleaning using 1.5× volume of AMPure XP beads (Beckman Coulter, USA) instead of a heat kill step. Next, we ligated each sample to adapters barcoded with a combinatorial index at 16 °C overnight in a 40 μL ligation reaction, labeling each population with a 6-bp index and each individual with a unique 9-bp barcode. After ligation, we pooled uniquely barcoded samples into multiplexed libraries. Fragments between 380-540 bp were selected using BluePippin and a 2% gel cassette (Sage Sciences, USA). Finally, the pooled libraries were enriched with 12 amplification cycles on a Mastercycler Nexus Thermal Cycler (Eppendorf, Germany). PCR products were cleaned with 0.8× volume of beads. We used Qubit 3.0 (Life Invitrogen, USA) and Agilent 2100 Bioanalyzer (Agilent Technology, USA) to check the concentration and size distribution of enriched libraries, respectively. Pooled libraries were sequenced on an Illumina HiSeq 2500 platform to obtain 150-bp paired-end reads, at BerryGenomics Company (Beijing, China). The Stacks v2.3 pipeline [6] was used to call SNPs, linking to the DBM genome (GenBank assembly accession: GCA_000330985.1) as reference [7]. FastQC v 0.11.5 was employed to assess read quality and check for adapter contamination [8]. Sequence data were demultiplexed and trimmed using process_radtags in Stacks v2.3 [6, 9]. Low quality reads with a Phred score below 20 were removed as well as any reads with an uncalled base. Reads were trimmed to 140 bp in length. The remaining paired-end reads were aligned to the DBM genome [7] using Bowtie v2.3.5 [10]. Output reads for all individuals were imported into Stacks pipeline ref_map.pl to call SNPs, requiring a minimum of three identical reads to create a stack. SNPs were called using a maximum likelihood statistical model. Finally, we obtained a catalog with all possible loci and alleles. The exported loci were present in all populations, and in at least 75% of individuals per population. The exported SNPs for populations that were collected in both years were further filtered using the R package vcfR [11] and VCFtools v0.1.16 [12] with the following criteria: SNPs with sequencing depth ≤ 3 and in the highest 0.1% depth were removed, as were SNPs with missingness in all samples ≥ 0.05 and those with minimum minor allele count ≤ 20. An additional data matrix was generated by retaining only SNPs separated by at least 500 bp, to reduce linkage among SNPs.

    Genetic diversity, population structure and assignment tests

    Global population differentiation was estimated using Weir and Cockerham’s FST with 99% confidence intervals (1000 bootstraps) in diveRsity version 1.9.90. Pairwise FST for all population pairs was estimated using GenePop version 4.7.2 [13]. Discriminant analysis of principal components (DAPC) was performed in the R package adegenet v2.1.1 [14], with the optimal number of clusters determined by the Akaike information criterion (AIC). Assignment tests were performed in assignPOP v1.1.7 [15]. Source groups of ST (south) and SW (southwest, this group was divided into YN and SC groups in 2018) (see Table 1 and Fig. 1 for locations) were trained using the support vector machine algorithm to build predictive models. For training, we used either 25, 28, or 32 random individuals (2017 samples) or 13, 15 or 17 random individuals (2018 samples) from each group, and loci with the highest 60%, 80% or 100% FST values. Monte-Carlo cross-validation was performed by resampling each training set combination 1000 times. The ratio of assignment probability between the most-likely and second most-likely assigned groups was calculated for each individual [16]. When an individual showed an assignment ratio smaller than 2 in more than 30% of the resampling analyses, it was considered unstable and removed in subsequent training. This allowed us to remove individuals from source populations that are not similar enough to other individuals in that source population, thus leaving a set of source populations each comprised of individuals distinctive from those in other populations. Immigrants from the CE (central) and NT (north) regions (see Table 1 and Fig. 1 for locations) were assigned to the trained groups using the support vector machine algorithm.

    Kinship analysis

    As a complement to assignment tests (but focusing on the individual level rather than the population level), we investigated spatial patterns of kinship within and between populations. Related individuals were identified following the method of Jasper, Schmidt, Ahmad, Sinkins and Hoffmann [17]. First, Loiselle’s K was calculated for all individual pairs using SPAGeDi [18]. Kinship coefficients represent the probability that any allele scored in both individuals is identical by descent, with theoretical mean K values for each kinship category as follows: full-siblings = 0.25, half-siblings = 0.125, full-cousins = 0.0625, half-cousins = 0.0313, second-cousins = 0.0156 and unrelated = 0. To allocate pairs of individuals to relatedness categories across three orders of kinship, maximum-likelihood estimation in the program ML-Relate [19] was used to identify first-order (full-sibling) and second-order (half-sibling) pairs. The K scores of pairs within the full-sibling and half-sibling data sets were used to calculate standard deviations for these categories. Using the theoretical means and standard deviations of K, we randomly sampled 100,000 simulated K scores from each kinship category. In the initial pool of 40755 pairings (2017) and 89676 pairings (2018), ML-Relate identified 33 (2017) and 36 (2018) full-sibling and half-sibling pairs. Assuming that the data contained twice as many first cousin (full and half) pairings as sibling (full and half) pairings, and twice as many second cousin pairings as first cousin pairings, final sampling distributions were developed as follows: 100,000 unrelated, 320 second-cousins, 80 full-cousins, 80 half-cousins, 40
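The assignment-ratio stability filter described in the assignment-test methods above (a ratio below 2 in more than 30% of resamplings marks an individual as unstable) can be sketched in Python as follows. This is an illustrative reading of the rule, not the assignPOP implementation; the function names and the example probabilities are hypothetical.

```python
def assignment_ratio(probs):
    """Ratio of the highest to the second-highest assignment probability."""
    top, second = sorted(probs, reverse=True)[:2]
    return float("inf") if second == 0 else top / second


def is_unstable(resampled_probs, min_ratio=2.0, max_fraction=0.30):
    """True if the ratio falls below min_ratio in more than max_fraction of resamplings."""
    low = sum(assignment_ratio(p) < min_ratio for p in resampled_probs)
    return low / len(resampled_probs) > max_fraction


# Hypothetical per-resampling assignment probabilities for two individuals
# and two source groups.
ambiguous = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.8, 0.2]]
confident = [[0.9, 0.1]] * 9 + [[0.6, 0.4]]

flag_ambiguous = is_unstable(ambiguous)
flag_confident = is_unstable(confident)
```

In this toy case the first individual has a ratio below 2 in half of its resamplings and would be dropped from training, while the second falls below 2 only once in ten and would be kept.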

  13. Cell and gene data for testicular single-cell RNA-Seq

    • figshare.com
    xlsx
    Updated Sep 18, 2019
    Cite
    Soeren Lukassen; Elisabeth Bosch; Arif B. Ekici; Andreas Winterpacht (2019). Cell and gene data for testicular single-cell RNA-Seq [Dataset]. http://doi.org/10.6084/m9.figshare.6139469.v2
    Available download formats: xlsx
    Dataset updated
    Sep 18, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Soeren Lukassen; Elisabeth Bosch; Arif B. Ekici; Andreas Winterpacht
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tables containing additional information on genes and cells obtained from single-cell RNA-Seq analysis of mouse testis data. SuppTableCells.xlsx contains the 10X Barcode as identifier, the replicate ID, position in the t-SNE plot, UMI and gene count per cell, the proportion of mitochondrial transcripts, cluster ID obtained by k-means clustering (k=9), inferred cell type, and pseudotime information obtained using monocle and Scrat. SuppTableGenes contains the average expression value, fold-change compared to other cell types, and p-value relative to other cell types for each gene that was expressed in the dataset. Throughout the data, the following abbreviations for cell types are used: Spg=spermatogonia, SC=spermatocytes, RS=round spermatids, ES=elongating spermatids, CS=condensed/condensing spermatids. In cases where several clusters were identified per cell type, the earlier cluster was designated as 1.

  14. WilliamsAT_prePMID_HRI.tsv.gz

    • figshare.com
    application/gzip
    Updated Sep 15, 2021
    Cite
    WilliamsAT_prePMID_HRI.tsv.gz [Dataset]. https://figshare.com/articles/dataset/WilliamsAT_prePMID_HRI_tsv_gz/16622062
    Available download formats: application/gzip
    Dataset updated
    Sep 15, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexander Williams; Nick Shrine; Hardeep Naghra-van Gijzel; Joanna Betts; Edith Hessel; Catherine John; Richard Packer; Nicola Reeve; Astrid Yeo; Erik Abner; Bjørn Olav Åsvold; Juha Auvinen; Traci Bartz; Yuki Bradford; Ben Brumpton; Archie Campbell; Michael Cho; Su Chu; David Crosslin; QiPing Feng; Tõnu Esko; Sina Gharib; Caroline Hayward; Scott Hebbring; Kristian Hveem; Marjo-Riitta Jarvelin; Gail Jarvik; Sarah Landis; Eric Larson; Jiangyuan Liu; Ruth Loos; yuan luo; Arden Moscati; Hana Mullerova; Bahram Namjou; David Porteous; Jennifer Quint; Regeneron Genomics Center; Marylyn Ritchie; Eeva Sliz; Ian Stanaway; Laurent Thomas; James Wilson; Ian Hall; Louise Wain; David Michalovich; Martin Tobin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These summary statistics were derived from a genome-wide association study of hospitalised respiratory infections in UK Biobank.

    The file includes data for 52,488,101 unique genetic variants and contains the following columns: variant_id, chromosome, base_pair_location (GRCh37), effect_allele, other_allele, effect_allele_frequency, beta (log-odds ratio), standard_error, p_value.
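Because beta is reported as a log-odds ratio, each row can be converted to an odds ratio with an approximate 95% confidence interval. The Python sketch below assumes the file is tab-delimited with a header row matching the column names above; the two sample rows and the function name are hypothetical and are not taken from the dataset.

```python
import csv
import io
import math

# Hypothetical two-row sample using the column names listed above.
SAMPLE = (
    "variant_id\tchromosome\tbase_pair_location\teffect_allele\tother_allele\t"
    "effect_allele_frequency\tbeta\tstandard_error\tp_value\n"
    "rs0000001\t1\t1000000\tA\tG\t0.25\t0.05\t0.02\t0.0124\n"
    "rs0000002\t2\t2000000\tT\tC\t0.40\t-0.10\t0.03\t0.00086\n"
)


def odds_ratios(handle, z=1.96):
    """Yield the variant ID, odds ratio and approximate 95% CI for each row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        beta = float(row["beta"])
        se = float(row["standard_error"])
        yield {
            "variant_id": row["variant_id"],
            "odds_ratio": math.exp(beta),
            "ci_low": math.exp(beta - z * se),
            "ci_high": math.exp(beta + z * se),
        }


results = list(odds_ratios(io.StringIO(SAMPLE)))
```

For the real gzip-compressed file, a text-mode handle from `gzip.open(path, "rt")` could be passed in place of the in-memory sample; streaming row by row avoids loading all 52 million variants at once.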

    This study used anonymised data from UK Biobank, a cohort of 500,000 volunteer participants aged between 40 and 69 years recruited from across the United Kingdom (UK) between 2006 and 2010. This research was conducted under approved UK Biobank data applications 648 and 4892. Individuals were included in our analysis if (1) they had genome-wide imputed genetic data; (2) they had complete information for age (at recruitment), sex and smoking status (at recruitment); (3) they had no 2nd degree or closer relative (defined by a kinship estimate >0.0884 from the KING software, provided by UK Biobank); and (4) they were of European ancestry based on k-means clustering of the first two principal components of ancestry. Overall, we included 19,459 hospitalised respiratory infection cases and 101,438 healthy controls in our study.

    Genotyping was undertaken using the Affymetrix Axiom UK BiLEVE and UK Biobank arrays. Genotype imputation was conducted using the Haplotype Reference Consortium (HRC) panel and the merged 1000 Genomes phase 3 and UK10K panels. Imputed genotypes with a minor allele count >20 and an imputation quality score >0.5 were tested for association with hospitalised respiratory infections.

    PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/) was used to conduct the genome-wide association study. We assessed autosomal variant associations under an additive genetic model adjusted for age (at recruitment), age², genotyping array, sex, smoking status and the first ten principal components of ancestry. We analysed variant dosages in order to account for genotype uncertainty.

    Further details and results are described in the manuscript: Genome-wide association study of susceptibility to hospitalised respiratory infections. Wellcome Open Research.

  15. f

    Inference results of Boolean network (Best-Fit with optimal discretization...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Maria Simak; Chen-Hsiang Yeang; Henry Horng-Shing Lu (2023). Inference results of Boolean network (Best-Fit with optimal discretization k-means and maxK = 2), Dynamic Bayesian Network (with least squares estimation, and default parameters alpha1 = 0.5 and alpha2 = 0.05 for first order dependencies and full dependencies correspondingly) and Boolean Function Network (maxTimeDelay = 10 and p1 = 0.005). [Dataset]. http://doi.org/10.1371/journal.pone.0185475.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Maria Simak; Chen-Hsiang Yeang; Henry Horng-Shing Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Inference results of Boolean network (Best-Fit with optimal discretization k-means and maxK = 2), Dynamic Bayesian Network (with least squares estimation, and default parameters alpha1 = 0.5 and alpha2 = 0.05 for first order dependencies and full dependencies correspondingly) and Boolean Function Network (maxTimeDelay = 10 and p1 = 0.005).

  16. f

    Data from: Successful Identification of Duck Genome Region Determining...

    • scielo.figshare.com
    jpeg
    Updated May 30, 2023
    Cite
    A Dobek; E Gornowicz; K Moliński; B Grajewski; M Lisowski; T Szwaczkowski (2023). Successful Identification of Duck Genome Region Determining Desirable Uniformity of Meat Performance Traits [Dataset]. http://doi.org/10.6084/m9.figshare.20009783.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 30, 2023
    Dataset provided by
    SciELO journals
    Authors
    A Dobek; E Gornowicz; K Moliński; B Grajewski; M Lisowski; T Szwaczkowski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The objective of this study was to identify genome regions determining duck meat performance traits with possible small variation. In total, 368 crossbred ducks of F2 generation obtained from two parental lines: Pekin-type ducks of Polish origin (A55) and Pekin-type ducks of French origin (GL-30) were recorded. The following seven traits were analyzed: body weight, breast muscle weight, leg muscle weight, water holding capacity in the breast and leg muscles, and color lightness L* of the breast and leg muscles. All birds (including parental and F1 generations) were genotyped (29 microsatellite markers). Means and coefficients of variation (CV) were calculated for 28 full-sibs (four sires by six dams and one sire by four dams). Number of progeny per full-sib group ranged from 7 to 17. The multivariate cluster analysis using grouping by k-means algorithm was used on transformed data. The multivariate cluster analysis gave two clusters: first group with 10 full-sibs and second one with 18 families. Differences among half-sibs in the CV of the recorded traits were determined. It should be noted that one out of five sire groups showed statistically significant differences from the other ones. Moreover, the CVs in this group were smaller. The analysis of microsatellite markers indicated three alleles from three loci were present only in the “superior” sire group. The obtained results indicate a promising opportunity of effective selection for improving carcass technological quality using molecular markers.

  17. f

    Table4_A Pyroptosis-Related Signature Predicts Overall Survival and...

    • figshare.com
    xlsx
    Updated Jun 10, 2023
    Cite
    Kaibin Zhu; An Yan; Fucheng Zhou; Su Zhao; Jinfeng Ning; Lei Yao; Desi Shang; Lantao Chen (2023). Table4_A Pyroptosis-Related Signature Predicts Overall Survival and Immunotherapy Responses in Lung Adenocarcinoma.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.891301.s009
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Kaibin Zhu; An Yan; Fucheng Zhou; Su Zhao; Jinfeng Ning; Lei Yao; Desi Shang; Lantao Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Lung adenocarcinoma (LUAD) is a highly malignant cancer with a bleak prognosis. Pyroptosis is crucial in LUAD. The present study investigated the prognostic value of a pyroptosis-related signature in LUAD. Methods: LUAD’s genomic data were downloaded from TCGA and GEO databases. K-means clustering was used to classify the data based on pyroptosis-related genes (PRGs). The features of tumor microenvironment were compared between the two subtypes. Differentially expressed genes (DEGs) were identified between the two subtypes, and functional enrichment and module analysis were carried out. LASSO Cox regression was used to build a prognostic model. Its prognostic value was assessed. Results: In LUAD, genetic and transcriptional changes in PRGs were found. A total of 30 PRGs were found to be differentially expressed in LUAD tissues. Based on PRGs, LUAD patients were divided into two subgroups. Subtype 1 has a higher overall survival rate than subtype 2. The tumor microenvironment characteristics of the two subtypes differed significantly. Compared to subtype 1, subtype 2 had strong immunological infiltration. Between the two groups, 719 DEGs were discovered. WGCNA used these DEGs to build a co-expression network. The network modules were analyzed. A prognostic model based on seven genes was developed, including FOSL1, KRT6A, GPR133, TMPRSS2, PRDM16, SFTPB, and SFTA3. The developed model was linked to overall survival and response to immunotherapy in patients with LUAD. Conclusion: In LUAD, a pyroptosis-related signature was developed to predict overall survival and treatment responses to immunotherapy.

  18. f

    Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
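Two of the normalization procedures named in the abstract above, counts per million (CPM) and Upper Quartile (UQ), can be illustrated for a single sample. This is a simplified sketch rather than the authors' pipeline: the abstract does not state which UQ variant was used, so the choice to take the 75th percentile of the nonzero counts is an assumption, and the function names and toy counts are hypothetical.

```python
import statistics


def counts_per_million(counts):
    """Scale one sample's raw counts so that they sum to one million."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]


def upper_quartile_scale(counts):
    """Divide one sample's counts by the 75th percentile of its nonzero counts.

    Assumption: zeros are excluded before taking the quartile; published
    implementations differ in such details.
    """
    nonzero = sorted(c for c in counts if c > 0)
    upper_quartile = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / upper_quartile for c in counts]


# Toy counts for five genes in one sample, not taken from the dataset.
toy_counts = [0, 10, 20, 30, 40]
print(counts_per_million(toy_counts))
print(upper_quartile_scale(toy_counts))
```

CPM corrects only for total sequencing depth, whereas UQ scaling is less sensitive to a few very highly expressed genes dominating the total.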

  19. f

    Additional file 5 of Emergence and influence of sequence bias in...

    • springernature.figshare.com
    xlsx
    Updated Aug 13, 2024
    Cite
    Margarita V. Brovkina; Margaret A. Chapman; Matthew L. Holding; E. Josephine Clowney (2024). Additional file 5 of Emergence and influence of sequence bias in evolutionarily malleable, mammalian tandem arrays [Dataset]. http://doi.org/10.6084/m9.figshare.26612219.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    figshare
    Authors
    Margarita V. Brovkina; Margaret A. Chapman; Matthew L. Holding; E. Josephine Clowney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 5. Gene-wise data, including isochore assignment, k-means cluster, local sequence characteristics, variant rates.

  20. f

    Table_3_A Ten-N6-Methyladenosine (m6A)-Modified Gene Signature Based on a...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    Wei Huang; Gen Li; Zihang Wang; Lin Zhou; Xin Yin; Tianshu Yang; Pei Wang; Xu Teng; Yajuan Feng; Hefen Yu (2023). Table_3_A Ten-N6-Methyladenosine (m6A)-Modified Gene Signature Based on a Risk Score System Predicts Patient Prognosis in Rectum Adenocarcinoma.xlsx [Dataset]. http://doi.org/10.3389/fonc.2020.567931.s004
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Frontiers
    Authors
    Wei Huang; Gen Li; Zihang Wang; Lin Zhou; Xin Yin; Tianshu Yang; Pei Wang; Xu Teng; Yajuan Feng; Hefen Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objectives: The study aims to analyze the expression of N6-methyladenosine (m6A)-modified genes in rectum adenocarcinoma (READ) and identify reliable prognostic biomarkers to predict the prognosis of READ. Materials and Methods: RNA sequence data of READ and corresponding clinical survival data were obtained from The Cancer Genome Atlas (TCGA) database. N6-methyladenosine (m6A)-modified genes in READ were downloaded from the “m6Avar” database. Differentially expressed m6A-modified genes in READ stratified by different clinicopathological characteristics were identified using the “limma” package in R. Protein-protein interaction (PPI) network and co-expression analysis of differentially expressed genes (DEGs) were performed using “STRING” and Cytoscape, respectively. Principal component analysis (PCA) was done using R. In addition, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were used to functionally annotate the differentially expressed genes in different subgroups. Univariate Cox regression analyses were conducted to identify the powerful independent prognostic factors in READ associated with overall survival (OS). A robust likelihood-based survival model was built using the “rbsurv” package to screen for survival-associated signature genes. The Support Vector Machine (SVM) was used to predict the prognosis of READ through the risk score of survival-associated signature genes. Correlation analyses were carried out using GraphPad Prism 8. Results: We screened 974 differentially expressed m6A-modified genes among four types of READ samples. Two READ subgroups (group 1 and group 2) were identified by K-means clustering according to the expression of DEGs. The two subgroups were significantly different in overall survival and pathological stages. Next, 118 differentially expressed genes between the two subgroups were screened and the expression of 112 genes was found to be related to the prognosis of READ. Next, a panel of 10 survival-associated signature genes including adamtsl1, csmd2, fam13c, fam184a, klhl4, olfml2b, pdzd4, sec14l5, setbp1, tmem132b was constructed. The signature performed very well for prognosis prediction, time-dependent receiver-operating characteristic (ROC) analysis displaying an area under the curve (AUC) of 0.863, 0.8721, and 0.8752 for 3-year survival rate, prognostic status, and pathological stage prediction, respectively. Correlation analysis showed that the expression levels of the 10 m6A-modified genes were positively correlated with that of the m6A demethylases FTO and ALKBH5. Conclusion: This study identified potential m6A-modified genes that may be involved in the pathophysiology of READ and constructed a novel gene expression panel for READ risk stratification and prognosis prediction.
