https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
We used double digestion restriction site associated DNA (ddRAD) sequencing to discover SNPs in samples across a transect including a hybrid zone between Podarcis carbonelli and Podarcis carbonelli. We used P. bocagei and P. carbonelli samples from the locations at the extremes of the transect as references. We obtained a SNP dataset including all SNPs after removing loci with depth coverage <8, missing data >20%, removing loci containing more than five SNPs, and with more than 70% heterozygosity (complete dataset; 6905 SNPs, 329 individuals). Additionally, we obtained from the complete dataset two other datasets, prior to apply a missing data filter. One dataset contained loci with allele frequencies higher than 0.8 in the reference population containing only parental individuals of one species and lower than 0.2 in the reference population of the other species ("80/20" dataset; 2300 SNPs, 329 individuals); the other dataset comprised diagnostic SNPs between reference populations (diagnostic dataset; 1241 SNPs, 236 individuals) but excluding private alleles from references, i.e. excluding alleles that are not present in the populations of contact. Individuals with missing data >35% were removed from all datasets (the number of individuals reported for each dataset is after applying this filter, but note that the 80/20 and the diagnostic datasets were obtained before applying this filter to the complete dataset). Across datasets, average depth of coverage by individuals was 28 (median = 26.8, min = 12.5, max = 85.8) and by loci was 29 (median = 28.8; min = 15.6; max = 48.6). The analysis of replicate samples (four samples were replicated twice, i.e. were amplified and sequenced in independent libraries and SNP calling was performed independently) showed high levels (99.87%) of multilocus genotype replicability. Methods Samples were collected between spring and autumn of 2013 in a contact zone between Podarcis bocagei and P. carbonelli. We collected samples across a transect with 8 locations, including the hybrid zone. Sampling scheme aimed at capturing all the individuals encountered, avoiding bias towards species, sex or age. We used 20 P. bocagei and 23 P. carbonelli samples from the locations at the extremes of the transect as references. We obtained SNP datasets from ddRAD sequencing from 356 samples by preparing one library following the modifications described by Brelsford et al. (2016) to the protocols from Parchman et al. (2013), Peterson et al. (2012) and Purcell et al. (2014), and sequenced on a Illumina® HiSeq 2000. Individual raw reads were demultiplexed using the process_radtags module of Stacks version 2.2 (Catchen et al., 2013). SNP calling was performed using the Stacks pipeline, following Rochette and Catchen (2017) recommendations, by running consecutivelly ustacks (build loci), cstacks (create a catalogue of loci), sstacks (match individual samples against the catalogue), tsv2bam (transpose data) and gstacks (align each read to a locus and call SNPs) units. SNP filtering was done with populations unit from Stacks, VCFtools 0.1.15 (Danecek et al., 2011) and a custom Python script (available at https://github.com/catpinho/filter_RADseq_data).
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This data set was used to assess the climate resilience of Themeda triandra, a foundational species and the most widespread plant in Australia, by assessing the relative contributions of spatial, environmental, and ploidy factors to contemporary genomic variation. Reduced-representation genome sequencing on 472 samples from 52 locations was used to test how the distribution of genomic variation, including ploidy polymorphism, supports adaptation to hotter and drier climates. For reduced-representation library preparation and sequencing, genomic DNA from each individual was isolated from approximately 25 mg of silica-dried leaf tissue using the Stratec Invisorb DNA Plant HTS 96 kit (Invitek, Berlin, Germany). Libraries for each individual were created similarly to Ahrens et al. (2017). Briefly, extracted DNA was digested with PstI for genome complexity reduction, and ligated with a uniquely barcoded sequencing adapter pair. We then amplified each sample individually by PCR. Amplicons between 350 and 600 bp were selected from an agarose gel. The final library pool was sequenced on three Illumina NextSeq400 lanes using a 75 bp paired-end protocol on a high output flowcell at the Biomolecular Resources Facility at the Australian National University, generating approximately 864 million read pairs.
We checked the quality of the raw short-read sequencing reads with FastQC v0.10.1 (Andrews, 2010), then demultiplexed the raw reads associated with each sample’s unique combinatorial barcode using AXE v0.2.6 (Murray & Borevitz, 2018). During this step we were unable to assign 19% of the reads. Each sequence was trimmed to 64 basepairs while removing the barcodes. Read quality was assessed with trimmomatic v 0.38 (Bolger, Lohse, & Usadel, 2014) using a sliding window of 4 basepairs (the number of bases used to average quality) and a quality score of 15 (the average quality required among the sliding window), and if the average quality dropped below 15, the sequences were cut. Long-reads were indexed (Figure S2 for distribution of length and number of reads sequenced) using the BWA software and the index argument. Short-reads were aligned to the long-reads for more accurate SNP calling compared to a de novo pipeline. Short-reads were aligned using BWA-mem v 0.7.17-r1198 (Li, 2013), as paired reads, with 82.5% of reads successfully mapped. Samtools v 1.9 (Li et al., 2009) was used to transform the SAM files to BAM files for use within STACKS v 2.41 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). The argument gstacks and populations were used in that order on the BAM files to create a VCF file, minimum thresholds (minor allele frequency = 0.01; one random SNP per read was retained) were set here for further filtering in R (R core development team 2019).
The minimum missing data threshold was set to 50% per locus and individual which resulted in an average of 30% missing data from the whole SNP dataframe. Minor allele frequency was set to 0.05 to avoid identifying patterns of population structure that may be due to locally shared alleles.
Usage notes:
lfmm file for snmf analysis - 012 file represents count of minor allele. missing values = 9 |||| the individual key file is: themea_52_012v2_indkey.key
genpop file provided for input as genind object. missing values = 0000
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Crop wild relatives possess desirable traits that confer resilience to various environmental stresses. We applied landscape genomics, that associates environment with genomic variation to understand the genetic basis of their adaptation.
In this study, we applied landscape genomics to examine the differences in allele frequency of 15,416 Single Nucleotide Polymorphisms (SNPs) among 153 accessions of wild eggplant relatives from Africa, the principal hotspot of these wild relatives. Further, we explored the correlation between the genetic variations and the bio-climatic and soil conditions at their collection sites.
Our results showed that the environment has a greater impact on the genetic variation in the eggplant wild relative populations compared to the geographical distances between collection sites while controlling for population structure. These findings indicate the relevance of the environment in shaping genetic variation in eggplant relatives over time. We detected also candidate SNPs associated with ten environmental factors. Some of these SNPs signal genes involved in pathways that help with adaptation to environmental stresses such as drought, heat, cold, salinity, pests, and diseases. Methods SNP dataset According to the manufacturer's instructions, we isolated the genomic DNA from fresh leaves of five seedlings per accession using the FavorPrep Plant Genomic DNA Extraction Mini Kit (FAVORGEN). We then constructed the sequencing library following the approach of Elshire et al. (2011). Genomic DNA was quantified by Qubit and normalized to 100ng in 96-well plates. We digested the DNA samples using the restriction enzyme ApeKI and ligated them with two adapters for sequencing, followed by the polymerase chain reaction to amplify the target DNA fragments to complete the sequencing library preparation. A service provider did sequencing with the Illumina HiseqX platform in a pair-end 150bp run.For the SNP calling, we followed mainly the manual of Stacks software (Catchen et al., 2013). In short, we filtered the raw reads by quality and demultiplexed using the process radtags program. We then mapped the retained reads to the eggplant reference genome (Eggplant_V4.1.fa) (Barchi et al., 2021) using the Burrows-Wheeler Aligner (BWA) version 0.7.17 (Li & Durbin, 2009). We sorted and indexed the reads using Samtools version 1.15.1 (Li et al., 2009), after which we performed the variant calling using the gstacks and population programs in Stacks software. We further filtered the SNPs and the accessions with less than 20% missing data and a Minor Allele Frequency (MAF) > 0.05, giving the final high-quality SNP dataset comprising 15,146 SNPs. Environmental variable dataset We downloaded the grids for 19 bioclimatic variables, solar radiation, wind speed, and vapor pressure derived from WorlClim 2.1 (Fick & Hijmans, 2017) at a resolution of 2.5 minutes. The 19 bioclimatic variables were each downloaded as annual data averages between 1970 and 2000. We averaged the monthly solar radiation, wind, and vapor pressure rasters to obtain annual value rasters from this period. We downloaded the soil data from the SoilGrids database released in 2016 (https://soilgrids.org/) through ISRIC—WDC Soils (Hengl et al., 2017) at 250-meter resolution and at a depth of 15-30 cm, approximately the depth at which the eggplant roots can grow. Soil variables included nitrogen, soil organic carbon, organic carbon density, organic carbon stock, cation exchange capacity, pH, clay sand, and silt content. The soil dataset resolution was aggregated to match that of the climate data using the resample and extent functions of the raster package in R (Hijmans, 2023), ensuring they are consistent in both resolution and extent. The environmental variables for each accession with the extract function of the R raster package (Hijman, 2023) using the GIS coordinates at sampling points to obtain a full data set of all the climate and soil variables. For our modeling, we selected the environmental variables based on Variance Inflation Factors (VIFs) selecting for variables with a VIF less than 5.
Lipid content: From 2015-2019, somatic lipid content (as a percent) was measured for each fish using a battery powered, handheld microwave oscillator (Distell Model 692 Fish Fatmeter, Distell Inc.). The fatmeter emits a low-powered microwave (2 GHz, 2000 MHz, power 2 mW) that interacts with water within the somatic tissues and uses the inverse relationship between water and lipid to estimate the lipid concentration (as a percent) in the tissue (Crossin and Hinch, 2005; Kent, 1990). Fatmeter readings taken on the Research-1 setting were collected on each fish at a site that was in the epaxial muscle mass just posterior to the head (site "S1" as shown in Sitar et al., 2020). From 2012 to 2014 the fish were too small to use the fatmeter on, but a subsample (n2012=12; n2013-2014=20) of fish from each cross were lethally sampled and a muscle sample (cross-section of the epaxial muscle behind the head and anterior to the dorsal fin) was excised and used for lipid analysis by Soxhlett extraction as described in Goetz et al. (2014) and Sitar et al. (2020). Pedigree: Parentage of all individuals was assigned using genotypes from a six-microsatellite marker panel amplified and genotyped for all parents and offspring using previously designed primers for Loci SnaMSU 01, 03, 06, 10, 11, and 12 (Rollins et al., 2009). Pedigrees were reconstructed using Colony2 Version 2.0.6.5 and a full-likelihood approach specifying maternal polygamy without inbreeding and no sibship prior (Jones & Wang, 2010). Genotype data: Restriction site associated DNA sequencing (RAD-seq) was conducted on 74 parents and 542 F1 offspring using SbfI and bestRAD library protocols outlined in Ali et al. (2016). An initial library was sequenced at BGI America (Cambridge, MA) on one lane of a HiSeq4000 and the remainder of the libraries were sent to Novogene (Sacramento, CA) where they were sequenced on seven HiSeq4000 lanes for paired-end 150 sequencing. Identification of SNPs and genotyping were conducted in STACKS v.2.3 using the de novo assembly pipeline (Rochette et al., 2019). Samples were demultiplexed with process_radtags (flags = c, -q, -r, -t 140). Stacks of similar sequences (loci) for each individual were identified with ustacks (flags = -m 3, -M 5, -H –max_locus_stacks 4, --model_type bounded, --bound_high 0.05) and a catalog of putative loci was generated based on sequences from the parents. Individual stacks were then aligned to the catalog in sstacks, and genotypes for all putative SNPs were assigned using gstacks. Finally, a datafile containing genotypes for all SNPs with a minor allele count greater than two and all individuals was generated using populations and subsequently filtered in VCFTools (Danecek, et al., 2011). To ensure that paralogs did not influence our findings, we ran HDPlot on unfiltered data for all crosses and removed loci identified as potential paralogs (McKinney et al., 2017). Parameters for this analysis were set by visually choosing threshold values for read depth ratio and proportion of heterozygotes that identified the loci conforming to theoretical expectations for singletons (McKinney et al., 2017; supplemental figure 1). Once putative paralogs were removed, quality filtering was conducted in VCFTools hierarchically. First, all individuals with more than 80% missing data were removed from the dataset. Second, SNPs missing more than 10% of genotypes were removed. Finally, for loci that contained more than one SNP, the SNP with the highest minor-allele-frequency was retained leaving a single SNP per-locus. All sites that passed the above thresholds were retained in downstream analysis. All bioinformatic analysis was conducted using the Turing High Performance Computing cluster at Old Dominion University, Virginia. Genetic evidence of selection for complex and polygenically regulated phenotypes can easily become masked by neutral population genetic structure and phenotypic plasticity. Without direct evidence of genotype-phenotype associations, it can be difficult to conclude to what degree a phenotype is heritable or a product of environment. Common garden laboratory studies control for environmental stochasticity and help to determine the mechanism that regulates traits. Here we assess lipid content, growth, weight, and length variation in full and hybrid F1 crosses of deep and shallow water sympatric lake charr ecotypes reared for nine years in a common garden experiment. Redundancy analysis (RDA) and quantitative-trait-loci (QTL) genomic scans are used to identify associations between genotypes at 19,714 single nucleotide polymorphisms (SNPs) aligned to the lake charr genome and individual phenotypes to determine the role that genetic inheritance plays in ecotype phenotypic diversity. Lipid content, growth, length, and weight differed significantly among lake charr crosses throughout the experiment suggesting that pedigree plays a la...
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Hymenocallis henryae is a rare, charismatic spider-lily endemic to the Florida panhandle. Currently under review to determine if listing under the Endangered Species Act is warranted, this species has undescribed genetic diversity, information crucial to the listing process. We conducted field observations of 21 historic populations across the species' geographical range and performed genomic analyses of 279 individuals from 19 extant populations. Most populations had fewer than 40 individuals, while populations with >100 individuals were found exclusively on managed lands. Genetic diversity was uniformly low within populations (HE: 0.074- 0.093), with low to moderate inbreeding coefficients (FIS: 0.068-0.431). Genetic differentiation was relatively low among most populations (FST: 0–0.098), although there was statistical support for isolation by distance. In addition, we found high genetic similarity and lack of population structure across the species range. Clonal propagation through fused bulbs is a common reproductive strategy. We confirmed current threats (habitat change, residential development, fire suppression) and identified several coastal populations threatened by sea level rise. It is recommended to continue with in situ protection and management as well as establishment of ex situ living collections to preserve populations most at risk of extirpation from habitat loss and degradation. Methods DNA Isolation and Sequencing Total genomic DNA was extracted from 279 selected samples. Briefly, frozen tissue was ground using a mortar and pestle in an extraction buffer containing 100mM Tris, pH 8, 50 mM EDTA, 500 mM NaCl, and 0.1% W:V PVP-40. DNA was precipitated with 5M potassium acetate followed by an isopropanol wash and resuspension in Tris-EDTA (TE) buffer. DNA quantity was assessed using Thermo Fisher Scientific Qubit 4.0. DNA quality was assessed on a 1% agarose gel, and purified DNA was stored at -80 °C. Samples that displayed adequate quality and reached a minimum DNA concentration of 20 ng/ul were then sent to Floragenex (Floragenex, Inc, 4640 SW Macadam Ave, Portland, OR), where double‐digest restriction site associated DNA sequencing (ddRAD-Seq) was carried out. To summarize, DNA was first digested using the restriction endonucleases PstI and MseI. Samples were diluted for PCR amplification and the product was used to construct a ddRAD‐Seq library. The library was sequenced at the University of Oregon Genomics and Cell Characterization Core Facility (GC3F) on a NovaSeq 6000 with a SP100 chip, generating 118 bp single end reads with a mean 27.5x effective coverage per sample. The sequence data was run through the pipeline STACKS (version 2.60) to assemble the short‐read sequences from all the samples (via the process radtags program), and to align reads into loci that are genotyped (via the gstacks program). Single nucleotide polymorphism data was exported in VCF version 4.2 file format for downstream data analysis. Three quality cut‐off filters were applied allowing for genotypes present in 40%, 60%, or 80% of individuals. We used a dataset in which each locus was represented in at least 60% of individuals; datasets with less missing data (found in 80% of individuals) resulted in a loss of informative loci. Genetic Diversity and Clonality Assessment To assess within‐population genetic diversity, we calculated heterozygosity and inbreeding coefficients for each population using the R package hierf‐stat. To assess genetic differentiation between populations, we calculated pairwise FST for populations using the package StaMPP. To investigate isolation by distance, we ran a Mantel test for a significant relationship between pairwise FST and geographic distance between populations using the package ade4. We estimated ancestry coefficients for individuals via an sNMF analysis using the package LEA and performed a Principal component analysis (PCA) with the function ‘dudi.pca’ found in the ade4 package. Clonality was tested usingthe function ‘bitwise.dist’ in the package poppr.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
During the Anthropocene, Earth has experienced unprecedented habitat loss, native species decline, and global climate change. Concurrently, greater globalisation is facilitating species movement, increasing the likelihood of alien species establishment and propagation. There is a great need to understand what influences a species’ ability to persist or perish within a new or changing environment. Examining genes that may be associated with a species’ invasion success or persistence informs invasive species management, assists with native species preservation, and sheds light on important evolutionary mechanisms that occur in novel environments. This approach can be aided by coupling spatial and temporal investigations of evolutionary processes. Here we use the common starling, Sturnus vulgaris, to identify parallel and divergent evolutionary change between contemporary native and invasive range samples and their common ancestral population. To do this, we use reduced-representation sequencing of native samples collected recently in north-western Europe and invasive samples from Australia, together with museum specimens sampled in the UK during the mid-19th Century. We found evidence of parallel selection on both continents, possibly resulting from common global selective forces such as exposure to pollutants. We also identified divergent selection in these populations, which might be related to adaptive changes in response to the novel environment encountered in the introduced Australian range. Interestingly, signatures of selection are equally as common within both invasive and native range contemporary samples. Our results demonstrate the value of including historical samples in genetic studies of invasion and highlight the ongoing and occasionally parallel role of adaptation in both native and invasive ranges.
Methods DArTseq of 85 sturnus vulgaris samples.
Excerpt from manuscript pertaining to variant calling and filtering for attached files.
Variant calling:
We used the STACKS v2.2 pipeline to process the DArTseq raw data. We used the process_radtags function to clean the tags; discarding reads of low quality (-q), removing reads with uncalled bases (-c), and rescuing barcodes and radtags (-r). We used the Burrows-Wheeler aligner (BWA) v0.7.15 aln function to align the read data to the reference genome S. vulgaris vAU1.0. Using FastQC, we identified base sequence bias in the adapter region, and so the first five bases were trimmed (-B 5) during alignment. The reads were then processed through BWA samse and SAMTOOLS v1.10, before SNP variants were called through STACKS gstacks (default parameters) and then populations (parameter information below).
bwaaln_filter_allsample_rr_nofamily.recode.vcf file:
We generated a ‘population genetics’ variant file by running STACKS populations, filtering for a minimum per-population site call rate of 50% (-r 0.5), a minimum populations per-site of 2 (-p 2), a minimum loci log likelihood value of -15 (--lnl_lim -15), with one random SNP per tag retained (--write_random_snp). We used VCFTOOLS v0.1.16 to filter the following parameters: maximum missingness per site of 10% (-max-missing 0.9), minor allele frequency of 2.5% (MAF; --maf 0.025), minimum loci depth of 2 (--minDP 2), minimum genotype quality score of 15 (--minGQ 15) and site Hardy-Weinberg Equilibrium exact test minimum p value 0.001 (--hwe 0.001). We chose a high threshold for missingness to not bias the population genetics analysis against the historical samples, which had much higher levels of missingness than the contemporary samples. MAF filtering helps remove misreads, and HWE filtering removed highly non-neutral loci, both of which are important for capturing neutral population substructure. After filtering, we calculated individual relatedness, and closely related individuals were removed so that there was only one representative from each cluster in the final data. This resulted in a population genetic variant file of 3,840 SNPs used in the subsequent section ‘Population structure analysis’.
bwaaln_allsample_selection_histSNPs_maf025.recode.vcf file:
We generated a ‘selection’ variant file by using STACKS populations to align the raw reads for all samples (with --lnl_lim -15 --write_random_snp flags) and then used VCFTOOLS to filter out only SNPs present in at least 50% of the historical individuals (i.e. in at least 5 historical individuals), with additional quality filtering (--minGQ 15 --minDP 2), resulting in 12,219 SNP sites. Only these sites were then retained to filter the original populations variant file, along with a MAF minimum of 2.5% to remove possible sequencing errors. This produced a data set that retained only SNPs sequenced in at least half the historical individuals, which would be necessary for selection analysis.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
The roles of ‘silent’ synonymous mutations for organisms adapting to stressfully thermal environments are of fundamental biological and ecological interests but poorly understood. To study whether synonymous mutations influence the thermal adaptation of animals to heat stress at specific microhabitats, a genome-wide genotype-phenotype association analysis was carried out in the black mussels Mytilisepta virgata inhabiting different microhabitats. A synonymous mutation of Ubiquitin-specific Peptidase 15 (MvUSP15) was significantly associated with the physiological upper thermal limit of the mussel. The individuals carrying GG genotype (the G-type) at the mutant locus owned significantly lower heat tolerance compared to the individuals carrying GA and AA genotype (the A-type). Furthermore, when heated to sublethal temperature, the G-type exhibited higher inter-individual variations in the MvUSP15 expression, especially for the mussels on the sun-exposed microhabitats. Taken together, a synonymous mutation in MvUSP15 can affect the gene expression profile and interact with microhabitat heterogeneity to influence thermal resistance. This integrative study sheds light on the ecological importance of adaptive synonymous mutations as an underappreciated genetic buffer against heat stress and emphasizes the importance of integrative studies with the consideration of genetic, physiological, and environmental heterogeneity at a microhabitat scale for evaluating and predicting the impacts of climate change. Methods (a) Microhabitat-scale temperature and mortality statistics survey of M. virgata in the field The operative temperatures of M. virgata under filed conditions in the Dongshan Swire Marine Station (D-SMART), China (23.65°N, 117.49°E) were continuously recorded from July to August 2020 using biomimetic thermal loggers (electronic supplementary material). Two loggers were deployed separately in one sun-exposed and one shaded microhabitat, where both M. virgata had an abundance. In two types of microhabitats, the 99th percentile of all temperatures recorded (T99) and the average daily maximum temperature (ADM) were calculated. The former was described as ‘acute’ thermal stress, while the latter as a measure of ‘chronic’ high-temperature exposure [29]. Mortality of the mussels inhabiting at sun-exposed and shaded microhabitat in summer was also investigated using quadrat method (electronic supplementary material). (b) Genome-wide association mapping analysis ddRADseq data from our previous study (BioProject ID: PRJNA517974; BioSample accessions: SAMN10849586 - SAMN10849649) and our unpublished M. virgata genome were used for identified SNPs in 64 samples (21 collected in April, 23 in August, and 20 in December) at genome-wide level. The reads were first cleaned using fastp [30], aligned to reference with BWA-MEM algorithm [31], and sorted using SAMtools [32]. The BAM outputs were then processed via gstacks script of package Stacks [33] to create loci and identify SNPs with unpaired reads discarded (--rm-unpaired-reads). Candidate loci were then filtered and converted to VCF file using populations scripts of package Stacks. A locus was retained using a minimum number of populations of three (−p 3), a maximum number of missing samples per locus of 20% (−r 0.8) and a minor allele frequency of 0.05 (−min_maf 0.05). Filtered loci were functionally annotated with the ANNOVA [34]. Finally, the genotype-phenotype association analysis using the 48 out of 64 sample’s ABT was used as thermal-tolerant phenotypic traits was performed with PLINK [35]. Those loci with missing frequencies greater than 0.1 (--geno 0.1) and Hardy-Weinberg equilibrium exact test p-values below 1e-6 (--hwe 1e -6) were excluded, and the retained loci were used in association analysis. (c) Sequence Analysis and Phylogenetic Tree Construction of M. virgata USP15 gene The coding domain sequence (CDS) of M. virgata USP15 gene (MvUSP15) was verified through molecular experiments (electronic supplementary material) and used for subsequent analysis. The codon usage bias in MvUSP15 was analyzed using DnaSP [36]. The prediction of MvUSP15 protein domain architectures based on the deduced amino sequence from verified CDS protein was referred to methods described in the previous study [37]. The MvUSP15 protein structure was predicted using homology modelling, which was performed on the SWISS-MODEL website [38]. The MvUSP15 protein sequence and homologous sequences obtained from GenBank (table S2) were aligned using MEGA Ⅹ [39]. Evolutionary analysis and Bayesian phylogenetic tree construction were further performed using MCMC methods on BEAST [40]. Motif structure analysis for every protein sequence was conducted on the MEME website [41]. (d) Expression patterns of MvUSP15 gene A total of 187 mussels were collected in the sun-exposed and shaded habitats along a ∼500 m shoreline in D-SMART on July 31st, 2021 and acclimated for over two months as previously described [24]. After acclimation, the mussels were randomly allocated into heat-treated group or control group. The mussels in the heat-treated group were heated at a rate of 6°C h-1 in the air from the acclimation temperature (22°C) to 42°C, the sub-lethal temperature for M. virgata [24]; while the control group was kept at 22°C. After the heating process, both the heat-treated (n = 135) and control group (n = 36) were stored in liquid nitrogen. The total DNA of these mussels were extracted and used for genotyping at the focused loci (see electronic supplementary material for further details). Then qRT-PCR experiment was performed to identify the relative expression level of MvUSP15.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
The Bonin Islands, comprised of the Mukojima, Chichijima, and Hahajima Islands, are known for their isolated and distinctive habitats, hosting a diverse array of endemic flora and fauna. In these islands, adaptive radiation has played a remarkable role in speciation, particularly evident in the Callicarpa genus that is represented by three species: Callicarpa parvifolia and C. glabra exclusive to the Chichijima Islands, and Callicarpa subpubescens, distributed across the entire Bonin Islands. Notably, C. subpubescens exhibits multiple ecotypes, differing in leaf hair density, flowering time, and tree size. In this study, we aimed to investigate species and ecotype diversification patterns, estimate divergence times, and explore cryptic species within Callicarpa in the Bonin Islands using phenotypic and genetic data (double-digest restriction site–associated DNA sequencing). Genetic analysis revealed that C. parvifolia and C. glabra both formed single, distinct genetic groups. Conversely, C. subpubescens consisted of six genetic groups corresponding to different ecotypes and regions, and a hybrid group resulting from the hybridization between two of these genetic groups. Population demography analysis focusing on six Chichijima and Hahajima Islands–based species/ecotypes indicated that all species and ecotypes except one ecotype diverged simultaneously around 73–77 kya. The star-shaped neighbor-net tree also suggests the simultaneous divergence of species and ecotypes. The species and ecotypes that simultaneously diverged adapted to dry environments and understory forests, suggesting that aridification may have contributed to this process of adaptive radiation. Moreover, leaf morphology, flowering time, and genetic analyses suggested the presence of two cryptic species and one hybrid species within C. subpubescens. Methods Study species and sampling Callicarpa parvifolia grows in sunny dry dwarf scrub on rocky ground in the Chichijima Islands, whereas C. glabra grows in the understory of dry scrub in the Chichijima Islands. In contrast, C. subpubescens is not listed as a threatened species and is widely distributed in the Bonin and Volcano Islands, situated approximately 150 km southwest of the Bonin Islands. Callicarpa subpubescens exhibits different ecotypes, each with distinct habitats and some with different flowering peaks. For example, the Chichijima Islands’ ecotype (S) inhabits the forest edge of mesic forests, with peak flowering in June. The Hahajima Islands have four ecotypes: the glabrescent ecotype (SG), the tall ecotype (ST), the dwarf ecotype (SD), and the hybrid ecotype (SH). The ecotype SG inhabits the understory of mesic forests. The ecotype ST forms the canopy of tall mesic forests. The ecotype SD forms the canopy of dry scrub. The ecotype SH forms the canopy of mesic scrub (cloud forests) or inhabits the forest edge of mesic forests. The Mukojima Islands have two ecotypes (STm and Sm), which are genetically close to the ecotype ST in the Hahajima Islands and the ecotype S in the Chichijima Islands, respectively. To cover species and ecotypes of each island in the Bonin Islands, leaf samples were collected from 94 individuals across 14 populations for DNA extraction (Table 1; Fig. S1). These samples included two populations (ecotypes STm and Sm) in the Mukojima Islands, six populations from the Chichijima Islands, representing three species and ecotypes (P, G, S) from two islands, Anijima and Chichijima Islands. Additionally, six populations from the Hahajima comprised two ecotypes (SG and SD) from the two islands, Hahajima and Imoutojima Islands, and ecotypes ST and SH from Hahajima Island. As outgroups, one individual each of Callicarpa japonica and Callicarpa mollis, both of which grow in Kyoto Prefecture, mainland Japan, was also collected (Fig. S1). Silica gel was used to immediately dry leaf samples used for DNA extraction. ddRAD genotyping and SNP filtering DNA was extracted using a modified CTAB method (Milligan, 1992). The DNA samples were quantified using a Qubit 2.0 Fluorometer (Invitrogen, MA, USA) and adjusted to 12.6 ng/μl through dilution with TE buffer. Sequencing libraries were prepared following a modified version of Peterson’s protocol for ddRAD-seq (Peterson et al., 2012). For detailed library preparation methods, refer to appendix 1. The libraries were sequenced using a HiSeq2000 platform (Illumina, CA, USA) with 51 bp single-end reads at BGI Japan (Kobe, Japan). To ensure appropriate data resolution and accuracy for each specific analysis, three datasets were created: denovo, referenced, and demography datasets. SNPs were detected using dDocent (Puritz et al., 2014) and Stacks version 2.60 (Catchen et al., 2013; Catchen et al., 2011). The detection conditions and number of SNPs used in each analysis are summarized in Table S1. In all data sets, we excluded five individuals with low individual-level genotyping rates from SNP detection. In the referenced and denovo datasets, SNPs were detected using dDocent, following its tutorial. When creating the denovo dataset, the reference genome of C. subpubescens was not used as a reference sequence and the two outgroup individuals were not included. This dataset is optimal for detecting population genetic structure without reference bias. In contrast, when creating the referenced dataset, the reference genome and two outgroup individuals were used. This dataset provides more accurate SNP calling for phylogenetic analysis. Total raw SNPs generated via dDocent were filtered using vcftools -0.1.14 to meet the conditions outlined in Table S1. For the demography dataset, SNPs were re-extracted from the .bam files created for the referenced dataset using dDocent. First, gstacks from Stacks was used to generate catalogs of variable sites (Rochette et al., 2019). Subsequently, populations from Stacks were employed to extract SNPs with the following options: -r 0.8 -p X --min-mac 1 --max-obs-het 0.5 --vcf (where X represents the number of species/ecotypes in each dataset). The pairwise two-dimensional minor allele site frequency spectrum (2D-mSFS) was calculated from the .vcf file using the R script 2D-msfs-R (https://github.com/garageit46/2D-msfs-R). Missing data were addressed through bootstrapping within the same ecotype. This dataset includes non-variable sites and low-frequency SNPs, making it suitable for inferring evolutionary processes and population history.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Population genomics analysis holds great potential for informing conservation of endangered populations. We focused on a controversial case of European whitefish (Coregonus spp.) populations. The endangered North Sea houting is the only coregonid fish that tolerates oceanic salinities and was previously considered a species (C. oxyrhinchus) distinct from European lake whitefish (C. lavaretus). However, no firm evidence for genetic-based salinity adaptation has been available. Also, studies based on microsatellite and mitogenome data suggested surprisingly recent divergence (ca. 2,500 years bp) between houting and lake whitefish. These data types furthermore have provided no evidence for possible inbreeding. Finally, a controversial taxonomic revision recently classified all whitefish in the region as C. maraena, calling conservation priorities of houting into question. We used whole genome and ddRAD sequencing to analyze six lake whitefish populations and the only extant indigenous houting population. Demographic inference indicated postglacial expansion and divergence between lake whitefish and houting occurring not long after the Last Glaciation, implying deeper population histories than previous analyses. Runs of Homozygosity analysis suggested high inbreeding (FROH up to 30.6%) in some freshwater populations, but also FROH up to 10.6% in the houting prompting conservation concerns. Finally, outlier scans provided evidence for adaptation to high salinities in the houting. Applying a framework for defining conservation units based on current and historical reproductive isolation and adaptive divergence led us to recommend that the houting be treated as a separate conservation unit regardless of species status. In total, the results underscore the potential of genomics to inform conservation practices, in this case clarifying conservation units and highlighting populations of concern. Methods A. Sampling and DNA extraction Samples of lake whitefish were collected in 1995-2012 from five locations in Denmark; brackish populations in the Ringkoebing fjord (RIN) and Nissum fjord (NIS), two lagoons connected with the North Sea, and freshwater populations from Lake Flynder (FLYN), Lake Glenstrup (GLEN) and Lake Nors (NORS); and a brackish population from one German location, Achterwasser (ACHT), a lagoon flowing into the Baltic Sea. Houting were collected from the single extant population in Vidaa (VID), a river with outlet into the Wadden Sea (Fig. 1A). Sampling was conducted by electrofishing (VID) and net fishing (remaining populations). Tissue samples consisted of adipose fin clips stored in ethanol at -20°C. DNA was extracted using either a phenol-chloroform method (Taggart et al., 1992) (ACHT, FLYN) or the E.Z.N.A.® Tissue DNA Kit (OMEGA, Bio-tek, CA, USA) following the manufacturer's recommendations (the remaining samples). In total, 35 individuals were whole-genome sequenced and 95 were ddRAD-sequenced (Table 1). A group of 23 individuals occur in both data sets and were consequently both ddRAD and whole-genome sequenced. B. Whole-genome sequencing, mapping, and variant calling Library construction (using insert size ~300 bp) and whole-genome sequencing was outsourced to BGI (Beijing Genomics Institute, Hongkong, China). Paired-end Illumina sequencing was conducted using the Illumina HiSeq 2500 platform with a read length of 150 bp. The sequence reads were mapped to the Coregonus sp. “Balchen” Alpine whitefish reference genome (De-Kayne et al. (2020); GenBank accession: GCA_902810595.1) using BWA-MEM v.0.7.17 (Li, 2013; Li & Durbin, 2009a) with default parameters. SAM format files were sorted, indexed and converted into BAM files using SAMtools v.1.9 (Danecek et al., 2021). Variants were called using BCFtools v.1.2 (Danecek et al., 2021) function mpileup and call with a minimum mapping quality requirement of 20. We used the ‘--multiallelic-caller’ for calling combined with ‘--variants-only’ to output only variant sites. To produce an ‘all sites’ data set containing both monomorphic and polymorphic sites, we repeated the SNP calling process without the ‘--variants-only’ parameter in BCFtools call. C. WGS data set generation We filtered the resulting VCF file containing variant sites with VCFutils.pl (Li et al., 2009b) and VCFtools v.0.1.16 (Danecek et al., 2011) to remove indels, monomorphic sites, multi-allelic SNPs and SNPs with a variant quality <20 or extreme depth of coverage (lower than 400 or higher than 1000 across all individuals) determined from the coverage distribution of SNPs (Fig. S1). The bimodal coverage distribution with two distinct peaks suggested the presence of paralogous loci, a well-known issue in salmonid fishes due to their tetraploid origin. In addition to excluding the variants in the higher coverage peak, which was centered at approximately twice the depth of the lower peak and thus likely represented duplicated regions, we also used VCFtools to discard SNPs located within putative duplicated genomic regions identified by De-Kayne et al. (2020). Furthermore, as loci with an excess of heterozygotes can also represent duplicated genomic regions, we removed SNPs out of Hardy-Weinberg equilibrium (HWE) in one or more populations using a custom R script (https://github.com/shenglin-liu/VCF_HWF). Tests for HWE were conducted using the statistic (Brown, 1970), where is Wright’s fixation index within populations and is the sample size. The statistic follows a standard normal distribution with a mean of 0 and a standard deviation of 1. Negative values denote heterozygote excess and positive values heterozygote deficit, and values > |1.96| are significant at the 5 % level. The effects of the individual filtering steps are detailed in Supplementary Table S1. The resulting data set, hereafter referred to as the ‘HW-filtered WGS data set’, contained 16,898,181 SNPs. Additionally, we produced a ‘LD-pruned WGS data set’ with the addition of 5 individuals of the alpine whitefish species C. arenicolus (AREN) as an outgroup (Extended methods S1) by pruning SNPs on the basis of linkage disequilibrium (LD) in the HW-filtered WGS data set. Pruning was performed with the indep-pairwise function in PLINK v.1.9 (Purcell et al., 2007), where SNPs with r2>0.1 were removed from sliding windows of 50 SNPs with 10 SNPs of overlap. A total of 596,078 SNPs remained after pruning. The ‘all sites’ data set was filtered to remove indels and sites with extreme depth of coverage or located in putative duplicated regions and SNPs not in HWE, as detailed for the ‘variant sites’ data set above. No filtering for minor allele frequency or missing data was performed. After filtering, the VCF contained 1,181,919,736 sites with individuals exhibiting between x and y % missing genotypes. D. Filtering for ROH analyses We opted to further filter our the HW-filtered WGS data set to ensure only the most reliable genotype calls were retained. Following the protocol implemented in Balboa et al. (2024), we estimated mappability of the genome assembly with GENMAP v.1.3.0 (Pockrandt et al., 2020) using 100 bp k-mers and allowing for up to two mismatches, and we identified repetitive elements in the assembly with RepeatMasker v.4.1.2 (Smit et al., 2013) using ‘rmblast’ as the search engine and ‘Actinopterygii’ (ray-finned fishes) as the query species. Repeat regions and sites with a mappability score <1 were excluded from the analyses. In addition to the extreme depth filters applied as previously described, we furthermore used VCFtools to change individual genotypes with very low (DP<10) or very high read depth (DP>40) and genotypes with low quality (GQ<30) to missing (./.). Finally, only SNPs with variant quality (QUAL) >30 and no missing data were kept, resulting in a data set containing 2,646,198 SNPs. E. ddRAD sequencing, mapping, and loci assembly Samples were prepared using ddRADseq (Peterson et al., 2012). The ddRADseq libraries used PstI (6-base) and MspI (4-base) restriction enzymes. Two libraries of equal size were constructed (using insert size of 200-500 bp) and sequenced on an Illumina HiSeq2000 platform with 100 bp paired-end reads at BGI (Hong Kong, China). Raw reads were cleaned and demultiplexed with process_radtags in Stacks v.2.55 (Catchen et al., 2011; Catchen et al., 2013) in addition to being truncated to 90 bp (-t 90). Low-quality reads (phred score < 10 over a sliding window of 15% of the read length) were discarded. Mapping of reads to the Alpine whitefish reference genome (De-Kayne et al., 2020) progressed as described for the whole-genome sequencing data. Loci were assembled from the aligned and sorted reads using gstacks v.2.55 with default parameters. F. ddRADseq data set generation The populations program in Stacks (Catchen et al., 2011; Catchen et al., 2013) was used to generate a preliminary VCF file including only loci present in all six populations (-p 6; GLEN was not analyzed by ddRAD sequencing) and at least 70 percent of individuals within each population (-r 0.7). Exports were ordered (--ordered-export) to ensure that only a single representative of each overlapping site was included. Loci out of HWE in one or more populations were filtered out using a custom R script, as previously described. Based on this data set, five individuals (two from ACHT, one from each of the populations NIS, NORS, and RIN) with more than 10 % missing data were identified. We then generated a new VCF file excluding these five individuals using populations with parameters as previously stated, yielding a total of 347,397 SNPs, and a second VCF file with data analysis restricted to one random SNP per locus, yielding 141,157 SNPs. Both files were filtered to remove SNPs located within potentially duplicated regions of the genome (De-Kayne et al., 2020) and SNPs out of HWE in one or more populations as described for WGS data. A total of 254,693 SNPs and 105,452 SNPs,
Not seeing a result you expected?
Learn how you can add new datasets to our index.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
We used double digestion restriction site associated DNA (ddRAD) sequencing to discover SNPs in samples across a transect including a hybrid zone between Podarcis carbonelli and Podarcis carbonelli. We used P. bocagei and P. carbonelli samples from the locations at the extremes of the transect as references. We obtained a SNP dataset including all SNPs after removing loci with depth coverage <8, missing data >20%, removing loci containing more than five SNPs, and with more than 70% heterozygosity (complete dataset; 6905 SNPs, 329 individuals). Additionally, we obtained from the complete dataset two other datasets, prior to apply a missing data filter. One dataset contained loci with allele frequencies higher than 0.8 in the reference population containing only parental individuals of one species and lower than 0.2 in the reference population of the other species ("80/20" dataset; 2300 SNPs, 329 individuals); the other dataset comprised diagnostic SNPs between reference populations (diagnostic dataset; 1241 SNPs, 236 individuals) but excluding private alleles from references, i.e. excluding alleles that are not present in the populations of contact. Individuals with missing data >35% were removed from all datasets (the number of individuals reported for each dataset is after applying this filter, but note that the 80/20 and the diagnostic datasets were obtained before applying this filter to the complete dataset). Across datasets, average depth of coverage by individuals was 28 (median = 26.8, min = 12.5, max = 85.8) and by loci was 29 (median = 28.8; min = 15.6; max = 48.6). The analysis of replicate samples (four samples were replicated twice, i.e. were amplified and sequenced in independent libraries and SNP calling was performed independently) showed high levels (99.87%) of multilocus genotype replicability. Methods Samples were collected between spring and autumn of 2013 in a contact zone between Podarcis bocagei and P. carbonelli. We collected samples across a transect with 8 locations, including the hybrid zone. Sampling scheme aimed at capturing all the individuals encountered, avoiding bias towards species, sex or age. We used 20 P. bocagei and 23 P. carbonelli samples from the locations at the extremes of the transect as references. We obtained SNP datasets from ddRAD sequencing from 356 samples by preparing one library following the modifications described by Brelsford et al. (2016) to the protocols from Parchman et al. (2013), Peterson et al. (2012) and Purcell et al. (2014), and sequenced on a Illumina® HiSeq 2000. Individual raw reads were demultiplexed using the process_radtags module of Stacks version 2.2 (Catchen et al., 2013). SNP calling was performed using the Stacks pipeline, following Rochette and Catchen (2017) recommendations, by running consecutivelly ustacks (build loci), cstacks (create a catalogue of loci), sstacks (match individual samples against the catalogue), tsv2bam (transpose data) and gstacks (align each read to a locus and call SNPs) units. SNP filtering was done with populations unit from Stacks, VCFtools 0.1.15 (Danecek et al., 2011) and a custom Python script (available at https://github.com/catpinho/filter_RADseq_data).