CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset has been created using the open-source code released by LNDS (Luxembourg National Data Service). It is meant to be an example of the dataset structure anyone can generate and personalize in terms of a few fixed parameters, including the sample size. The file format is .csv, and the data are organized with individual profiles in rows and their personal features in columns. The information in the dataset has been generated from statistical information about the age-structure distribution, the population counts of municipalities, the number of different nationalities present in Luxembourg, and salary statistics per municipality. The STATEC platform, the statistics portal of Luxembourg, is the public source we used to gather the real information ingested into our synthetic generation model. Other features, such as date of birth, social matricule, first name, surname, ethnicity, and physical attributes, have been derived through logical relationships between variables, without exploiting any additional real information. In compliance with the law, the risk of identifying a real person, even by chance, is kept close to zero.
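As an illustration of the generation scheme this description implies, the sketch below samples profiles from placeholder marginal distributions and derives dependent fields logically. The municipalities, weights, and salary model are invented stand-ins, not the STATEC statistics or the LNDS code.

```python
# A minimal sketch, NOT the LNDS generator: statistics-driven fields are sampled
# from placeholder distributions; dependent fields are derived logically.
import csv
import numpy as np

rng = np.random.default_rng(0)
SAMPLE_SIZE = 1000  # the user-tunable fixed parameter mentioned above

# Placeholder municipality shares and salary parameters (not STATEC figures).
municipalities = ["Luxembourg", "Esch-sur-Alzette", "Differdange"]
muni_weights = [0.5, 0.3, 0.2]

with open("synthetic_profiles.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["age", "year_of_birth", "municipality", "annual_salary_eur"])
    for _ in range(SAMPLE_SIZE):
        age = int(rng.integers(0, 100))                     # stand-in age-structure draw
        municipality = rng.choice(municipalities, p=muni_weights)
        salary = round(float(rng.lognormal(10.8, 0.4)), 2)  # placeholder salary model
        writer.writerow([age, 2024 - age, municipality, salary])  # birth year derived from age
```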
We illustrate the utility of expert elicitation, explicit recognition of uncertainty, and the value of information for directing management and research efforts for invasive species, using tegu lizards (Salvator merianae) in southern Florida as a case study. We posited a post-birth-pulse matrix model, which was parameterized using a 3-point process to elicit estimates of tegu demographic rates from herpetology experts. We fit statistical distributions for each parameter and for each expert, then drew and pooled a large number of replicate samples from these to form a distribution for each demographic parameter. Using these distributions, we generated a large sample of matrix models to infer how the tegu population might respond to control efforts. We used the concepts of Pareto efficiency and stochastic dominance to conclude that targeting older age classes at relatively high rates appears to have the best chance of minimizing tegu abundance and control costs. Expert opinion combined with an explicit consideration of uncertainty can be valuable for conducting an initial assessment of the effort needed to control the invader. The value of information can be used to focus research in a way that not only helps increase the efficacy of control but also minimizes costs.
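The pooling-and-projection workflow described above can be sketched compactly. In the sketch below, the 3-point elicitations, the triangular fits, the 2-age-class matrix, and every numeric value are hypothetical placeholders chosen for illustration; they are not the study's elicited rates or its actual model structure.

```python
# A minimal sketch of elicitation pooling plus matrix-model projection,
# under invented placeholder values (not the study's estimates).
import numpy as np

rng = np.random.default_rng(0)

def pooled_draws(elicitations, n_per_expert=5000):
    """Fit a triangular distribution to each expert's (min, mode, max)
    elicitation, draw from each, and pool the draws."""
    return np.concatenate([rng.triangular(lo, mode, hi, n_per_expert)
                           for lo, mode, hi in elicitations])

# Hypothetical 3-point elicitations from three experts per parameter.
s_juv = pooled_draws([(0.20, 0.35, 0.50), (0.15, 0.30, 0.45), (0.25, 0.40, 0.55)])
s_ad = pooled_draws([(0.60, 0.75, 0.90), (0.55, 0.70, 0.85), (0.65, 0.80, 0.95)])
fec = pooled_draws([(4.0, 10.0, 18.0), (5.0, 12.0, 20.0), (3.0, 8.0, 15.0)])

def lam(sj, sa, f, harvest_adult=0.0):
    """Dominant eigenvalue of a 2-stage post-birth-pulse matrix with the
    adult class removed at the given rate."""
    sa_h = sa * (1.0 - harvest_adult)
    A = np.array([[0.0, sa_h * f],   # adults that survive (and escape removal) breed
                  [sj, sa_h]])
    return np.max(np.abs(np.linalg.eigvals(A)))

# Probability (over pooled parameter uncertainty) that removing 40% of
# adults each year pushes the population into decline (lambda < 1).
lams = np.array([lam(*p, harvest_adult=0.4) for p in zip(s_juv, s_ad, fec)])
print(f"P(lambda < 1) = {(lams < 1.0).mean():.2f}")
```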
Open Access

# Data and R code used in: Plant geographic distribution influences chemical defenses in native and introduced Plantago lanceolata populations

## Description of the data and file structure

* 00_ReadMe_DescriptonVariables.csv: A list with the description of the variables in each file used.
* 00_Metadata_Coordinates.csv: A dataset that includes the coordinates of each Plantago lanceolata population used.
* 00_Metadata_Climate.csv: A dataset that includes coordinates, bioclimatic parameters, and the results of the PCA. The dataset was created with the script '1_Environmental variables.qmd'.
* 00_Metadata_Individuals.csv: A dataset that includes general information about each plant individual. Information about root traits and chemistry is missing for four samples because the samples were lost.
* 01_Datset_PlantTraits.csv: Size-related and resource-allocation traits measured on Plantago lanceolata, and herbivore damage.
* 02_Dataset_TargetedCompounds.csv: Quantification of phytohormones, iridoid glycosides, verbascoside, and flavonoids in the leaves and roots of Plantago lanceolata. Data generated by HPLC.
* 03_Dataset_Volatiles_Area.csv: Area of identified volatile compounds. Data generated by GC-FID.
* 03_Dataset_Volatiles_Compounds.csv: Information on identified volatile compounds. Data generated by GC-MS.
* 04_Dataset_Metabolome_Negative_Metadata.txt: Metadata for files in negative mode.
* 04_Dataset_Metabolome_Negative_Intensity.xlsx: Intensity of the metabolite features in negative mode. The file was generated by Metaboscape and adapted as required for the Notame package.
* 04_Dataset_Metabolome_Negative_Intensity_filtered.xlsx: File generated after preprocessing of features in negative mode. During preprocessing with the Notame package, zeros were converted to NA.
* 04_Dataset_Metabolome_Negative.msmsonly.csv: Intensity of the metabolite features in negative mode with MS/MS data. File generated by Metaboscape.
* 04_Results_Metabolome_Negative_canopus_compound_summary.tsv: Feature classification. Results generated by the Sirius software.
* 04_Results_Metabolome_Negative_compound_identifications.tsv: Feature identification. Results generated by the Sirius software.
* 05_Dataset_Metabolome_Positive_Metadata.txt: Metadata for files in positive mode.
* 05_DatasetMetabolome_Positive_Intensity.xlsx: Intensity of the metabolite features in positive mode. File generated by Metaboscape and adapted as required for the Notame package.
* 05_Dataset_Metabolome_Positive_Intensity_filtered: File generated after preprocessing of features in positive mode. During preprocessing with the Notame package, zeros were converted to NA.

## Code/Software

* 1_Environmental variables.qmd: R script to retrieve bioclimatic variables based on the coordinates of each population, then perform a principal components analysis to reduce the axes of variation; the first principal component is included as an explanatory variable in our model to estimate trait differences between native and introduced populations. Figures 1b and 1d.
* 2_PlantTraits_and_Herbivory: R script for statistical analysis of size-related traits, resource-allocation traits, and herbivore damage. Figure 2. It needs to source: Model_1_Function.R, Model_2_Function.R, Plots_Function.R.
* 3_Metabolome: R script for statistical analysis of the Plantago lanceolata metabolome. Figure 3. It needs to source: Metabolome_preprocessing.R, Model_1_Function.R, Model_2_Function.R, Plots_Function.R.
* 4_TargetedCompounds: R script for statistical analysis of Plantago lanceolata targeted compounds. Figure 4. It needs to source: Model_1_Function.R, Model_2_Function.R, Plots_Function.R.
* 5_Volatilome: R script for statistical analysis of the Plantago lanceolata volatilome. Figure 5. It needs to source: Model_1_Function.R, Model_2_Function.R, Plots_Function.R.
* Model_1_Function.R: Function to run statistical models.
* Model_2_Function.R: Function to run statistical models.
* Plots_Function.R: Function to plot graphs.
* Metabolome_preprocessing.R: Script to preprocess features.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).
Key Definitions
Aggregation
Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes.
Anonymisation
Anonymised data is a type of information sanitisation in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy.
Dataset
Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Determinand
A constituent or property of drinking water which can be determined or estimated.
DWI
Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”
DWI Determinands
Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.
Granularity
Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours.
ID
Abbreviation for Identification, referring to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.
LSOA
Lower Layer Super Output Area, made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. LSOAs are designed to be of similar population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households, allowing for granular data collection useful for analysis, planning, and policy-making while ensuring privacy.
ONS
Office for National Statistics
Open Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
Sample
A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.
Schema
Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.
Units
Standard measurements used to quantify and compare different physical quantities.
Water Quality
The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.
Data History
Data Origin
These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
Is it more useful to share results as averages or as individual results?
We decided to share individual results, the lowest level of granularity.
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
· Water Supply Zone (WSZ): limits interoperability with other datasets
· Postcode: some postcodes contain very few households and may not offer the necessary anonymisation
· Postal Sector: deemed not granular enough in highly populated areas
· Rounded Co-ordinates: not a recognised standard and may cause overlapping areas
· MSOA: deemed not granular enough
· LSOA: agreed as a recognised standard appropriate for England and Wales
· Data Zones: agreed as a recognised standard appropriate for Scotland
Data Specifications
Each dataset will cover a calendar year of samples
This dataset will be published annually
Historical datasets will be published as far back as 2016, from the introduction of The Water Supply (Water Quality) Regulations 2016
The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.
Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.
Some samples are tested on site and others are sent to scientific laboratories.
Data Publish Frequency
Annually
Data Triage Review Frequency
Annually unless otherwise requested
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
2. LSOA (England and Wales) and Data Zone (Scotland)
3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)
4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)
5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicting the timing of phenological events is important in agriculture, especially for high-revenue products. A project sponsored by USDA-ARS had the objective of adapting a previously developed model, which estimates the proportions of insects in different development stages as a function of temperature (degrees) and time (days), to predict bloom in almond orchards. Data for the model normally form a two-way table of counts, with rows corresponding to sample percentages of different development stages and columns to sampling times. In this study, we report a technique developed to estimate sample sizes for multinomial and product-multinomial models using a method of moments, and we determine the empirical coverage of the resulting sample sizes. This study aims to determine an appropriate sample size for data collection. This involves establishing a sampling distribution for the Pearson statistic, defined as the product of the sample size and the deviance of the empirical proportions from the population proportions. The intended outcome is to predict the optimal timing for harvesting crops at desired development stages when coupled with the phenology model, for which the variability of the maximum likelihood estimates depends on sample size.
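In standard notation, with n the multinomial sample size, p̂_i the empirical proportion in stage i, and π_i the corresponding population proportion over k stages, the statistic described above takes the familiar Pearson form (a reconstruction consistent with the description, not a formula quoted from the study):

$$ X^2 = n \sum_{i=1}^{k} \frac{(\hat{p}_i - \pi_i)^2}{\pi_i} $$

Under the hypothesized proportions, X² is asymptotically chi-squared with k − 1 degrees of freedom, which is what makes its sampling distribution usable for choosing n.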
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of phylogenetic path models produced (A) by the approach of Santos [32] using the implementation in ‘piecewiseSEM’, and (B) by the approach of von Hardenberg and Gonzalez-Voyer [33] using the implementation in ‘phylopath’. For each model, the table shows the number of independence claims (k), the number of parameters (q), Fisher’s C-statistic (C) for model fit, and its associated p-value. AICc and CICc are the Akaike and C-statistic information criteria, respectively, corrected for small sample sizes. ΔAICc (and ΔCICc) indicates the difference in AICc (or CICc) values between the most supported model (lowest AICc or CICc; model m1.b) and the focal model. ΔAICc (and ΔCICc) > 2 indicates substantially higher support for the best-supported model.
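For reference, the quantities in the table follow the standard d-separation definitions (stated here from the general literature, not quoted from the source): with p_i the p-value of the i-th of k independence claims and n the sample size,

$$ C = -2 \sum_{i=1}^{k} \ln(p_i), \qquad \mathrm{CICc} = C + 2q\,\frac{n}{n - 1 - q}, $$

where C is asymptotically chi-squared with 2k degrees of freedom under a correctly specified model, and q counts the estimated parameters.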
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The unprecedented worldwide pandemic caused by COVID-19 has motivated several research groups to develop machine-learning approaches that aim to automate the diagnosis or screening of COVID-19 at large scale. The gold standard for COVID-19 detection, quantitative real-time polymerase chain reaction (qRT-PCR), is expensive and time-consuming. Alternatively, haematology-based detection is fast and near-accurate, although less explored. The external validity of haematology-based COVID-19 predictions on diverse populations is yet to be fully investigated. Here we report the external validity of machine-learning prediction scores from haematological parameters recorded in different hospitals of Brazil, Italy, and Western Europe (raw sample size: 195,554). The XGBoost classifier performed consistently better than the other six ML classifiers on all the datasets. The working models include a set of either four or fourteen haematological parameters. The internal performances of the XGBoost models (AUC scores ranging from 84% to 97%) were superior to ML models reported in the literature for some of these datasets (AUC scores ranging from 84% to 87%). The meta-validation of the external performances revealed reliable performance (AUC score 86%) along with good accuracy of the probabilistic predictions (Brier score 14%), particularly when the model was trained and tested on fourteen haematological parameters from the same country (Brazil). The external performance was reduced when the model was trained on datasets from Italy and tested on Brazil (AUC score 69%) and Western Europe (AUC score 65%), presumably affected by factors such as ethnicity, phenotype, immunity, and reference ranges across the populations. The contribution of the present study is the development of a COVID-19 prediction tool that is reliable and parsimonious, using fewer haematological features than the earlier meta-validated study, based on a sufficient sample size (n = 195,554). Thus, the current models can be applied at other demographic locations, preferably with prior training of the model on the same population. Availability: https://covipred.bits-hyderabad.ac.in/home; https://github.com/debashreebanerjee/CoviPred.
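The evaluation pattern described (train a gradient-boosted classifier, then report AUC and Brier score) can be sketched as follows; the synthetic features below merely stand in for the fourteen haematological parameters, and nothing here reproduces the study's datasets or tuning.

```python
# A minimal sketch of XGBoost training scored with AUC and the Brier score,
# on synthetic placeholder data (not the study's hospital datasets).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 14))  # stand-in for 14 haematological parameters
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]          # probabilistic predictions
print(f"AUC:   {roc_auc_score(y_te, prob):.3f}")   # discrimination
print(f"Brier: {brier_score_loss(y_te, prob):.3f}")  # calibration / accuracy of probabilities
```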
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC) which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
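As a compact sketch of the pooling compromise described above (generic notation, not the exact covariates or priors of these scripts), a two-level hierarchical regression draws group-specific intercepts from a shared upper-level distribution:

$$ y_{ij} = \alpha_{j} + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + \varepsilon_{ij}, \qquad \alpha_{j} \sim \mathcal{N}(\mu_{\alpha}, \sigma_{\alpha}^{2}), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_{y}^{2}) $$

As σ_α → 0 this collapses to the fully pooled fixed-effects model (one common intercept for all groups), while σ_α → ∞ recovers the unpooled model (each group estimated separately); intermediate values partially pool the groups toward μ_α.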
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coalescent theory is routinely used to estimate past population dynamics and demographic parameters from genealogies. While early work in coalescent theory only considered simple demographic models, advances in theory have allowed for increasingly complex demographic scenarios to be considered. The success of this approach has led to coalescent-based inference methods being applied to populations with rapidly changing population dynamics, including pathogens like RNA viruses. However, fitting epidemiological models to genealogies via coalescent models remains a challenging task, because pathogen populations often exhibit complex, nonlinear dynamics and are structured by multiple factors. Moreover, it often becomes necessary to consider stochastic variation in population dynamics when fitting such complex models to real data. Using recently developed structured coalescent models that accommodate complex population dynamics and population structure, we develop a statistical framework for fitting stochastic epidemiological models to genealogies. By combining particle filtering methods with Bayesian Markov chain Monte Carlo methods, we are able to fit a wide class of stochastic, nonlinear epidemiological models with different forms of population structure to genealogies. We demonstrate our framework using two structured epidemiological models: a model with disease progression between multiple stages of infection and a two-population model reflecting spatial structure. We apply the multi-stage model to HIV genealogies and show that the proposed method can be used to estimate the stage-specific transmission rates and prevalence of HIV. Finally, using the two-population model we explore how much information about population structure is contained in genealogies and what sample sizes are necessary to reliably infer parameters like migration rates.
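As a concrete (and deliberately toy) illustration of the particle-filter-within-MCMC strategy described above, the sketch below runs particle-marginal Metropolis-Hastings on a stochastic SIR model. Note one simplification: the likelihood here targets Poisson-observed case counts rather than a genealogy via the structured coalescent, and the model, priors, and tuning constants are all illustrative assumptions, not the paper's framework.

```python
# A minimal particle-marginal Metropolis-Hastings (PMMH) sketch on a toy
# stochastic SIR model; everything here is an illustrative assumption.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
N_POP = 1000  # closed-population size

def sir_step(S, I, beta, gamma):
    """One day of binomial-chain SIR dynamics (vectorised over particles)."""
    new_inf = rng.binomial(S, 1.0 - np.exp(-beta * I / N_POP))
    new_rec = rng.binomial(I, 1.0 - np.exp(-gamma))
    return S - new_inf, I + new_inf - new_rec

def particle_loglik(y, beta, gamma, n_part=200, i0=10):
    """Bootstrap particle filter estimate of log p(y | beta, gamma)."""
    S = np.full(n_part, N_POP - i0)
    I = np.full(n_part, i0)
    loglik = 0.0
    for yt in y:
        S, I = sir_step(S, I, beta, gamma)
        w = poisson.pmf(yt, I)              # observation weights
        if w.sum() == 0.0:
            return -np.inf                  # all particles inconsistent with data
        loglik += np.log(w.mean())
        idx = rng.choice(n_part, size=n_part, p=w / w.sum())  # resample
        S, I = S[idx], I[idx]
    return loglik

def pmmh(y, n_iter=2000, step=0.15):
    """Random-walk PMMH on log(beta), log(gamma), flat priors on the log scale."""
    log_theta = np.log([0.3, 0.1])
    ll = particle_loglik(y, *np.exp(log_theta))
    chain = np.empty((n_iter, 2))
    for it in range(n_iter):
        prop = log_theta + step * rng.normal(size=2)  # symmetric proposal
        ll_prop = particle_loglik(y, *np.exp(prop))
        if np.log(rng.uniform()) < ll_prop - ll:      # accept on noisy likelihood
            log_theta, ll = prop, ll_prop
        chain[it] = np.exp(log_theta)
    return chain

# Simulate 30 days of data at (beta, gamma) = (0.4, 0.15), then try to recover them.
S, I, y = N_POP - 10, 10, []
for _ in range(30):
    S, I = sir_step(S, I, 0.4, 0.15)
    y.append(rng.poisson(I))
chain = pmmh(np.array(y))
print("posterior means (beta, gamma):", chain[len(chain) // 2:].mean(axis=0))
```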
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
S1Data-simulated.tar.gz - These are simulated data evaluating prediction accuracy for different traits, numbers of markers, and numbers of samples under different testing scenarios.
Other files are example results of the SelectML methods. These are the combined results of the optimise and predict scripts. The compressed folders are named by the simulated dataset that they correspond to and the model.
The trait comes first (e.g. A1 means an additive trait with 1 causal marker), N1000 means 1000 samples (in the training population), M1000 means 1000 markers sampled, "_CAUSAL" means that the causal loci were included in the sampled markers (note that for M1000 this means all sampled markers had a genuine effect), and the final section before ".tar.gz" indicates the model used (e.g. sgd, xgb, BGLR). A sketch of parsing these names follows below.
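The naming convention can be parsed mechanically. The sketch below assumes underscore separators throughout (an inference from the "_CAUSAL" token, not something stated explicitly) and is not part of the released SelectML code.

```python
# A small, hypothetical parser for the archive-name convention described above.
import re

NAME_RE = re.compile(
    r"^(?P<trait>[A-Z]+\d+)_N(?P<n_samples>\d+)_M(?P<n_markers>\d+)"
    r"(?P<causal>_CAUSAL)?_(?P<model>[A-Za-z]+)\.tar\.gz$"
)

def parse_name(fname):
    m = NAME_RE.match(fname)
    if m is None:
        raise ValueError(f"unrecognised archive name: {fname}")
    d = m.groupdict()
    return {
        "trait": d["trait"],                  # e.g. "A1": additive, 1 causal marker
        "n_samples": int(d["n_samples"]),     # training-population size
        "n_markers": int(d["n_markers"]),     # markers sampled
        "causal_included": d["causal"] is not None,
        "model": d["model"],                  # e.g. "sgd", "xgb", "BGLR"
    }

print(parse_name("A1_N1000_M1000_CAUSAL_xgb.tar.gz"))
```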
Inside each of these compressed folders are the following files. From the selectml optimise command:
regression_*_best.json - the best-performing combination of hyperparameters for these data and this model type.
regression_*_results.tsv - the Optuna running logs, showing sampled parameters and the average mean squared error (across cross-validated samples) of models from each parameter set.
regression_*_full_results.tsv - like _results.tsv but includes other statistics relevant to the task, such as Pearson's correlation.
And from selectml predict:
regression_*_model.pkl - a stored version of the trained model, given the best parameters from optimise, trained on the complete training dataset.
regression_*_predictions.tsv - predicted results for all training datasets.
regression_*_stats.tsv - summary statistics (e.g. MSE, Pearson's correlation) for the model in different test populations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(*) The HCPDV and HCPAG datasets were collected by the same data acquisition centers. We account for this when computing the total number of scanners in the data.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Population genomics analysis holds great potential for informing conservation of endangered populations. We focused on a controversial case of European whitefish (Coregonus spp.) populations. The endangered North Sea houting is the only coregonid fish that tolerates oceanic salinities and was previously considered a species (C. oxyrhinchus) distinct from European lake whitefish (C. lavaretus). However, no firm evidence for genetically based salinity adaptation has been available. Also, studies based on microsatellite and mitogenome data suggested surprisingly recent divergence (ca. 2,500 years BP) between houting and lake whitefish. These data types have furthermore provided no evidence on possible inbreeding. Finally, a controversial taxonomic revision recently classified all whitefish in the region as C. maraena, calling conservation priorities of the houting into question. We used whole-genome and ddRAD sequencing to analyze six lake whitefish populations and the only extant indigenous houting population. Demographic inference indicated postglacial expansion and divergence between lake whitefish and houting occurring not long after the Last Glaciation, implying deeper population histories than previous analyses. Runs of homozygosity (ROH) analysis suggested high inbreeding (FROH up to 30.6%) in some freshwater populations, but also FROH up to 10.6% in the houting, prompting conservation concerns. Finally, outlier scans provided evidence for adaptation to high salinities in the houting. Applying a framework for defining conservation units based on current and historical reproductive isolation and adaptive divergence led us to recommend that the houting be treated as a separate conservation unit regardless of species status. In total, the results underscore the potential of genomics to inform conservation practices, in this case clarifying conservation units and highlighting populations of concern.

Methods

A. Sampling and DNA extraction

Samples of lake whitefish were collected in 1995-2012 from five locations in Denmark: brackish populations in the Ringkoebing fjord (RIN) and Nissum fjord (NIS), two lagoons connected with the North Sea, and freshwater populations from Lake Flynder (FLYN), Lake Glenstrup (GLEN) and Lake Nors (NORS); and a brackish population from one German location, Achterwasser (ACHT), a lagoon flowing into the Baltic Sea. Houting were collected from the single extant population in Vidaa (VID), a river with outlet into the Wadden Sea (Fig. 1A). Sampling was conducted by electrofishing (VID) and net fishing (remaining populations). Tissue samples consisted of adipose fin clips stored in ethanol at -20°C. DNA was extracted using either a phenol-chloroform method (Taggart et al., 1992) (ACHT, FLYN) or the E.Z.N.A.® Tissue DNA Kit (OMEGA, Bio-tek, CA, USA) following the manufacturer's recommendations (the remaining samples). In total, 35 individuals were whole-genome sequenced and 95 were ddRAD-sequenced (Table 1). A group of 23 individuals occurs in both data sets and was consequently both ddRAD- and whole-genome sequenced.

B. Whole-genome sequencing, mapping, and variant calling

Library construction (using insert size ~300 bp) and whole-genome sequencing were outsourced to BGI (Beijing Genomics Institute, Hong Kong, China). Paired-end Illumina sequencing was conducted using the Illumina HiSeq 2500 platform with a read length of 150 bp. The sequence reads were mapped to the Coregonus sp. “Balchen” Alpine whitefish reference genome (De-Kayne et al.
(2020); GenBank accession: GCA_902810595.1) using BWA-MEM v.0.7.17 (Li, 2013; Li & Durbin, 2009a) with default parameters. SAM format files were sorted, indexed, and converted into BAM files using SAMtools v.1.9 (Danecek et al., 2021). Variants were called using the BCFtools v.1.2 (Danecek et al., 2021) functions mpileup and call with a minimum mapping quality requirement of 20. We used ‘--multiallelic-caller’ for calling, combined with ‘--variants-only’ to output only variant sites. To produce an ‘all sites’ data set containing both monomorphic and polymorphic sites, we repeated the SNP calling process without the ‘--variants-only’ parameter in BCFtools call.

C. WGS data set generation

We filtered the resulting VCF file containing variant sites with VCFutils.pl (Li et al., 2009b) and VCFtools v.0.1.16 (Danecek et al., 2011) to remove indels, monomorphic sites, multi-allelic SNPs, and SNPs with a variant quality <20 or extreme depth of coverage (lower than 400 or higher than 1000 across all individuals) determined from the coverage distribution of SNPs (Fig. S1). The bimodal coverage distribution with two distinct peaks suggested the presence of paralogous loci, a well-known issue in salmonid fishes due to their tetraploid origin. In addition to excluding the variants in the higher coverage peak, which was centered at approximately twice the depth of the lower peak and thus likely represented duplicated regions, we also used VCFtools to discard SNPs located within putative duplicated genomic regions identified by De-Kayne et al. (2020). Furthermore, as loci with an excess of heterozygotes can also represent duplicated genomic regions, we removed SNPs out of Hardy-Weinberg equilibrium (HWE) in one or more populations using a custom R script (https://github.com/shenglin-liu/VCF_HWF). Tests for HWE were conducted using the statistic FIS√n (Brown, 1970), where FIS is Wright’s fixation index within populations and n is the sample size. The statistic follows a standard normal distribution with a mean of 0 and a standard deviation of 1. Negative values denote heterozygote excess and positive values heterozygote deficit, and absolute values > 1.96 are significant at the 5% level. The effects of the individual filtering steps are detailed in Supplementary Table S1. The resulting data set, hereafter referred to as the ‘HW-filtered WGS data set’, contained 16,898,181 SNPs. Additionally, we produced an ‘LD-pruned WGS data set’ with the addition of 5 individuals of the alpine whitefish species C. arenicolus (AREN) as an outgroup (Extended methods S1) by pruning SNPs on the basis of linkage disequilibrium (LD) in the HW-filtered WGS data set. Pruning was performed with the indep-pairwise function in PLINK v.1.9 (Purcell et al., 2007), where SNPs with r2>0.1 were removed from sliding windows of 50 SNPs with 10 SNPs of overlap. A total of 596,078 SNPs remained after pruning. The ‘all sites’ data set was filtered to remove indels, sites with extreme depth of coverage or located in putative duplicated regions, and SNPs not in HWE, as detailed for the ‘variant sites’ data set above. No filtering for minor allele frequency or missing data was performed. After filtering, the VCF contained 1,181,919,736 sites, with individuals exhibiting between x and y % missing genotypes.

D. Filtering for ROH analyses

We opted to further filter the HW-filtered WGS data set to ensure only the most reliable genotype calls were retained. Following the protocol implemented in Balboa et al.
(2024), we estimated mappability of the genome assembly with GENMAP v.1.3.0 (Pockrandt et al., 2020) using 100 bp k-mers and allowing for up to two mismatches, and we identified repetitive elements in the assembly with RepeatMasker v.4.1.2 (Smit et al., 2013) using ‘rmblast’ as the search engine and ‘Actinopterygii’ (ray-finned fishes) as the query species. Repeat regions and sites with a mappability score <1 were excluded from the analyses. In addition to the extreme depth filters applied as previously described, we furthermore used VCFtools to change individual genotypes with very low (DP<10) or very high read depth (DP>40) and genotypes with low quality (GQ<30) to missing (./.). Finally, only SNPs with variant quality (QUAL) >30 and no missing data were kept, resulting in a data set containing 2,646,198 SNPs.

E. ddRAD sequencing, mapping, and loci assembly

Samples were prepared using ddRADseq (Peterson et al., 2012). The ddRADseq libraries used PstI (6-base) and MspI (4-base) restriction enzymes. Two libraries of equal size were constructed (using insert sizes of 200-500 bp) and sequenced on an Illumina HiSeq2000 platform with 100 bp paired-end reads at BGI (Hong Kong, China). Raw reads were cleaned and demultiplexed with process_radtags in Stacks v.2.55 (Catchen et al., 2011; Catchen et al., 2013), in addition to being truncated to 90 bp (-t 90). Low-quality reads (phred score < 10 over a sliding window of 15% of the read length) were discarded. Mapping of reads to the Alpine whitefish reference genome (De-Kayne et al., 2020) proceeded as described for the whole-genome sequencing data. Loci were assembled from the aligned and sorted reads using gstacks v.2.55 with default parameters.

F. ddRADseq data set generation

The populations program in Stacks (Catchen et al., 2011; Catchen et al., 2013) was used to generate a preliminary VCF file including only loci present in all six populations (-p 6; GLEN was not analyzed by ddRAD sequencing) and in at least 70 percent of individuals within each population (-r 0.7). Exports were ordered (--ordered-export) to ensure that only a single representative of each overlapping site was included. Loci out of HWE in one or more populations were filtered out using a custom R script, as previously described. Based on this data set, five individuals (two from ACHT, one from each of the populations NIS, NORS, and RIN) with more than 10% missing data were identified. We then generated a new VCF file excluding these five individuals using populations with parameters as previously stated, yielding a total of 347,397 SNPs, and a second VCF file with data analysis restricted to one random SNP per locus, yielding 141,157 SNPs. Both files were filtered to remove SNPs located within potentially duplicated regions of the genome (De-Kayne et al., 2020) and SNPs out of HWE in one or more populations, as described for the WGS data, leaving a total of 254,693 SNPs and 105,452 SNPs, respectively.
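As a small worked illustration of the heterozygote-excess test used above, the sketch below computes FIS√n for a single biallelic SNP from genotype counts. The counts are placeholders, and this is not the custom R script cited in the text.

```python
# Illustrative FIS * sqrt(n) test for HWE at one biallelic SNP.
import math

def fis_test(n_AA, n_Aa, n_aa):
    """Return (F_IS, z) for one biallelic SNP in one population."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency of A
    h_obs = n_Aa / n                  # observed heterozygosity
    h_exp = 2 * p * (1 - p)           # expected heterozygosity under HWE
    f_is = 1 - h_obs / h_exp          # Wright's fixation index
    z = f_is * math.sqrt(n)           # ~ N(0, 1) under HWE (Brown, 1970)
    return f_is, z

f, z = fis_test(30, 50, 20)  # placeholder genotype counts
print(f"F_IS = {f:.3f}, z = {z:.2f}, significant = {abs(z) > 1.96}")
```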
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Locations and summary statistics for Octopus vulgaris samples, including estimates of haplotype (h) and nucleotide (π) diversity, mismatch distribution parameters, and neutrality and demographic expansion tests based on the mitochondrial control region.
Ice streams are believed to play a major role in determining the response of their parent ice sheet to climate change, and in determining global sea level, by serving as regulators of the fresh water stored in the ice sheets. Ice streams are characterized by rapid, laterally confined flow, which makes them uniquely identifiable within the body of the more slowly and more homogeneously flowing ice sheet. But while these characteristics enable the identification of ice streams, the processes which control ice-stream motion and evolution, and the differences among ice streams in the polar regions, are only partially understood. Understanding the relative importance of lateral and basal drags, as well as the role of gradients in longitudinal stress, is essential for developing models of the future evolution of the polar ice sheets. In this project, physical-statistical models are used to explore the processes that control ice-stream flow and to compare these processes between seemingly different ice-stream systems. In particular, the Northeast Ice Stream in Greenland will be investigated. Geophysical models lie at the core of the approach but are embellished by statistical modeling of various components of variability. One important component comes from the uncertainty in observations of basal elevation, surface elevation, and surface velocity. In this project, new observational data collected using remote-sensing techniques are used. The various components, some of which are spatial, are combined hierarchically using Bayesian statistical methodology. All these are combined mathematically into a physical-statistical model that yields the posterior distributions for basal and surface elevations, surface velocity fields, and stress fields, conditional on the data. Inference based on these distributions is carried out via Markov chain Monte Carlo techniques, to obtain estimates of these unknown fields along with associated uncertainty measures.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Index of notation for parameter values, with default values, and variables used in computing an optimal sampling design for disease-free status.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model comparison and fitness parameter outputs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of the demographic attributes of the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Survey years of each country with respective weighted sample size.