18 datasets found
  1. Representative synthetic dataset of Luxembourg’s citizens

    • data.public.lu
    csv
    Updated Dec 1, 2023
    Cite
    Luxembourg National Data Service (2023). Representative synthetic dataset of Luxembourg’s citizens [Dataset]. https://data.public.lu/en/datasets/representative-synthetic-dataset-of-luxembourgs-citizens/
    Available download formats: csv(10936553), csv(108540)
    Dataset updated
    Dec 1, 2023
    Dataset authored and provided by
    Luxembourg National Data Service
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Luxembourg
    Description

    The dataset was created using the open-source code released by LNDS (Luxembourg National Data Service). It is meant as an example of the dataset structure that anyone can generate and personalize by setting a few fixed parameters, including the sample size. The file format is .csv, with individual profiles on the rows and their personal features on the columns. The information in the dataset was generated from statistical information about the age-structure distribution, the population counts per municipality, the number of different nationalities present in Luxembourg, and salary statistics per municipality. The STATEC platform, the statistics portal of Luxembourg, is the public source we used to gather the real information that we fed into our synthetic generation model. Other features, such as Date of birth, Social matricule, First name, Surname, Ethnicity, and physical attributes, were obtained from logical relationships between variables without using any additional real information. In compliance with the law, the risk of identifying a real person purely by chance is kept close to zero.
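
    As a rough illustration of the idea described above (rows are individual profiles, columns are features, and values are drawn from published marginal statistics), the following minimal sketch samples profiles from weighted categories. It is not the LNDS generator: the function name, the age bands, and all weights are hypothetical placeholders rather than STATEC figures.

```python
import random

# Hypothetical marginal distributions (placeholders, NOT real STATEC statistics).
AGE_GROUPS = {"0-17": 0.20, "18-64": 0.65, "65+": 0.15}
MUNICIPALITIES = {"Luxembourg": 0.60, "Esch-sur-Alzette": 0.25, "Differdange": 0.15}


def generate_profiles(sample_size, seed=0):
    """Draw synthetic profiles, one dict per row, from the marginal weights above."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    ages = rng.choices(list(AGE_GROUPS), weights=list(AGE_GROUPS.values()), k=sample_size)
    towns = rng.choices(list(MUNICIPALITIES), weights=list(MUNICIPALITIES.values()), k=sample_size)
    return [
        {"id": i, "age_group": age, "municipality": town}
        for i, (age, town) in enumerate(zip(ages, towns))
    ]


profiles = generate_profiles(sample_size=5)
for row in profiles:
    print(row)
```

    The sample size is the only parameter exposed here; a fuller generator would also need to model dependencies between features (for example, salary by municipality), which this sketch ignores.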

  2. Data from: Expert opinions of demographic rates of Argentine black and white...

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 29, 2025
    Cite
    U.S. Geological Survey (2025). Expert opinions of demographic rates of Argentine black and white tegus in South Florida [Dataset]. https://catalog.data.gov/dataset/expert-opinions-of-demographic-rates-of-argentine-black-and-white-tegus-in-south-florida
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Florida
    Description

    We illustrate the utility of expert elicitation, explicit recognition of uncertainty, and the value of information for directing management and research efforts for invasive species, using tegu lizards (Salvator merianae) in southern Florida as a case study. We posited a post-birth-pulse matrix model, which was parameterized using a 3-point process to elicit estimates of tegu demographic rates from herpetology experts. We fit statistical distributions for each parameter and for each expert, then drew and pooled a large number of replicate samples from these to form a distribution for each demographic parameter. Using these distributions, we generated a large sample of matrix models to infer how the tegu population might respond to control efforts. We used the concepts of Pareto efficiency and stochastic dominance to conclude that targeting older age classes at relatively high rates appears to have the best chance of minimizing tegu abundance and control costs. Expert opinion combined with an explicit consideration of uncertainty can be valuable for conducting an initial assessment of the effort needed to control the invader. The value of information can be used to focus research in a way that not only helps increase the efficacy of control, but minimizes costs as well.
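
    The pooling step described above (one distribution per expert, replicate draws from each, then a combined sample per parameter) can be sketched as follows. The description does not say which distribution family was fitted, so the triangular distribution here is an assumption chosen only because it is defined by three points; the expert values and the function name are hypothetical.

```python
import random
import statistics


def pooled_samples(expert_estimates, draws_per_expert=1000, seed=0):
    """Pool replicate draws from one triangular distribution per expert.

    expert_estimates: list of (low, most_likely, high) tuples, one per expert.
    The triangular family is an illustrative assumption, not the study's choice.
    """
    rng = random.Random(seed)
    pooled = []
    for low, most_likely, high in expert_estimates:
        # random.triangular takes (low, high, mode) in that order.
        pooled.extend(rng.triangular(low, high, most_likely) for _ in range(draws_per_expert))
    return pooled


# Hypothetical three-point estimates of an annual survival rate from three experts.
experts = [(0.2, 0.4, 0.6), (0.3, 0.5, 0.8), (0.1, 0.35, 0.7)]
samples = pooled_samples(experts)
print(len(samples), round(statistics.mean(samples), 3))
```

    Giving each expert the same number of draws weights the experts equally in the pooled distribution; unequal weights would require a different number of draws per expert.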

  3. Data and R code used in: Plant geographic distribution influences chemical...

    • repository.soilwise-he.eu
    • search.dataone.org
    Updated Jan 22, 2024
    Cite
    (2024). Data and R code used in: Plant geographic distribution influences chemical defenses in native and introduced Plantago lanceolata populations [Dataset]. http://doi.org/10.5061/dryad.5dv41nsd1
    Dataset updated
    Jan 22, 2024
    Description

    Open Access

    # Data and R code used in: Plant geographic distribution influences chemical defenses in native and introduced Plantago lanceolata populations

    ## Description of the data and file structure

    * 00_ReadMe_DescriptonVariables.csv: A list with the description of variables from each file used.
    * 00_Metadata_Coordinates.csv: A dataset that includes the coordinates of each Plantago lanceolata population used.
    * 00_Metadata_Climate.csv: A dataset that includes coordinates, bioclimatic parameters, and the results of the PCA. The dataset was created with the script '1_Environmental variables.qmd'.
    * 00_Metadata_Individuals.csv: A dataset that includes general information about each plant individual. Information about root traits and chemistry is missing in four samples since we lost the samples.
    * 01_Datset_PlantTraits.csv: Size-related and resource allocation traits measured in Plantago lanceolata, and herbivore damage.
    * 02_Dataset_TargetedCompounds.csv: Quantification of phytohormones, iridoid glycosides, verbascoside, and flavonoids in the leaves and roots of Plantago lanceolata. Data generated from HPLC.
    * 03_Dataset_Volatiles_Area.csv: Area of identified volatile compounds. Data generated from GC-FID.
    * 03_Dataset_Volatiles_Compounds.csv: Information on identified volatile compounds. Data generated from GC-MS.
    * 04_Dataset_Metabolome_Negative_Metadata.txt: Metadata for files in negative mode.
    * 04_Dataset_Metabolome_Negative_Intensity.xlsx: File with the intensity of the metabolite features in negative mode. The file was generated from Metaboscape and adapted as required for the Notame package.
    * 04_Dataset_Metabolome_Negative_Intensity_filtered.xlsx: File generated after preprocessing of features in negative mode. During preprocessing with the Notame package, zeros were converted to NA.
    * 04_Dataset_Metabolome_Negative.msmsonly.csv: File with the intensity of the metabolite features in negative mode with MS/MS data. File generated from Metaboscape.
    * 04_Results_Metabolome_Negative_canopus_compound_summary.tsv: Feature classification. Results generated from Sirius software.
    * 04_Results_Metabolome_Negative_compound_identifications.tsv: Feature identification. Results generated from Sirius software.
    * 05_Dataset_Metabolome_Positive_Metadata.txt: Metadata for files in positive mode.
    * 05_DatasetMetabolome_Positive_Intensity.xlsx: File with the intensity of the metabolite features in positive mode. File generated from Metaboscape and adapted as required for the Notame package.
    * 05_Dataset_Metabolome_Positive_Intensity_filtered: File generated after preprocessing of features in positive mode. During preprocessing with the Notame package, zeros were converted to NA.

    ## Code/Software

    * 1_Environmental vairables.qmd: R script to retrieve bioclimatic variables based on the coordinates of each population and then perform a principal components analysis to reduce the axes of variation; the first principal component was included as an explanatory variable in our model to estimate trait differences between native and introduced populations. Figures 1b and 1d.
    * 2_PlantTraits_and_Herbivory: R script for statistical analysis of size-related traits, resource allocation traits, and herbivore damage. Figure 2. It needs to source: Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R.
    * 3_Metabolome: R script for statistical analysis of the Plantago lanceolata metabolome. Figure 3. It needs to source: Metabolome_preprocessing_R, Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R.
    * 4_TargetedCompounds: R script for statistical analysis of Plantago lanceolata targeted compounds. Figure 4. It needs to source: Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R.
    * 5_Volatilome: R script for statistical analysis of the Plantago lanceolata metabolome. Figure 5. It needs to source: Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R.
    * Model_1_Function.R: Function to run statistical models.
    * Model_2_Function.R: Function to run statistical models.
    * Plots_Function.R: Function to plot graphs.
    * Metabolome_prepocessing.R: Script to preprocess features.

  4. Portsmouth Water Drinking Water Quality Data 2022 2023 2024

    • hub.arcgis.com
    • streamwaterdata.co.uk
    Updated Oct 1, 2025
    Cite
    AHughes_Portsmouth (2025). Portsmouth Water Drinking Water Quality Data 2022 2023 2024 [Dataset]. https://hub.arcgis.com/datasets/d3165fd17d624b22a9900d47677dfa45
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    AHughes_Portsmouth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

    Key Definitions

    Aggregation

    Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes

    Anonymisation

    Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy

    Dataset

    Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

    Determinand

    A constituent or property of drinking water which can be determined or estimated.

    DWI

    Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

    DWI Determinands

    Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

    Granularity

    Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

    ID

    Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

    LSOA

    A Lower-Level Super Output Area (LSOA) is a small geographic area used for statistical and administrative purposes by the Office for National Statistics. LSOAs are designed to be homogeneous in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households, allowing for granular data collection useful for analysis, planning, and policy-making while ensuring privacy.

    ONS

    Office for National Statistics

    Open Data Triage

    The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata, and Software Scripts used to process Data Assets if they are used as Open Data.

    Sample

    A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

    Schema

    Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

    Units

    Standard measurements used to quantify and compare different physical quantities.

    Water Quality

    The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

    Data History

    Data Origin

    These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

    Data Triage Considerations

    Granularity

    Is it more useful to share results as averages or as individual results?

    We decided to share individual results, which is the lowest level of granularity.

    Anonymisation

    It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

    • Water Supply Zone (WSZ) – Limits interoperability with other datasets
    • Postcode – Some postcodes contain very few households and may not offer necessary anonymisation
    • Postal Sector – Deemed not granular enough in highly populated areas
    • Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas
    • MSOA – Deemed not granular enough
    • LSOA – Agreed as a recognised standard appropriate for England and Wales
    • Data Zones – Agreed as a recognised standard appropriate for Scotland

    Data Specifications

    Each dataset will cover a calendar year of samples

    This dataset will be published annually

    Historical datasets will be published as far back as 2016, from the introduction of The Water Supply (Water Quality) Regulations 2016

    The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.

    Context

    Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from those in this dataset.

    Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

    Some samples are tested on site and others are sent to scientific laboratories.

    Data Publish Frequency

    Annually

    Data Triage Review Frequency

    Annually unless otherwise requested

    Supplementary information

    Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

    1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
    2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf
    3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)
    4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)
    5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)

  5. Table_1_Determination of sample size for a multinomial model coupled with...

    • frontiersin.figshare.com
    docx
    Updated Jul 2, 2024
    Cite
    Martyna Lukaszewicz; Brian Dennis (2024). Table_1_Determination of sample size for a multinomial model coupled with the phenology model.DOCX [Dataset]. http://doi.org/10.3389/fams.2024.1374832.s002
    Available download formats: docx
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    Frontiers
    Authors
    Martyna Lukaszewicz; Brian Dennis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Predicting the timing of phenological events is important in agriculture, especially for high-revenue products. A project sponsored by USDA-ARS had the objective of adapting a previously developed model for estimating proportions of insects in different development stages as a function of temperature (degree) and time (days) for predicting bloom in almond orchards. Data for the model normally form a two-way table of counts, with rows corresponding to sample percentages of different development stages and columns to sampling times. In this study, we report a technique developed to estimate sample sizes of multinomial and product multinomial models using a method of moments and determine the empirical coverage of sample size. This study aims to determine an appropriate sample size for data collection. This involves establishing a sampling distribution for the Pearson statistic, defined as the product of the sample size and the deviance of empirical proportions from population proportions. The intended outcome is to predict the optimal timing for harvesting crops at desired development stages when coupled with the phenology model, for which variability of the maximum likelihood estimates of the phenology model depends on sample size.
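
    To make the Pearson statistic mentioned above concrete, the sketch below uses the standard textbook form for a single multinomial sample, X² = n · Σ (p̂ᵢ − pᵢ)² / pᵢ, where p̂ᵢ are the empirical proportions and pᵢ the population proportions. Whether the study uses exactly this form is an assumption; the function name and the example counts are hypothetical.

```python
def pearson_statistic(counts, population_proportions):
    """Pearson chi-square statistic for one multinomial sample.

    Computes n * sum((p_hat_i - p_i)**2 / p_i), which equals the familiar
    sum((observed - expected)**2 / expected) with expected = n * p_i.
    """
    if len(counts) != len(population_proportions):
        raise ValueError("counts and proportions must have the same length")
    n = sum(counts)
    if n == 0:
        raise ValueError("sample size must be positive")
    return n * sum(
        (count / n - p) ** 2 / p
        for count, p in zip(counts, population_proportions)
    )


# Hypothetical example: 100 individuals over two development stages.
print(pearson_statistic([60, 40], [0.5, 0.5]))
```

    Because the statistic scales linearly with n for a fixed discrepancy in proportions, larger samples make the same deviation easier to detect, which is why its sampling distribution is relevant to choosing a sample size.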

  6. Comparison of phylogenetic path models produced (A) by the approach of...

    • plos.figshare.com
    xls
    Updated Jun 9, 2025
    Cite
    Ivett Pipoly; Veronika Bókony; Jean-Michel Gaillard; Jean-François Lemaître; Tamás Székely; András Liker (2025). Comparison of phylogenetic path models produced (A) by the approach of Santos [32] using the implementation of ‘piecewiseSEM’, and (B) by the approach of von Hardenberg and Gonzalez-Voyer [33] using the implementation of ‘phylopath’. For each model, the table shows the number of independence claims (k), the number of parameters (q), Fisher’s C-statistic (C) for model fit, and its associated p-value. AICc and CICc are the Akaike and C-statistic information criterion, respectively, corrected for small sample sizes. ΔAICc (and ΔCICc) indicates the difference in AICc (or CICc) values between the most supported model (lowest AICc or CICc, model m1.b) and the focal models. ΔAICc (and ΔCICc) > 2 indicates substantially higher support for the best [Dataset]. http://doi.org/10.1371/journal.pbio.3003156.t002
    Available download formats: xls
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ivett Pipoly; Veronika Bókony; Jean-Michel Gaillard; Jean-François Lemaître; Tamás Székely; András Liker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of phylogenetic path models produced (A) by the approach of Santos [32] using the implementation of ‘piecewiseSEM’, and (B) by the approach of von Hardenberg and Gonzalez-Voyer [33] using the implementation of ‘phylopath’. For each model, the table shows the number of independence claims (k), the number of parameters (q), Fisher’s C-statistic (C) for model fit, and its associated p-value. AICc and CICc are the Akaike and C-statistic information criterion, respectively, corrected for small sample sizes. ΔAICc (and ΔCICc) indicates the difference in AICc (or CICc) values between the most supported model (lowest AICc or CICc, model m1.b) and the focal models. ΔAICc (and ΔCICc) > 2 indicates substantially higher support for the best

  7. Statistics of the datasets.

    • plos.figshare.com
    xls
    Updated Feb 4, 2025
    Cite
    Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay (2025). Statistics of the datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316467.t001
    Available download formats: xls
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The unprecedented worldwide pandemic caused by COVID-19 has motivated several research groups to develop machine-learning-based approaches that aim to automate the diagnosis or screening of COVID-19 at large scale. The gold standard for COVID-19 detection, quantitative Real-Time Polymerase Chain Reaction (qRT-PCR), is expensive and time-consuming. Alternatively, haematology-based detection is fast and near-accurate, although it has been less explored. The external validity of haematology-based COVID-19 predictions on diverse populations is yet to be fully investigated. Here we report the external validity of machine-learning-based prediction scores from haematological parameters recorded in different hospitals of Brazil, Italy, and Western Europe (raw sample size, 195554). The XGBoost classifier performed consistently better (out of seven ML classifiers) on all the datasets. The working models include a set of either four or fourteen haematological parameters. The internal performances of the XGBoost models (AUC scores range from 84% to 97%) were superior to ML models reported in the literature for some of these datasets (AUC scores range from 84% to 87%). The meta-validation on the external performances revealed the reliability of the performance (AUC score 86%) along with good accuracy of the probabilistic prediction (Brier score 14%), particularly when the model was trained and tested on fourteen haematological parameters from the same country (Brazil). The external performance was reduced when the model was trained on datasets from Italy and tested on Brazil (AUC score 69%) and Western Europe (AUC score 65%), presumably affected by factors such as ethnicity, phenotype, immunity, and reference ranges across the populations. The advance of the present study is the development of a COVID-19 prediction tool that is reliable and parsimonious, using fewer haematological features than the earlier study with meta-validation, and based on a sufficient sample size (n = 195554). Thus, the current models can be applied at other demographic locations, preferably with prior training of the model on the same population. Availability: https://covipred.bits-hyderabad.ac.in/home; https://github.com/debashreebanerjee/CoviPred.

  8. Data from: 2010 County and City-Level Water-Use Data and Associated...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 20, 2025
    Cite
    U.S. Geological Survey (2025). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. 
Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC) which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
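
    The two fixed-effects extremes described above can be illustrated with a small sketch. This is not the R/Stan code from the data release: it is a pure-Python, single-covariate least-squares example, and the group labels and data points are hypothetical. A hierarchical model would sit between the two results by shrinking the per-group estimates toward a shared value, which is not shown here.

```python
from collections import defaultdict


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fully_pooled(records):
    """Ignore group labels: one coefficient estimate from all observations."""
    return ols_slope_intercept([x for _, x, _ in records], [y for _, _, y in records])


def unpooled(records):
    """Separate coefficient estimates using only each group's observations."""
    groups = defaultdict(list)
    for group, x, y in records:
        groups[group].append((x, y))
    return {
        group: ols_slope_intercept([x for x, _ in pts], [y for _, y in pts])
        for group, pts in groups.items()
    }


# Hypothetical (group, covariate, water use) observations for two climate regions.
data = [
    ("arid", 1, 3), ("arid", 2, 5), ("arid", 3, 7),
    ("humid", 1, 9), ("humid", 2, 8), ("humid", 3, 7),
]
print(fully_pooled(data))
print(unpooled(data))
```

    In this toy data the two groups have opposite slopes, so the pooled fit hides the structure that the unpooled fits recover; with few observations per group, the unpooled estimates would in turn become noisy, which is the motivation for the hierarchical compromise.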

  9. Phylodynamic Inference for Structured Epidemiological Models

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    tiff
    Updated Jun 4, 2023
    Cite
    David A. Rasmussen; Erik M. Volz; Katia Koelle (2023). Phylodynamic Inference for Structured Epidemiological Models [Dataset]. http://doi.org/10.1371/journal.pcbi.1003570
    Available download formats: tiff
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    David A. Rasmussen; Erik M. Volz; Katia Koelle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coalescent theory is routinely used to estimate past population dynamics and demographic parameters from genealogies. While early work in coalescent theory only considered simple demographic models, advances in theory have allowed for increasingly complex demographic scenarios to be considered. The success of this approach has led to coalescent-based inference methods being applied to populations with rapidly changing population dynamics, including pathogens like RNA viruses. However, fitting epidemiological models to genealogies via coalescent models remains a challenging task, because pathogen populations often exhibit complex, nonlinear dynamics and are structured by multiple factors. Moreover, it often becomes necessary to consider stochastic variation in population dynamics when fitting such complex models to real data. Using recently developed structured coalescent models that accommodate complex population dynamics and population structure, we develop a statistical framework for fitting stochastic epidemiological models to genealogies. By combining particle filtering methods with Bayesian Markov chain Monte Carlo methods, we are able to fit a wide class of stochastic, nonlinear epidemiological models with different forms of population structure to genealogies. We demonstrate our framework using two structured epidemiological models: a model with disease progression between multiple stages of infection and a two-population model reflecting spatial structure. We apply the multi-stage model to HIV genealogies and show that the proposed method can be used to estimate the stage-specific transmission rates and prevalence of HIV. Finally, using the two-population model we explore how much information about population structure is contained in genealogies and what sample sizes are necessary to reliably infer parameters like migration rates.

  10. GRDC - CCDM/CIC genomic prediction report

    • figshare.com
    bin
    Updated Jun 15, 2022
    Cite
    Darcy Jones (2022). GRDC - CCDM/CIC genomic prediction report [Dataset]. http://doi.org/10.6084/m9.figshare.20069921.v1
    Available download formats: bin
    Dataset updated
    Jun 15, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Darcy Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    S1Data-simulated.tar.gz - These are simulated data for evaluating prediction accuracy for different traits, numbers of markers, and numbers of samples in different testing scenarios.

    Other files are example results of the SelectML methods. These are the combined results of the optimise and predict scripts. The compressed folders are named by the simulated dataset that they correspond to and the model.

    The trait is first (e.g. A1 means an additive trait with 1 causal marker), N1000 means 1000 samples (in the training population), M1000 means 1000 markers sampled, "_CAUSAL" means that the causal loci were included in the sampled markers (note that this means for M1000 all markers sampled did have a genuine effect), and the final section before ".tar.gz" indicates the model used (e.g. sgd, xgb, BGLR).
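
    The naming convention above can be decoded mechanically. The sketch below assumes the parts are joined by underscores, as in the hypothetical name "A1_N1000_M1000_CAUSAL_xgb.tar.gz"; the exact delimiter and casing are not stated in the description, so the pattern and the function name are illustrative assumptions rather than part of SelectML.

```python
import re

# Assumed layout: <trait>_N<samples>_M<markers>[_CAUSAL]_<model>.tar.gz
# The underscore delimiter is a guess; adjust the pattern to the real file names.
ARCHIVE_PATTERN = re.compile(
    r"^(?P<trait>[A-Za-z]+\d+)"
    r"_N(?P<samples>\d+)"
    r"_M(?P<markers>\d+)"
    r"(?P<causal>_CAUSAL)?"
    r"_(?P<model>[A-Za-z0-9]+)"
    r"\.tar\.gz$"
)


def parse_archive_name(name):
    """Split a result archive name into trait, sample count, marker count, and model."""
    match = ARCHIVE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised archive name: {name!r}")
    return {
        "trait": match["trait"],
        "samples": int(match["samples"]),
        "markers": int(match["markers"]),
        "causal_included": match["causal"] is not None,
        "model": match["model"],
    }


print(parse_archive_name("A1_N1000_M1000_CAUSAL_xgb.tar.gz"))
```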

    Inside each of these compressed folders are the following files. From the selectml optimise command:

    regression_*_best.json - the best-performing combination of hyperparameters for these data and model type.
    regression_*_results.tsv - the Optuna run logs showing sampled parameters and the average mean squared error (of cross-validated samples) of models from that parameter set.
    regression_*_full_results.tsv - like _results.tsv, but includes other statistics relevant to the task, such as Pearson's correlation.

    And from selectml predict:

    regression_*_model.pkl - a stored version of the trained model given the best parameters from optimise, trained on the complete training dataset.
    regression_*_predictions.tsv - predicted results for all training datasets.
    regression_*_stats.tsv - summary statistics (e.g. MSE, Pearson's correlation) for the model in different test populations.
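The naming convention described for the compressed folders can be decoded programmatically. The following is a minimal sketch only: the description does not state the exact separator or token order, so the underscore-delimited pattern, the example name `A1_N1000_M1000_CAUSAL_sgd.tar.gz`, and the helper `parse_archive_name` are all assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

# ASSUMED pattern: <trait>_N<samples>_M<markers>[_CAUSAL]_<model>.tar.gz
# The real archives may use a different separator or token order.
_ARCHIVE_RE = re.compile(
    r"^(?P<trait>[A-Za-z0-9]+)"
    r"_N(?P<samples>\d+)"
    r"_M(?P<markers>\d+)"
    r"(?P<causal>_CAUSAL)?"
    r"_(?P<model>[A-Za-z0-9]+)"
    r"\.tar\.gz$"
)


class ArchiveInfo(NamedTuple):
    trait: str             # e.g. "A1": additive trait with 1 causal marker
    n_samples: int         # samples in the training population
    n_markers: int         # markers sampled
    causal_included: bool  # True if the causal loci were among the markers
    model: str             # e.g. "sgd", "xgb", "BGLR"


def parse_archive_name(name: str) -> Optional[ArchiveInfo]:
    """Return the decoded fields, or None if the name does not fit the assumed pattern."""
    match = _ARCHIVE_RE.match(name)
    if match is None:
        return None
    return ArchiveInfo(
        trait=match.group("trait"),
        n_samples=int(match.group("samples")),
        n_markers=int(match.group("markers")),
        causal_included=match.group("causal") is not None,
        model=match.group("model"),
    )


if __name__ == "__main__":
    # Hypothetical archive name, used only to exercise the parser.
    print(parse_archive_name("A1_N1000_M1000_CAUSAL_sgd.tar.gz"))
```

Returning `None` for names that do not fit keeps the sketch safe to run over a directory that also contains unrelated files such as S1Data-simulated.tar.gz.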

  11. Demographics of multi-site experimental data.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Seyed Mostafa Kia; Hester Huijsdens; Saige Rutherford; Augustijn de Boer; Richard Dinga; Thomas Wolfers; Pierre Berthet; Maarten Mennes; Ole A. Andreassen; Lars T. Westlye; Christian F. Beckmann; Andre F. Marquand (2023). Demographics of multi-site experimental data. [Dataset]. http://doi.org/10.1371/journal.pone.0278776.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Seyed Mostafa Kia; Hester Huijsdens; Saige Rutherford; Augustijn de Boer; Richard Dinga; Thomas Wolfers; Pierre Berthet; Maarten Mennes; Ole A. Andreassen; Lars T. Westlye; Christian F. Beckmann; Andre F. Marquand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (*) The HCPDV and HCPAG datasets are collected by the same data acquisition centers. We consider this in computing the total number of scanners in data.

  12. Data from: Genomic insights on conservation priorities for North Sea houting...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Apr 8, 2024
    Cite
    Aja Noersgaard Buur Tengstedt; Shenglin Liu; Magnus W. Jacobsen; Camilla Gundlund; Peter Rask Møller; Søren Berg; Dorte Bekkevold; Michael M. Hansen (2024). Genomic insights on conservation priorities for North Sea houting and European lake whitefish (Coregonus spp.) [Dataset]. http://doi.org/10.5061/dryad.qfttdz0r0
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    University of Copenhagen
    Technical University of Denmark
    Aarhus University
    Authors
    Aja Noersgaard Buur Tengstedt; Shenglin Liu; Magnus W. Jacobsen; Camilla Gundlund; Peter Rask Møller; Søren Berg; Dorte Bekkevold; Michael M. Hansen
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    North Sea
    Description

    Population genomics analysis holds great potential for informing conservation of endangered populations. We focused on a controversial case of European whitefish (Coregonus spp.) populations. The endangered North Sea houting is the only coregonid fish that tolerates oceanic salinities and was previously considered a species (C. oxyrhinchus) distinct from European lake whitefish (C. lavaretus). However, no firm evidence for genetic-based salinity adaptation has been available. Also, studies based on microsatellite and mitogenome data suggested surprisingly recent divergence (ca. 2,500 years bp) between houting and lake whitefish. These data types furthermore have provided no evidence for possible inbreeding. Finally, a controversial taxonomic revision recently classified all whitefish in the region as C. maraena, calling conservation priorities of houting into question. We used whole genome and ddRAD sequencing to analyze six lake whitefish populations and the only extant indigenous houting population. Demographic inference indicated postglacial expansion and divergence between lake whitefish and houting occurring not long after the Last Glaciation, implying deeper population histories than previous analyses. Runs of Homozygosity analysis suggested high inbreeding (FROH up to 30.6%) in some freshwater populations, but also FROH up to 10.6% in the houting, prompting conservation concerns. Finally, outlier scans provided evidence for adaptation to high salinities in the houting. Applying a framework for defining conservation units based on current and historical reproductive isolation and adaptive divergence led us to recommend that the houting be treated as a separate conservation unit regardless of species status. In total, the results underscore the potential of genomics to inform conservation practices, in this case clarifying conservation units and highlighting populations of concern.

    Methods

    A. Sampling and DNA extraction
    Samples of lake whitefish were collected in 1995-2012 from five locations in Denmark: brackish populations in the Ringkoebing fjord (RIN) and Nissum fjord (NIS), two lagoons connected with the North Sea, and freshwater populations from Lake Flynder (FLYN), Lake Glenstrup (GLEN) and Lake Nors (NORS); and a brackish population from one German location, Achterwasser (ACHT), a lagoon flowing into the Baltic Sea. Houting were collected from the single extant population in Vidaa (VID), a river with outlet into the Wadden Sea (Fig. 1A). Sampling was conducted by electrofishing (VID) and net fishing (remaining populations). Tissue samples consisted of adipose fin clips stored in ethanol at -20°C. DNA was extracted using either a phenol-chloroform method (Taggart et al., 1992) (ACHT, FLYN) or the E.Z.N.A.® Tissue DNA Kit (OMEGA, Bio-tek, CA, USA) following the manufacturer's recommendations (the remaining samples). In total, 35 individuals were whole-genome sequenced and 95 were ddRAD-sequenced (Table 1). A total of 23 individuals occur in both data sets and were consequently both ddRAD- and whole-genome sequenced.

    B. Whole-genome sequencing, mapping, and variant calling
    Library construction (using insert size ~300 bp) and whole-genome sequencing were outsourced to BGI (Beijing Genomics Institute, Hong Kong, China). Paired-end Illumina sequencing was conducted using the Illumina HiSeq 2500 platform with a read length of 150 bp. The sequence reads were mapped to the Coregonus sp. “Balchen” Alpine whitefish reference genome (De-Kayne et al. (2020); GenBank accession: GCA_902810595.1) using BWA-MEM v.0.7.17 (Li, 2013; Li & Durbin, 2009a) with default parameters. SAM format files were sorted, indexed and converted into BAM files using SAMtools v.1.9 (Danecek et al., 2021). Variants were called using the BCFtools v.1.2 (Danecek et al., 2021) functions mpileup and call with a minimum mapping quality requirement of 20.
    We used the ‘--multiallelic-caller’ option for calling, combined with ‘--variants-only’ to output only variant sites. To produce an ‘all sites’ data set containing both monomorphic and polymorphic sites, we repeated the SNP calling process without the ‘--variants-only’ parameter in BCFtools call.

    C. WGS data set generation
    We filtered the resulting VCF file containing variant sites with VCFutils.pl (Li et al., 2009b) and VCFtools v.0.1.16 (Danecek et al., 2011) to remove indels, monomorphic sites, multi-allelic SNPs and SNPs with a variant quality <20 or extreme depth of coverage (lower than 400 or higher than 1000 across all individuals) determined from the coverage distribution of SNPs (Fig. S1). The bimodal coverage distribution with two distinct peaks suggested the presence of paralogous loci, a well-known issue in salmonid fishes due to their tetraploid origin. In addition to excluding the variants in the higher coverage peak, which was centered at approximately twice the depth of the lower peak and thus likely represented duplicated regions, we also used VCFtools to discard SNPs located within putative duplicated genomic regions identified by De-Kayne et al. (2020). Furthermore, as loci with an excess of heterozygotes can also represent duplicated genomic regions, we removed SNPs out of Hardy-Weinberg equilibrium (HWE) in one or more populations using a custom R script (https://github.com/shenglin-liu/VCF_HWF). Tests for HWE were conducted using the statistic F_IS·√n (Brown, 1970), where F_IS is Wright’s fixation index within populations and n is the sample size. The statistic follows a standard normal distribution (mean 0, standard deviation 1). Negative values denote heterozygote excess and positive values heterozygote deficit, and absolute values > 1.96 are significant at the 5% level. The effects of the individual filtering steps are detailed in Supplementary Table S1.
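The HWE test described above can be illustrated with a short calculation. This is a minimal sketch, not the authors' R script: it assumes the statistic takes the form F·√n, with the fixation index F estimated at a single biallelic SNP as 1 − H_obs/H_exp from genotype counts. The function names and the example counts are hypothetical.

```python
import math


def hwe_z(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Return F * sqrt(n) for one biallelic SNP in one population.

    ASSUMPTION: F is estimated as 1 - H_obs / H_exp, where H_exp = 2p(1-p)
    and p is the reference allele frequency computed from the genotype counts.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_hom_ref + n_het) / (2 * n)
    h_exp = 2 * p * (1 - p)
    if h_exp == 0:
        raise ValueError("monomorphic site: fixation index is undefined")
    h_obs = n_het / n
    f = 1 - h_obs / h_exp
    return f * math.sqrt(n)


def out_of_hwe(z: float, threshold: float = 1.96) -> bool:
    """Two-sided test at the 5% level against a standard normal."""
    return abs(z) > threshold


if __name__ == "__main__":
    # Hypothetical counts: every individual heterozygous, as might be seen
    # when reads from duplicated regions collapse onto one locus.
    z = hwe_z(0, 100, 0)
    print(z, out_of_hwe(z))
```

A negative value signals heterozygote excess, which is the pattern the filter uses as a hint of collapsed duplicated regions; a positive value signals heterozygote deficit.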
    The resulting data set, hereafter referred to as the ‘HW-filtered WGS data set’, contained 16,898,181 SNPs. Additionally, we produced a ‘LD-pruned WGS data set’ with the addition of 5 individuals of the alpine whitefish species C. arenicolus (AREN) as an outgroup (Extended methods S1) by pruning SNPs on the basis of linkage disequilibrium (LD) in the HW-filtered WGS data set. Pruning was performed with the indep-pairwise function in PLINK v.1.9 (Purcell et al., 2007), where SNPs with r² > 0.1 were removed from sliding windows of 50 SNPs with 10 SNPs of overlap. A total of 596,078 SNPs remained after pruning.

    The ‘all sites’ data set was filtered to remove indels and sites with extreme depth of coverage or located in putative duplicated regions and SNPs not in HWE, as detailed for the ‘variant sites’ data set above. No filtering for minor allele frequency or missing data was performed. After filtering, the VCF contained 1,181,919,736 sites with individuals exhibiting between x and y % missing genotypes.

    D. Filtering for ROH analyses
    We opted to further filter the HW-filtered WGS data set to ensure only the most reliable genotype calls were retained. Following the protocol implemented in Balboa et al. (2024), we estimated mappability of the genome assembly with GENMAP v.1.3.0 (Pockrandt et al., 2020) using 100 bp k-mers and allowing for up to two mismatches, and we identified repetitive elements in the assembly with RepeatMasker v.4.1.2 (Smit et al., 2013) using ‘rmblast’ as the search engine and ‘Actinopterygii’ (ray-finned fishes) as the query species. Repeat regions and sites with a mappability score <1 were excluded from the analyses. In addition to the extreme depth filters applied as previously described, we furthermore used VCFtools to change individual genotypes with very low (DP<10) or very high read depth (DP>40) and genotypes with low quality (GQ<30) to missing (./.).
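The genotype-level masking step (DP<10, DP>40, or GQ<30 set to missing) can be sketched in plain Python. This is an illustration of the logic only, not a VCFtools invocation. The tuple representation of a call and the names `mask_genotype` and `mask_site` are assumptions introduced here.

```python
from typing import List, Optional, Tuple

# A genotype is a pair of allele indices; None stands for a missing call (./.).
Genotype = Optional[Tuple[int, int]]
# One call per individual: (genotype, read depth DP, genotype quality GQ).
Call = Tuple[Genotype, int, int]


def mask_genotype(
    gt: Genotype,
    dp: int,
    gq: int,
    min_dp: int = 10,
    max_dp: int = 40,
    min_gq: int = 30,
) -> Genotype:
    """Return the genotype unchanged, or None if it fails the depth/quality thresholds."""
    if gt is None:
        return None
    if dp < min_dp or dp > max_dp or gq < min_gq:
        return None
    return gt


def mask_site(calls: List[Call]) -> List[Genotype]:
    """Apply the mask to every individual's call at one site."""
    return [mask_genotype(gt, dp, gq) for gt, dp, gq in calls]


if __name__ == "__main__":
    # Hypothetical calls for four individuals at one site.
    site = [((0, 1), 20, 50), ((0, 0), 5, 60), ((1, 1), 45, 99), ((0, 1), 25, 12)]
    print(mask_site(site))
```

Masking individual genotypes rather than dropping whole sites means a later "no missing data" filter decides which sites survive.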
    Finally, only SNPs with variant quality (QUAL) >30 and no missing data were kept, resulting in a data set containing 2,646,198 SNPs.

    E. ddRAD sequencing, mapping, and loci assembly
    Samples were prepared using ddRADseq (Peterson et al., 2012). The ddRADseq libraries used PstI (6-base) and MspI (4-base) restriction enzymes. Two libraries of equal size were constructed (using insert size of 200-500 bp) and sequenced on an Illumina HiSeq2000 platform with 100 bp paired-end reads at BGI (Hong Kong, China). Raw reads were cleaned and demultiplexed with process_radtags in Stacks v.2.55 (Catchen et al., 2011; Catchen et al., 2013) in addition to being truncated to 90 bp (-t 90). Low-quality reads (phred score < 10 over a sliding window of 15% of the read length) were discarded. Mapping of reads to the Alpine whitefish reference genome (De-Kayne et al., 2020) proceeded as described for the whole-genome sequencing data. Loci were assembled from the aligned and sorted reads using gstacks v.2.55 with default parameters.

    F. ddRADseq data set generation
    The populations program in Stacks (Catchen et al., 2011; Catchen et al., 2013) was used to generate a preliminary VCF file including only loci present in all six populations (-p 6; GLEN was not analyzed by ddRAD sequencing) and at least 70 percent of individuals within each population (-r 0.7). Exports were ordered (--ordered-export) to ensure that only a single representative of each overlapping site was included. Loci out of HWE in one or more populations were filtered out using a custom R script, as previously described. Based on this data set, five individuals (two from ACHT, one from each of the populations NIS, NORS, and RIN) with more than 10 % missing data were identified.
We then generated a new VCF file excluding these five individuals using populations with parameters as previously stated, yielding a total of 347,397 SNPs, and a second VCF file with data analysis restricted to one random SNP per locus, yielding 141,157 SNPs. Both files were filtered to remove SNPs located within potentially duplicated regions of the genome (De-Kayne et al., 2020) and SNPs out of HWE in one or more populations as described for WGS data. A total of 254,693 SNPs and 105,452 SNPs,

  13. Locations and summary statistics for Octopus vulgaris samples, including...

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Javier Quinteiro; Jorge Rodríguez-Castro; Manuel Rey-Méndez; Nieves González-Henríquez (2023). Locations and summary statistics for Octopus vulgaris samples, including estimates of haplotype (h) and nucleotide (π) diversity, mismatch distribution parameters, neutrality, and demographic expansion test based on mitochondrial control region. [Dataset]. http://doi.org/10.1371/journal.pone.0230294.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javier Quinteiro; Jorge Rodríguez-Castro; Manuel Rey-Méndez; Nieves González-Henríquez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Locations and summary statistics for Octopus vulgaris samples, including estimates of haplotype (h) and nucleotide (π) diversity, mismatch distribution parameters, neutrality, and demographic expansion test based on mitochondrial control region.

  14. Data from: Dynamics of Ice Streams: A Physical Statistical Approach

    • access.earthdata.nasa.gov
    • cmr.earthdata.nasa.gov
    Updated Apr 20, 2017
    Cite
    (2017). Dynamics of Ice Streams: A Physical Statistical Approach [Dataset]. https://access.earthdata.nasa.gov/collections/C1214586540-SCIOPS
    Explore at:
    Dataset updated
    Apr 20, 2017
    Time period covered
    Jan 1, 1970 - Present
    Description

    Ice streams are believed to play a major role in determining the response of their parent ice sheet to climate change, and in determining global sea level by serving as regulators on the fresh water stored in the ice sheets. Ice streams are characterized by rapid, laterally confined flow which makes them uniquely identifiable within the body of the more slowly and more homogeneously flowing ice sheet. But while these characteristics enable the identification of ice streams, the processes which control ice-stream motion and evolution, and differences among ice streams in the polar regions, are only partially understood. Understanding the relative importance of lateral and basal drags, as well as the role of gradients in longitudinal stress, is essential for developing models for future evolution of the polar ice sheets. In this project, physical statistical models are used to explore the processes that control ice-stream flow, and to compare these processes between seemingly different ice-stream systems. In particular, the Northeast Ice Stream in Greenland will be investigated. Geophysical models lie at the core of the approach, but are embellished by statistical modeling of various components of variability. One important component comes from the uncertainty in observations on basal elevation, surface elevation, and surface velocity. In this project, new observational data collected using remote-sensing techniques are used. The various components, some of which are spatial, are combined hierarchically using Bayesian statistical methodology. All these are combined mathematically into a physical statistical model that yields the posterior distributions for basal and surface elevations, surface velocity fields, and stress fields, conditional on the data. Inference based on these distributions is carried out via Markov chain Monte Carlo techniques, to obtain estimates of these unknown fields along with uncertainty measures associated with them.

  15. Index of notation for parameter values, with default values, and variables...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    Cite
    D. B. Bonnéry; L. -S. Pretorius; A. E. C. Jooste; A. D. W. Geering; C. A. Gilligan (2023). Index of notation for parameter values, with default values, and variables used in computing an optimal sampling design for disease-free status. [Dataset]. http://doi.org/10.1371/journal.pone.0277725.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    D. B. Bonnéry; L. -S. Pretorius; A. E. C. Jooste; A. D. W. Geering; C. A. Gilligan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Index of notation for parameter values, with default values, and variables used in computing an optimal sampling design for disease-free status.

  16. Model comparison and fitness parameter outputs.

    • figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele (2023). Model comparison and fitness parameter outputs. [Dataset]. http://doi.org/10.1371/journal.pone.0264559.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model comparison and fitness parameter outputs.

  17. Description of the demographic attributes of the dataset.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Jorge Sánchez-Garcés; Nelly Rosario Moreno-Leyva; Lorena Marténez Soto; Alex Danny Chambi-Rodriguez; Dina Milagros Tapara-Yanarico; Dennis Karlo Silva-Vargas; Himer Avila-George (2023). Description of the demographic attributes of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0279989.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jorge Sánchez-Garcés; Nelly Rosario Moreno-Leyva; Lorena Marténez Soto; Alex Danny Chambi-Rodriguez; Dina Milagros Tapara-Yanarico; Dennis Karlo Silva-Vargas; Himer Avila-George
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of the demographic attributes of the dataset.

  18. Survey years of each country with respective weighted sample size.

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele (2023). Survey years of each country with respective weighted sample size. [Dataset]. http://doi.org/10.1371/journal.pone.0264559.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Survey years of each country with respective weighted sample size.


