18 datasets found

Representative synthetic dataset of Luxembourg’s citizens
data.public.lu
csv
Updated Dec 1, 2023
Cite
Luxembourg National Data Service (2023). Representative synthetic dataset of Luxembourg’s citizens [Dataset]. https://data.public.lu/en/datasets/representative-synthetic-dataset-of-luxembourgs-citizens/
Explore at:
csv(10936553), csv(108540) Available download formats
Dataset updated
Dec 1, 2023
Dataset authored and provided by
Luxembourg National Data Service
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Luxembourg
Description
The dataset was created using the open-source code released by LNDS (Luxembourg National Data Service). It is meant as an example of the dataset structure that anyone can generate and personalize by adjusting a set of fixed parameters, including the sample size. The file format is .csv, with individual profiles on the rows and their personal features on the columns. The information in the dataset was generated from statistical information about the age-structure distribution, the population of each municipality, the number of different nationalities present in Luxembourg, and salary statistics per municipality. The STATEC platform, the statistics portal of Luxembourg, is the public source of the real information ingested into the synthetic generation model. Other features, such as Date of birth, Social matricule, First name, Surname, Ethnicity, and physical attributes, were obtained from logical relationships between variables without exploiting any additional real information. In compliance with the law, the risk of identifying a real person purely by chance is kept close to zero.
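A minimal sketch of the kind of generation this description outlines, sampling individual profiles from aggregate distributions and writing them as rows of a CSV. The age bands, municipality weights, column names, and function names are hypothetical placeholders; they are not STATEC figures and this is not the LNDS code.

```python
import csv
import io
import random

# Hypothetical aggregate inputs; a real run would use published statistics.
AGE_BANDS = {(0, 17): 0.20, (18, 64): 0.62, (65, 95): 0.18}
MUNICIPALITIES = {"Municipality A": 0.5, "Municipality B": 0.3, "Municipality C": 0.2}


def generate_profiles(n, seed=0):
    """Draw n synthetic profiles from the aggregate distributions above."""
    rng = random.Random(seed)
    bands = list(AGE_BANDS)
    band_weights = list(AGE_BANDS.values())
    towns = list(MUNICIPALITIES)
    town_weights = list(MUNICIPALITIES.values())
    profiles = []
    for i in range(n):
        # Pick an age band by its weight, then an age uniformly inside the band.
        low, high = rng.choices(bands, weights=band_weights)[0]
        profiles.append({
            "id": i,
            "age": rng.randint(low, high),
            "municipality": rng.choices(towns, weights=town_weights)[0],
        })
    return profiles


def to_csv(profiles):
    """Serialize profiles with one individual per row and one feature per column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "age", "municipality"])
    writer.writeheader()
    writer.writerows(profiles)
    return buffer.getvalue()
```

The sample size is the only parameter exposed here; a fixed seed makes a run reproducible.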
Data from: Expert opinions of demographic rates of Argentine black and white...
catalog.data.gov
data.usgs.gov
+2 more
Updated Oct 29, 2025
Cite
U.S. Geological Survey (2025). Expert opinions of demographic rates of Argentine black and white tegus in South Florida [Dataset]. https://catalog.data.gov/dataset/expert-opinions-of-demographic-rates-of-argentine-black-and-white-tegus-in-south-florida
Dataset updated
Oct 29, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Florida
Description
We illustrate the utility of expert elicitation, explicit recognition of uncertainty, and the value of information for directing management and research efforts for invasive species, using tegu lizards (Salvator merianae) in southern Florida as a case study. We posited a post-birth pulse, matrix model, which was parameterized using a 3-point process to elicit estimates of tegu demographic rates from herpetology experts. We fit statistical distributions for each parameter and for each expert, then drew and pooled a large number of replicate samples from these to form a distribution for each demographic parameter. Using these distributions, we generated a large sample of matrix models to infer how the tegu population might respond to control efforts. We used the concepts of Pareto efficiency and stochastic dominance to conclude that targeting older age classes at relatively high rates appears to have the best chance of minimizing tegu abundance and control costs. Expert opinion combined with an explicit consideration of uncertainty can be valuable for conducting an initial assessment of the effort needed to control the invader. The value of information can be used to focus research in a way that not only helps increase the efficacy of control, but minimizes costs as well.
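The pooling step described above can be sketched as follows. The three-point estimates are invented for illustration, and the triangular distribution is an assumption standing in for whichever distribution family was actually fitted to each expert's answers.

```python
import random
import statistics

# Hypothetical three-point estimates (lowest, most likely, highest) of one
# demographic rate from three experts; these are not values from the study.
EXPERT_ESTIMATES = [
    (0.50, 0.70, 0.85),
    (0.40, 0.60, 0.90),
    (0.55, 0.75, 0.80),
]


def pooled_samples(estimates, draws_per_expert=10_000, seed=0):
    """Draw equally from each expert's distribution and pool the draws."""
    rng = random.Random(seed)
    pooled = []
    for low, mode, high in estimates:
        # Assumption: triangular(low, high, mode) approximates the fitted distribution.
        pooled.extend(rng.triangular(low, high, mode) for _ in range(draws_per_expert))
    return pooled


samples = pooled_samples(EXPERT_ESTIMATES)
deciles = statistics.quantiles(samples, n=10)
print(f"pooled median: {statistics.median(samples):.3f}")
print(f"10th to 90th percentile: {deciles[0]:.3f} to {deciles[-1]:.3f}")
```

Drawing the same number of samples per expert weights the experts equally; unequal weights would need unequal draw counts.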
Data and R code used in: Plant geographic distribution influences chemical...
repository.soilwise-he.eu
search.dataone.org
+1 more
Updated Jan 22, 2024
Cite
(2024). Data and R code used in: Plant geographic distribution influences chemical defenses in native and introduced Plantago lanceolata populations [Dataset]. http://doi.org/10.5061/dryad.5dv41nsd1
Unique identifier
https://doi.org/10.5061/dryad.5dv41nsd1
Dataset updated
Jan 22, 2024
Description
Open Access
# Data and R code used in: Plant geographic distribution influences chemical defenses in native and introduced Plantago lanceolata populations
## Description of the data and file structure
* 00_ReadMe_DescriptonVariables.csv: A list with the description of variables from each file used.
* 00_Metadata_Coordinates.csv: A dataset that includes the coordinates of each Plantago lanceolata population used.
* 00_Metadata_Climate.csv: A dataset that includes coordinates, bioclimatic parameters, and the results of PCA. The dataset was created based on the script '1_Environmental variables.qmd'.
* 00_Metadata_Individuals.csv: A dataset that includes general information about each plant individual. Information about root traits and chemistry is missing for four samples because those samples were lost.
* 01_Datset_PlantTraits.csv: Size-related and resource allocation traits measured in Plantago lanceolata, and herbivore damage.
* 02_Dataset_TargetedCompounds.csv: Quantification of phytohormones, iridoid glycosides, verbascoside, and flavonoids in the leaves and roots of Plantago lanceolata. Data generated from HPLC.
* 03_Dataset_Volatiles_Area.csv: Area of identified volatile compounds. Data generated from GC-FID.
* 03_Dataset_Volatiles_Compounds.csv: Information on identified volatile compounds. Data generated from GC-MS.
* 04_Dataset_Metabolome_Negative_Metadata.txt: Metadata for files in negative mode.
* 04_Dataset_Metabolome_Negative_Intensity.xlsx: File with the intensity of the metabolite features in negative mode. The file was generated from Metaboscape and adapted as required for the Notame package.
* 04_Dataset_Metabolome_Negative_Intensity_filtered.xlsx: File generated after preprocessing of features in negative mode. During preprocessing with the Notame package, zeros were converted to NA.
* 04_Dataset_Metabolome_Negative.msmsonly.csv: File with the intensity of the metabolite features in negative mode with MS/MS data. File generated from Metaboscape.
* 04_Results_Metabolome_Negative_canopus_compound_summary.tsv: Feature classification. Results generated from Sirius software.
* 04_Results_Metabolome_Negative_compound_identifications.tsv: Feature identification. Results generated from Sirius software.
* 05_Dataset_Metabolome_Positive_Metadata.txt: Metadata for files in positive mode.
* 05_DatasetMetabolome_Positive_Intensity.xlsx: File with the intensity of the metabolite features in positive mode. File generated from Metaboscape and adapted as required for the Notame package.
* 05_Dataset_Metabolome_Positive_Intensity_filtered: File generated after preprocessing of features in positive mode. During preprocessing with the Notame package, zeros were converted to NA.
## Code/Software
* 1_Environmental vairables.qmd: R script that retrieves bioclimatic variables based on the coordinates of each population and then performs a principal components analysis to reduce the axes of variation. The first principal component is included as an explanatory variable in the model that estimates trait differences between native and introduced populations. Figure 1b and 1d.
* 2_PlantTraits_and_Herbivory: R script for statistical analysis of size-related traits, resource allocation traits and herbivore damage. Figure 2. It needs to source: Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R
* 3_Metabolome: R script for statistical analysis of the Plantago lanceolata metabolome. Figure 3. It needs to source: Metabolome_preprocessing_R, Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R.
* 4_TargetedCompounds: R script for statistical analysis of Plantago lanceolata targeted compounds. Figure 4. It needs to source: Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R
* 5_Volatilome: R script for statistical analysis of the Plantago lanceolata metabolome. Figure 5. It needs to source: Model_1_Fucntion.R, Model_2_Fucntion.R, Plot_Function.R
* Model_1_Function.R: Function to run statistical models
* Model_2_Function.R: Function to run statistical models
* Plots_Function.R: Function to plot graphs
* Metabolome_prepocessing.R: Script to preprocess features
Portsmouth Water Drinking Water Quality Data 2022 2023 2024
hub.arcgis.com
streamwaterdata.co.uk
+1 more
Updated Oct 1, 2025
Cite
AHughes_Portsmouth (2025). Portsmouth Water Drinking Water Quality Data 2022 2023 2024 [Dataset]. https://hub.arcgis.com/datasets/d3165fd17d624b22a9900d47677dfa45
Dataset updated
Oct 1, 2025
Dataset authored and provided by
AHughes_Portsmouth
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

Key Definitions

Aggregation

Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes

Anonymisation

Anonymisation is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets to preserve a data subject's privacy.

Dataset

Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

Determinand

A constituent or property of drinking water which can be determined or estimated.

DWI

Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

DWI Determinands

Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

Granularity

Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours

ID

Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

LSOA

A Lower-Level Super Output Area is a small geographic area used for statistical and administrative purposes by the Office for National Statistics. LSOAs are designed to have similar population sizes, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households, allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.

ONS

Office for National Statistics

Open Data Triage

The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.

Sample

A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

Schema

Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

Units

Standard measurements used to quantify and compare different physical quantities.

Water Quality

The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

Data History

Data Origin

These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

Data Triage Considerations

Granularity

Is it useful to share results as averages or as individual results?

We decided to share individual results, the lowest level of granularity.

Anonymisation

It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

· Water Supply Zone (WSZ) - Limits interoperability with other datasets

· Postcode – Some postcodes contain very few households and may not offer necessary anonymisation

· Postal Sector – Deemed not granular enough in highly populated areas

· Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas

· MSOA – Deemed not granular enough

· LSOA – Agreed as a recognised standard appropriate for England and Wales

· Data Zones – Agreed as a recognised standard appropriate for Scotland

Data Specifications

Each dataset will cover a calendar year of samples

This dataset will be published annually

Historical datasets will be published as far back as 2016, from the introduction of The Water Supply (Water Quality) Regulations 2016

The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.

Context

Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.

Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

Some samples are tested on site and others are sent to scientific laboratories.

Data Publish Frequency

Annually

Data Triage Review Frequency

Annually unless otherwise requested

Supplementary information

Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/

2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf

3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)

4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)

5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
Table_1_Determination of sample size for a multinomial model coupled with...
frontiersin.figshare.com
docx
Updated Jul 2, 2024
Cite
Martyna Lukaszewicz; Brian Dennis (2024). Table_1_Determination of sample size for a multinomial model coupled with the phenology model.DOCX [Dataset]. http://doi.org/10.3389/fams.2024.1374832.s002
Explore at:
docx Available download formats
Unique identifier
https://doi.org/10.3389/fams.2024.1374832.s002
Dataset updated
Jul 2, 2024
Dataset provided by
Frontiers
Authors
Martyna Lukaszewicz; Brian Dennis
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predicting the timing of phenological events is important in agriculture, especially for high-revenue products. A project sponsored by USDA-ARS had the objective of adapting a previously developed model for estimating proportions of insects in different development stages as a function of temperature (degree) and time (days) for predicting bloom in almond orchards. Data for the model normally form a two-way table of counts, with rows corresponding to sample percentages of different development stages and columns to sampling times. In this study, we report a technique developed to estimate sample sizes of multinomial and product multinomial models using a method of moments and determine the empirical coverage of sample size. This study aims to determine an appropriate sample size for data collection. This involves establishing a sampling distribution for the Pearson statistic, defined as the product of the sample size and the deviance of empirical proportions from population proportions. The intended outcome is to predict the optimal timing for harvesting crops at desired development stages when coupled with the phenology model, for which variability of the maximum likelihood estimates of the phenology model depends on sample size.
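As a rough illustration of the Pearson statistic mentioned above, the sketch below uses the standard chi-square form, n times the sum over categories of (empirical proportion minus population proportion) squared divided by the population proportion. That this matches the exact definition used in the paper is an assumption, and the function name is hypothetical.

```python
def pearson_statistic(counts, population_proportions):
    """Pearson chi-square statistic for one multinomial sample.

    Computes n * sum((p_hat_i - p_i)**2 / p_i), where p_hat_i are the
    empirical proportions and p_i the population proportions.
    """
    n = sum(counts)
    if n == 0:
        raise ValueError("counts must contain at least one observation")
    if len(counts) != len(population_proportions):
        raise ValueError("counts and proportions must have the same length")
    total = 0.0
    for count, p in zip(counts, population_proportions):
        p_hat = count / n
        total += (p_hat - p) ** 2 / p
    return n * total
```

For example, 60 and 40 counts against equal population proportions give a statistic of 4, while counts that match the population proportions exactly give 0.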
Comparison of phylogenetic path models produced (A) by the approach of...
plos.figshare.com
xls
Updated Jun 9, 2025
Cite
Ivett Pipoly; Veronika Bókony; Jean-Michel Gaillard; Jean-François Lemaître; Tamás Székely; András Liker (2025). Comparison of phylogenetic path models produced (A) by the approach of Santos [32] using the implementation of ‘piecewiseSEM’, and (B) by the approach of von Hardenberg and Gonzalez-Voyer [33] using the implementation of ‘phylopath’. For each model, the table shows the number of independence claims (k), the number of parameters (q), Fisher’s C-statistic (C) for model fit, and its associated p-value. AICc and CICc are the Akaike and C-statistic information criterion, respectively, corrected for small sample sizes. ΔAICc (and ΔCICc) indicates the difference in AICc (or CICc) values between the most supported model (lowest AICc or CICc, model m1.b) and the focal models. ΔAICc (and ΔCICc) > 2 indicates substantially higher support for the best [Dataset]. http://doi.org/10.1371/journal.pbio.3003156.t002
Explore at:
xls Available download formats
Unique identifier
https://doi.org/10.1371/journal.pbio.3003156.t002
Dataset updated
Jun 9, 2025
Dataset provided by
PLOS http://plos.org/
Authors
Ivett Pipoly; Veronika Bókony; Jean-Michel Gaillard; Jean-François Lemaître; Tamás Székely; András Liker
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of phylogenetic path models produced (A) by the approach of Santos [32] using the implementation of ‘piecewiseSEM’, and (B) by the approach of von Hardenberg and Gonzalez-Voyer [33] using the implementation of ‘phylopath’. For each model, the table shows the number of independence claims (k), the number of parameters (q), Fisher’s C-statistic (C) for model fit, and its associated p-value. AICc and CICc are the Akaike and C-statistic information criterion, respectively, corrected for small sample sizes. ΔAICc (and ΔCICc) indicates the difference in AICc (or CICc) values between the most supported model (lowest AICc or CICc, model m1.b) and the focal models. ΔAICc (and ΔCICc) > 2 indicates substantially higher support for the best
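The quantities named in this caption can be sketched with the commonly cited formulas: Fisher's C as minus two times the sum of the log p-values of the k independence claims, compared against a chi-square distribution with 2k degrees of freedom, and CICc as C + 2q · n / (n − q − 1). The function names are hypothetical and the code is not taken from 'piecewiseSEM' or 'phylopath'.

```python
import math


def fishers_c(p_values):
    """Fisher's C: -2 times the sum of log p-values of the independence claims."""
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_sf_even_df(x, k):
    """Upper-tail probability of a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum over j < k of (x/2)**j / j!.
    """
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))


def cicc(c_stat, q, n):
    """Small-sample corrected C-statistic information criterion."""
    return c_stat + 2.0 * q * n / (n - q - 1)


# Hypothetical p-values for k = 3 independence claims, q = 5 parameters, n = 40 species.
claims = [0.40, 0.15, 0.72]
c = fishers_c(claims)
print(f"C = {c:.3f}, p = {chi2_sf_even_df(c, len(claims)):.3f}, CICc = {cicc(c, 5, 40):.3f}")
```

With a single claim the model p-value equals the claim's own p-value, which is a quick sanity check on the closed-form tail.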
Statistics of the datasets.
plos.figshare.com
xls
Updated Feb 4, 2025
Cite
Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay (2025). Statistics of the datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316467.t001
Explore at:
xls Available download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316467.t001
Dataset updated
Feb 4, 2025
Dataset provided by
PLOS http://plos.org/
Authors
Ali Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The unprecedented worldwide pandemic caused by COVID-19 has motivated several research groups to develop machine-learning based approaches that aim to automate the diagnosis or screening of COVID-19 at large scale. The gold standard for COVID-19 detection, quantitative-Real-Time-Polymerase-Chain-Reaction (qRT-PCR), is expensive and time-consuming. Alternatively, haematology-based detections were fast and near-accurate, although those were less explored. The external validity of the haematology-based COVID-19 predictions on diverse populations is yet to be fully investigated. Here we report external validity of machine learning-based prediction scores from haematological parameters recorded in different hospitals of Brazil, Italy, and Western Europe (raw sample size, 195554). The XGBoost classifier performed consistently better (out of seven ML classifiers) on all the datasets. The working models include a set of either four or fourteen haematological parameters. The internal performances of the XGBoost models (AUC scores range from 84% to 97%) were superior to ML models reported in the literature for some of these datasets (AUC scores range from 84% to 87%). The meta-validation on the external performances revealed the reliability of the performance (AUC score 86%) along with good accuracy of the probabilistic prediction (Brier score 14%), particularly when the model was trained and tested on fourteen haematological parameters from the same country (Brazil). The external performance was reduced when the model was trained on datasets from Italy and tested on Brazil (AUC score 69%) and Western Europe (AUC score 65%), presumably affected by factors such as ethnicity, phenotype, immunity, and reference ranges across the populations.
The state-of-the-art in the present study is the development of a COVID-19 prediction tool that is reliable and parsimonious, using fewer haematological features than the earlier study with meta-validation, based on sufficient sample size (n = 195554). Thus, current models can be applied at other demographic locations, preferably with prior training of the model on the same population. Availability: https://covipred.bits-hyderabad.ac.in/home; https://github.com/debashreebanerjee/CoviPred.
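For readers unfamiliar with the two metrics quoted above, here is a small sketch of how an AUC score and a Brier score are computed from true labels and predicted probabilities. The function names and the toy data are hypothetical; this is not the CoviPred code.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = COVID-19 positive) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(f"AUC = {roc_auc(labels, scores):.2f}, Brier = {brier_score(labels, scores):.3f}")
```

A higher AUC means better ranking of positives over negatives, while a lower Brier score means better calibrated probabilities.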
Data from: 2010 County and City-Level Water-Use Data and Associated...
catalog.data.gov
data.usgs.gov
+1 more
Updated Nov 20, 2025
Cite
U.S. Geological Survey (2025). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
Dataset updated
Nov 20, 2025
Dataset provided by
U.S. Geological Survey
Description
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. 
Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC) which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
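The contrast between the fully pooled and unpooled fixed-effects models can be sketched with ordinary least squares on toy data. The group names and values are invented, and the sketch is in Python rather than the R and Stan scripts that ship with the data release. The hierarchical compromise between the two is not implemented here.

```python
def ols_slope_intercept(xs, ys):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fully_pooled(groups):
    """Ignore group membership and fit one coefficient set to all observations."""
    xs = [x for points in groups.values() for x, _ in points]
    ys = [y for points in groups.values() for _, y in points]
    return ols_slope_intercept(xs, ys)


def unpooled(groups):
    """Fit a separate coefficient set using only each group's observations."""
    return {
        name: ols_slope_intercept([x for x, _ in points], [y for _, y in points])
        for name, points in groups.items()
    }


# Hypothetical (covariate, water use) pairs for two climate regions.
toy_groups = {
    "humid": [(1, 2), (2, 3), (3, 4)],
    "arid": [(1, 6), (2, 8), (3, 10)],
}
print("pooled:", fully_pooled(toy_groups))
print("unpooled:", unpooled(toy_groups))
```

On this toy data the pooled slope falls between the two group slopes, which is the behaviour a hierarchical model would moderate by letting group estimates borrow strength from one another.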
Phylodynamic Inference for Structured Epidemiological Models
plos.figshare.com
datasetcatalog.nlm.nih.gov
tiff
Updated Jun 4, 2023
Cite
David A. Rasmussen; Erik M. Volz; Katia Koelle (2023). Phylodynamic Inference for Structured Epidemiological Models [Dataset]. http://doi.org/10.1371/journal.pcbi.1003570
Explore at:
tiff Available download formats
Unique identifier
https://doi.org/10.1371/journal.pcbi.1003570
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS http://plos.org/
Authors
David A. Rasmussen; Erik M. Volz; Katia Koelle
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Coalescent theory is routinely used to estimate past population dynamics and demographic parameters from genealogies. While early work in coalescent theory only considered simple demographic models, advances in theory have allowed for increasingly complex demographic scenarios to be considered. The success of this approach has led to coalescent-based inference methods being applied to populations with rapidly changing population dynamics, including pathogens like RNA viruses. However, fitting epidemiological models to genealogies via coalescent models remains a challenging task, because pathogen populations often exhibit complex, nonlinear dynamics and are structured by multiple factors. Moreover, it often becomes necessary to consider stochastic variation in population dynamics when fitting such complex models to real data. Using recently developed structured coalescent models that accommodate complex population dynamics and population structure, we develop a statistical framework for fitting stochastic epidemiological models to genealogies. By combining particle filtering methods with Bayesian Markov chain Monte Carlo methods, we are able to fit a wide class of stochastic, nonlinear epidemiological models with different forms of population structure to genealogies. We demonstrate our framework using two structured epidemiological models: a model with disease progression between multiple stages of infection and a two-population model reflecting spatial structure. We apply the multi-stage model to HIV genealogies and show that the proposed method can be used to estimate the stage-specific transmission rates and prevalence of HIV. Finally, using the two-population model we explore how much information about population structure is contained in genealogies and what sample sizes are necessary to reliably infer parameters like migration rates.
GRDC - CCDM/CIC genomic prediction report
figshare.com
bin
Updated Jun 15, 2022
Cite
Darcy Jones (2022). GRDC - CCDM/CIC genomic prediction report [Dataset]. http://doi.org/10.6084/m9.figshare.20069921.v1
Explore at:
bin Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20069921.v1
Dataset updated
Jun 15, 2022
Dataset provided by
Figshare http://figshare.com/
Authors
Darcy Jones
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
S1Data-simulated.tar.gz - These are simulated data for evaluating prediction accuracy for different traits, numbers of markers, and numbers of samples in different testing scenarios.

Other files are example results of the SelectML methods. These are the combined results of the optimise and predict scripts. The compressed folders are named by the simulated dataset that they correspond to and the model.

The trait is first (e.g. A1 means an additive trait with 1 causal marker), N1000 means 1000 samples (in the training population), M1000 means 1000 markers sampled, "_CAUSAL" means that the causal loci were included in the sampled markers (note that this means for M1000 all markers sampled did have a genuine effect), and the final section before ".tar.gz" indicates the model used (e.g. sgd, xgb, BGLR).
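The naming scheme above could be parsed along these lines. The underscore-delimited layout and the example file names are assumptions made for illustration, since the listing does not show an actual archive name, and the function name is hypothetical.

```python
import re

# Assumed layout: <trait>_N<samples>_M<markers>[_CAUSAL]_<model>.tar.gz
ARCHIVE_PATTERN = re.compile(
    r"^(?P<trait>[A-Za-z]+\d+)"
    r"_N(?P<samples>\d+)"
    r"_M(?P<markers>\d+)"
    r"(?P<causal>_CAUSAL)?"
    r"_(?P<model>[A-Za-z0-9]+)"
    r"\.tar\.gz$"
)


def parse_archive_name(name):
    """Split an archive name into trait, sample count, marker count, causal flag and model."""
    match = ARCHIVE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised archive name: {name}")
    return {
        "trait": match["trait"],
        "samples": int(match["samples"]),
        "markers": int(match["markers"]),
        "causal_included": match["causal"] is not None,
        "model": match["model"],
    }


# Hypothetical file name following the assumed layout.
print(parse_archive_name("A1_N1000_M1000_CAUSAL_xgb.tar.gz"))
```

If the real archives use a different separator or ordering, only the pattern needs to change.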

Inside each of these compressed folders are the following files. From the selectml optimise command:

regression_*_best.json - The best performing combination of hyperparameters for these data and model type.
regression_*_results.tsv - The optuna running logs showing sampled parameters and average mean squared error (of cross validated samples) of models from that parameter set.
regression_*_full_results.tsv - Like _results.tsv but includes other statistics relevant to the task, such as Pearson's correlation.

And from selectml predict:
regression_*_model.pkl - A stored version of the trained model given the best parameters from optimise, trained from the complete train dataset.
regression_*_predictions.tsv - Predicted results for all training datasets.
regression_*_stats.tsv - Summary statistics (e.g. MSE, Pearson's correlation) for the model in different test populations.
Demographics of multi-site experimental data.
plos.figshare.com
figshare.com
xls
Updated Jun 21, 2023
Cite
Seyed Mostafa Kia; Hester Huijsdens; Saige Rutherford; Augustijn de Boer; Richard Dinga; Thomas Wolfers; Pierre Berthet; Maarten Mennes; Ole A. Andreassen; Lars T. Westlye; Christian F. Beckmann; Andre F. Marquand (2023). Demographics of multi-site experimental data. [Dataset]. http://doi.org/10.1371/journal.pone.0278776.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0278776.t001
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Seyed Mostafa Kia; Hester Huijsdens; Saige Rutherford; Augustijn de Boer; Richard Dinga; Thomas Wolfers; Pierre Berthet; Maarten Mennes; Ole A. Andreassen; Lars T. Westlye; Christian F. Beckmann; Andre F. Marquand
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
(*) The HCPDV and HCPAG datasets were collected by the same data acquisition centers. We account for this when computing the total number of scanners in the data.
Data from: Genomic insights on conservation priorities for North Sea houting...
data.niaid.nih.gov
search.dataone.org
+1 more
zip
Updated Apr 8, 2024
Cite
Aja Noersgaard Buur Tengstedt; Shenglin Liu; Magnus W. Jacobsen; Camilla Gundlund; Peter Rask Møller; Søren Berg; Dorte Bekkevold; Michael M. Hansen (2024). Genomic insights on conservation priorities for North Sea houting and European lake whitefish (Coregonus spp.) [Dataset]. http://doi.org/10.5061/dryad.qfttdz0r0
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.qfttdz0r0
Dataset updated
Apr 8, 2024
Dataset provided by
University of Copenhagen
Technical University of Denmark
Aarhus University
Authors
Aja Noersgaard Buur Tengstedt; Shenglin Liu; Magnus W. Jacobsen; Camilla Gundlund; Peter Rask Møller; Søren Berg; Dorte Bekkevold; Michael M. Hansen
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
North Sea
Description
Population genomics analysis holds great potential for informing conservation of endangered populations. We focused on a controversial case of European whitefish (Coregonus spp.) populations. The endangered North Sea houting is the only coregonid fish that tolerates oceanic salinities and was previously considered a species (C. oxyrhinchus) distinct from European lake whitefish (C. lavaretus). However, no firm evidence for genetic-based salinity adaptation has been available. Also, studies based on microsatellite and mitogenome data suggested surprisingly recent divergence (ca. 2,500 years BP) between houting and lake whitefish. These data types furthermore have provided no evidence for possible inbreeding. Finally, a controversial taxonomic revision recently classified all whitefish in the region as C. maraena, calling conservation priorities of houting into question. We used whole genome and ddRAD sequencing to analyze six lake whitefish populations and the only extant indigenous houting population. Demographic inference indicated postglacial expansion and divergence between lake whitefish and houting occurring not long after the Last Glaciation, implying deeper population histories than previous analyses. Runs of homozygosity (ROH) analysis suggested high inbreeding (FROH up to 30.6%) in some freshwater populations, but also FROH up to 10.6% in the houting, prompting conservation concerns. Finally, outlier scans provided evidence for adaptation to high salinities in the houting. Applying a framework for defining conservation units based on current and historical reproductive isolation and adaptive divergence led us to recommend that the houting be treated as a separate conservation unit regardless of species status. In total, the results underscore the potential of genomics to inform conservation practices, in this case clarifying conservation units and highlighting populations of concern.

Methods

A. Sampling and DNA extraction
Samples of lake whitefish were collected in 1995-2012 from five locations in Denmark: brackish populations in the Ringkoebing fjord (RIN) and Nissum fjord (NIS), two lagoons connected with the North Sea, and freshwater populations from Lake Flynder (FLYN), Lake Glenstrup (GLEN) and Lake Nors (NORS). A further brackish population was sampled from one German location, Achterwasser (ACHT), a lagoon flowing into the Baltic Sea. Houting were collected from the single extant population in Vidaa (VID), a river with outlet into the Wadden Sea (Fig. 1A). Sampling was conducted by electrofishing (VID) and net fishing (remaining populations). Tissue samples consisted of adipose fin clips stored in ethanol at -20°C. DNA was extracted using either a phenol-chloroform method (Taggart et al., 1992) (ACHT, FLYN) or the E.Z.N.A.® Tissue DNA Kit (OMEGA, Bio-tek, CA, USA) following the manufacturer's recommendations (the remaining samples). In total, 35 individuals were whole-genome sequenced and 95 were ddRAD-sequenced (Table 1). A group of 23 individuals occurs in both data sets; these were both ddRAD and whole-genome sequenced.

B. Whole-genome sequencing, mapping, and variant calling
Library construction (using insert size ~300 bp) and whole-genome sequencing were outsourced to BGI (Beijing Genomics Institute, Hong Kong, China). Paired-end sequencing was conducted on the Illumina HiSeq 2500 platform with a read length of 150 bp. The sequence reads were mapped to the Coregonus sp. “Balchen” Alpine whitefish reference genome (De-Kayne et al. (2020); GenBank accession: GCA_902810595.1) using BWA-MEM v.0.7.17 (Li, 2013; Li & Durbin, 2009a) with default parameters. SAM format files were sorted, indexed and converted into BAM files using SAMtools v.1.9 (Danecek et al., 2021). Variants were called using the BCFtools v.1.2 (Danecek et al., 2021) functions mpileup and call with a minimum mapping quality requirement of 20.
We used the ‘--multiallelic-caller’ for calling combined with ‘--variants-only’ to output only variant sites. To produce an ‘all sites’ data set containing both monomorphic and polymorphic sites, we repeated the SNP calling process without the ‘--variants-only’ parameter in BCFtools call.

C. WGS data set generation
We filtered the resulting VCF file containing variant sites with VCFutils.pl (Li et al., 2009b) and VCFtools v.0.1.16 (Danecek et al., 2011) to remove indels, monomorphic sites, multi-allelic SNPs and SNPs with a variant quality <20 or extreme depth of coverage (lower than 400 or higher than 1000 across all individuals), with thresholds determined from the coverage distribution of SNPs (Fig. S1). The bimodal coverage distribution with two distinct peaks suggested the presence of paralogous loci, a well-known issue in salmonid fishes due to their tetraploid origin. In addition to excluding the variants in the higher coverage peak, which was centered at approximately twice the depth of the lower peak and thus likely represented duplicated regions, we also used VCFtools to discard SNPs located within putative duplicated genomic regions identified by De-Kayne et al. (2020). Furthermore, as loci with an excess of heterozygotes can also represent duplicated genomic regions, we removed SNPs out of Hardy-Weinberg equilibrium (HWE) in one or more populations using a custom R script (https://github.com/shenglin-liu/VCF_HWF). Tests for HWE were conducted using a statistic from Brown (1970) that is based on Wright’s fixation index within populations and the sample size. The statistic follows a standard normal distribution with a mean of 0 and a standard deviation of 1. Negative values denote heterozygote excess and positive values heterozygote deficit, and absolute values > 1.96 are significant at the 5 % level. The effects of the individual filtering steps are detailed in Supplementary Table S1.
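The text does not spell out the formula behind the HWE test. The sketch below assumes a commonly used form, z = F × sqrt(N), where F = 1 - H_obs / H_exp is the within-population fixation index computed from genotype counts at a biallelic SNP. This form matches the stated properties (standard normal under HWE, negative for heterozygote excess), but it is an assumption and may differ from the authors' R script; the function name is hypothetical.

```python
import math
from typing import Optional


def hwe_z_statistic(n_hom_ref: int, n_het: int, n_hom_alt: int) -> Optional[float]:
    """Assumed HWE test statistic z = F * sqrt(N) for one biallelic SNP.

    Returns None for monomorphic sites, where F is undefined.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    h_exp = 2 * p * (1 - p)                # expected heterozygosity under HWE
    if h_exp == 0:
        return None
    h_obs = n_het / n                      # observed heterozygosity
    f = 1 - h_obs / h_exp                  # fixation index (assumed estimator)
    return f * math.sqrt(n)


def out_of_hwe(z: Optional[float], critical: float = 1.96) -> bool:
    """Two-sided test at the 5 % level, as described in the text."""
    return z is not None and abs(z) > critical
```

Under this assumption, a site where every individual is heterozygous yields a strongly negative value, the pattern expected for collapsed paralogous loci.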
The resulting data set, hereafter referred to as the ‘HW-filtered WGS data set’, contained 16,898,181 SNPs. Additionally, we produced an ‘LD-pruned WGS data set’ with the addition of 5 individuals of the alpine whitefish species C. arenicolus (AREN) as an outgroup (Extended methods S1) by pruning SNPs on the basis of linkage disequilibrium (LD) in the HW-filtered WGS data set. Pruning was performed with the indep-pairwise function in PLINK v.1.9 (Purcell et al., 2007), where SNPs with r² > 0.1 were removed from sliding windows of 50 SNPs with 10 SNPs of overlap. A total of 596,078 SNPs remained after pruning. The ‘all sites’ data set was filtered to remove indels, sites with extreme depth of coverage, sites located in putative duplicated regions, and SNPs not in HWE, as detailed for the ‘variant sites’ data set above. No filtering for minor allele frequency or missing data was performed. After filtering, the VCF contained 1,181,919,736 sites with individuals exhibiting between x and y % missing genotypes.

D. Filtering for ROH analyses
We opted to further filter the HW-filtered WGS data set to ensure that only the most reliable genotype calls were retained. Following the protocol implemented in Balboa et al. (2024), we estimated mappability of the genome assembly with GENMAP v.1.3.0 (Pockrandt et al., 2020) using 100 bp k-mers and allowing for up to two mismatches, and we identified repetitive elements in the assembly with RepeatMasker v.4.1.2 (Smit et al., 2013) using ‘rmblast’ as the search engine and ‘Actinopterygii’ (ray-finned fishes) as the query species. Repeat regions and sites with a mappability score <1 were excluded from the analyses. In addition to the extreme depth filters described above, we used VCFtools to set individual genotypes with very low (DP<10) or very high read depth (DP>40) and genotypes with low quality (GQ<30) to missing (./.).
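The per-genotype masking rule just described can be written as a small predicate. This is a minimal Python sketch of the logic only; the study applied the filter with VCFtools, and the function name and argument layout here are hypothetical.

```python
MISSING = "./."


def mask_genotype(gt: str, dp: int, gq: int,
                  min_dp: int = 10, max_dp: int = 40, min_gq: int = 30) -> str:
    """Return the genotype unchanged, or './.' if it fails a depth or quality bound.

    Thresholds follow the text: DP < 10, DP > 40 or GQ < 30 become missing.
    """
    if dp < min_dp or dp > max_dp or gq < min_gq:
        return MISSING
    return gt


# Example: three calls at one site as (GT, DP, GQ) tuples.
calls = [("0/1", 9, 50), ("0/1", 22, 45), ("1/1", 41, 99)]
masked = [mask_genotype(gt, dp, gq) for gt, dp, gq in calls]
```

Values exactly at a bound (DP of 10 or 40, GQ of 30) are kept, matching the strict inequalities in the text.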
Finally, only SNPs with variant quality (QUAL) >30 and no missing data were kept, resulting in a data set containing 2,646,198 SNPs.

E. ddRAD sequencing, mapping, and loci assembly
Samples were prepared using ddRADseq (Peterson et al., 2012). The ddRADseq libraries used PstI (6-base) and MspI (4-base) restriction enzymes. Two libraries of equal size were constructed (using insert size of 200-500 bp) and sequenced on an Illumina HiSeq2000 platform with 100 bp paired-end reads at BGI (Hong Kong, China). Raw reads were cleaned and demultiplexed with process_radtags in Stacks v.2.55 (Catchen et al., 2011; Catchen et al., 2013) in addition to being truncated to 90 bp (-t 90). Low-quality reads (phred score < 10 over a sliding window of 15% of the read length) were discarded. Mapping of reads to the Alpine whitefish reference genome (De-Kayne et al., 2020) progressed as described for the whole-genome sequencing data. Loci were assembled from the aligned and sorted reads using gstacks v.2.55 with default parameters.

F. ddRADseq data set generation
The populations program in Stacks (Catchen et al., 2011; Catchen et al., 2013) was used to generate a preliminary VCF file including only loci present in all six populations (-p 6; GLEN was not analyzed by ddRAD sequencing) and at least 70 percent of individuals within each population (-r 0.7). Exports were ordered (--ordered-export) to ensure that only a single representative of each overlapping site was included. Loci out of HWE in one or more populations were filtered out using a custom R script, as previously described. Based on this data set, five individuals (two from ACHT, one from each of the populations NIS, NORS, and RIN) with more than 10 % missing data were identified.
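The last step above, flagging individuals with more than 10 % missing data, amounts to a per-sample missingness count. The following is a minimal sketch that assumes genotypes are already available as lists of strings per individual; the data layout and function names are hypothetical and are not part of Stacks.

```python
from typing import Dict, List, Sequence


def missing_fraction(genotypes: Sequence[str]) -> float:
    """Fraction of genotype calls recorded as missing ('./.' or '.')."""
    if not genotypes:
        raise ValueError("no genotype calls supplied")
    n_missing = sum(1 for gt in genotypes if gt in ("./.", "."))
    return n_missing / len(genotypes)


def flag_high_missingness(calls: Dict[str, Sequence[str]],
                          threshold: float = 0.10) -> List[str]:
    """Return sample IDs whose missing fraction exceeds the threshold."""
    return sorted(
        sample for sample, gts in calls.items()
        if missing_fraction(gts) > threshold
    )
```

The comparison is strict, so a sample at exactly 10 % missing data is retained, in line with "more than 10 %" in the text.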
We then generated a new VCF file excluding these five individuals using the populations program with parameters as previously stated, yielding a total of 347,397 SNPs, and a second VCF file restricted to one random SNP per locus, yielding 141,157 SNPs. Both files were filtered to remove SNPs located within potentially duplicated regions of the genome (De-Kayne et al., 2020) and SNPs out of HWE in one or more populations as described for WGS data. A total of 254,693 SNPs and 105,452 SNPs,
Locations and summary statistics for Octopus vulgaris samples, including...
plos.figshare.com
xls
Updated Jun 4, 2023
Cite
Javier Quinteiro; Jorge Rodríguez-Castro; Manuel Rey-Méndez; Nieves González-Henríquez (2023). Locations and summary statistics for Octopus vulgaris samples, including estimates of haplotype (h) and nucleotide (π) diversity, mismatch distribution parameters, neutrality, and demographic expansion test based on mitochondrial control region. [Dataset]. http://doi.org/10.1371/journal.pone.0230294.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0230294.t001
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Javier Quinteiro; Jorge Rodríguez-Castro; Manuel Rey-Méndez; Nieves González-Henríquez
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Locations and summary statistics for Octopus vulgaris samples, including estimates of haplotype (h) and nucleotide (π) diversity, mismatch distribution parameters, neutrality, and demographic expansion test based on mitochondrial control region.
Data from: Dynamics of Ice Streams: A Physical Statistical Approach
access.earthdata.nasa.gov
cmr.earthdata.nasa.gov
Updated Apr 20, 2017
Cite
(2017). Dynamics of Ice Streams: A Physical Statistical Approach [Dataset]. https://access.earthdata.nasa.gov/collections/C1214586540-SCIOPS
Explore at:
Dataset updated
Apr 20, 2017
Time period covered
Jan 1, 1970 - Present
Description
Ice streams are believed to play a major role in determining the response of their parent ice sheet to climate change, and in determining global sea level by serving as regulators on the fresh water stored in the ice sheets. Ice streams are characterized by rapid, laterally confined flow which makes them uniquely identifiable within the body of the more slowly and more homogeneously flowing ice sheet. But while these characteristics enable the identification of ice streams, the processes which control ice-stream motion and evolution, and differences among ice streams in the polar regions, are only partially understood. Understanding the relative importance of lateral and basal drags, as well as the role of gradients in longitudinal stress, is essential for developing models for future evolution of the polar ice sheets. In this project, physical statistical models are used to explore the processes that control ice-stream flow, and to compare these processes between seemingly different ice-stream systems. In particular, the Northeast Ice Stream in Greenland will be investigated. Geophysical models lie at the core of the approach, but are embellished by statistical modeling of various components of variability. One important component comes from the uncertainty in observations on basal elevation, surface elevation, and surface velocity. In this project, new observational data collected using remote-sensing techniques are used. The various components, some of which are spatial, are combined hierarchically using Bayesian statistical methodology. All these are combined mathematically into a physical statistical model that yields the posterior distributions for basal and surface elevations, surface velocity fields, and stress fields, conditional on the data. Inference based on these distributions is carried out via Markov chain Monte Carlo techniques, to obtain estimates of these unknown fields along with uncertainty measures associated with them.
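To make the inference step concrete, here is a toy sketch of Markov chain Monte Carlo on a one-parameter Bayesian model: noisy observations of an unknown quantity with a Gaussian prior. It is a deliberately simplified stand-in, not the project's hierarchical ice-stream model; the data values, prior and tuning parameters are all made up for illustration.

```python
import math
import random
from typing import List, Sequence


def log_posterior(mu: float, data: Sequence[float], sigma: float,
                  prior_mean: float, prior_sd: float) -> float:
    """Unnormalised log posterior: Gaussian prior plus Gaussian likelihood."""
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)
    return log_prior + log_lik


def metropolis(data: Sequence[float], sigma: float, prior_mean: float,
               prior_sd: float, n_steps: int, step_size: float,
               seed: int = 0) -> List[float]:
    """Random-walk Metropolis sampler for the unknown mean mu."""
    rng = random.Random(seed)
    mu = prior_mean
    current = log_posterior(mu, data, sigma, prior_mean, prior_sd)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, data, sigma, prior_mean, prior_sd)
        # Accept with probability min(1, posterior ratio).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


# Hypothetical noisy observations of a single unknown quantity.
observations = [2.1, 1.9, 2.0, 2.2, 1.8]
chain = metropolis(observations, sigma=0.5, prior_mean=0.0, prior_sd=10.0,
                   n_steps=20000, step_size=0.5)
kept = chain[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
posterior_sd = math.sqrt(sum((s - posterior_mean) ** 2 for s in kept) / len(kept))
```

The retained samples approximate the posterior distribution, so their mean gives a point estimate and their spread gives the associated uncertainty, which is the role MCMC plays for the unknown fields described above.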
Index of notation for parameter values, with default values, and variables...
figshare.com
plos.figshare.com
xls
Updated Jun 17, 2023
Cite
D. B. Bonnéry; L. -S. Pretorius; A. E. C. Jooste; A. D. W. Geering; C. A. Gilligan (2023). Index of notation for parameter values, with default values, and variables used in computing an optimal sampling design for disease-free status. [Dataset]. http://doi.org/10.1371/journal.pone.0277725.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0277725.t001
Dataset updated
Jun 17, 2023
Dataset provided by
PLOS http://plos.org/
Authors
D. B. Bonnéry; L. -S. Pretorius; A. E. C. Jooste; A. D. W. Geering; C. A. Gilligan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Index of notation for parameter values, with default values, and variables used in computing an optimal sampling design for disease-free status.
Model comparison and fitness parameter outputs.
figshare.com
xls
Updated Jun 15, 2023
Cite
Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele (2023). Model comparison and fitness parameter outputs. [Dataset]. http://doi.org/10.1371/journal.pone.0264559.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0264559.t004
Dataset updated
Jun 15, 2023
Dataset provided by
PLOS ONE
Authors
Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Model comparison and fitness parameter outputs.
Description of the demographic attributes of the dataset.
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Jorge Sánchez-Garcés; Nelly Rosario Moreno-Leyva; Lorena Marténez Soto; Alex Danny Chambi-Rodriguez; Dina Milagros Tapara-Yanarico; Dennis Karlo Silva-Vargas; Himer Avila-George (2023). Description of the demographic attributes of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0279989.t006
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0279989.t006
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Jorge Sánchez-Garcés; Nelly Rosario Moreno-Leyva; Lorena Marténez Soto; Alex Danny Chambi-Rodriguez; Dina Milagros Tapara-Yanarico; Dennis Karlo Silva-Vargas; Himer Avila-George
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description of the demographic attributes of the dataset.
Survey years of each country with respective weighted sample size.
plos.figshare.com
xls
Updated Jun 5, 2023
Cite
Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele (2023). Survey years of each country with respective weighted sample size. [Dataset]. http://doi.org/10.1371/journal.pone.0264559.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0264559.t001
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Abiyu Abadi Tareke; Ermias Bekele Enyew; Bayley Adane Takele
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Survey years of each country with respective weighted sample size.