100+ datasets found

Simulation Data Set
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

Metadata (including data dictionary)
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code
Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description
“CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Optional Information (complete as necessary)
Required R packages:
• For running “CWVS_LMC.txt”:
  • msm: Sampling from the truncated normal distribution
  • mnormt: Sampling from the multivariate normal distribution
  • BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
  • plotrix: Plotting the posterior means and credible intervals

Instructions for Use
Reproducibility (Mandatory)
What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.
Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

Data
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

Permissions: These are simulated data without any identifying information or informative birth-level covariates. The weekly exposures are provided already standardized by median and IQR as described above, and the medians and IQRs themselves are not given.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
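The week-by-week standardization described for this dataset (subtract the weekly median, divide by the weekly IQR) can be sketched as follows. This is a minimal Python illustration on made-up numbers, not the authors' R code: the function names and the toy matrix are assumptions, and the listing does not say which quartile definition was used, so the "inclusive" method here is only one possible choice.

```python
from statistics import median, quantiles


def standardize_week(exposures):
    """Return (value - median) / IQR for one week's exposures."""
    # Quartile convention is an assumption; the source does not specify one.
    q1, _, q3 = quantiles(exposures, n=4, method="inclusive")
    center = median(exposures)
    return [(value - center) / (q3 - q1) for value in exposures]


def standardize_by_week(z_raw):
    """Standardize each column (week) of a rows-by-weeks exposure matrix."""
    weeks = list(zip(*z_raw))  # transpose: one tuple per week
    standardized = [standardize_week(week) for week in weeks]
    return [list(row) for row in zip(*standardized)]  # transpose back


# Hypothetical raw exposures: 5 simulated individuals, 2 exposure weeks.
z_raw = [
    [1, 10],
    [2, 20],
    [3, 30],
    [4, 40],
    [5, 50],
]
z = standardize_by_week(z_raw)
```

Because only the standardized values are released, the original medians and IQRs cannot be recovered from a matrix like `z`, which is the protective property the description relies on.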
Data from: Simulated dataset
ieee-dataport.org
Updated Apr 8, 2024
Cite
Nassim Ravanshad (2024). Simulated dataset [Dataset]. https://ieee-dataport.org/documents/simulated-dataset
Explore at:
Dataset updated
Apr 8, 2024
Authors
Nassim Ravanshad
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from: Simulated dataset
figshare.le.ac.uk
zip
Updated Feb 20, 2024
Cite
Rodrigo Quian Quiroga (2024). Simulated dataset [Dataset]. http://doi.org/10.25392/leicester.data.11897595.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.25392/leicester.data.11897595.v1
Dataset updated
Feb 20, 2024
Dataset provided by
University of Leicester
Authors
Rodrigo Quian Quiroga
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A simulated dataset that has been widely used in the evaluation of spike-sorting algorithms. Synthetic datasets are generated by adding spike waveform templates to background noise of various levels; this download contains several datasets, generated using different spike templates. Use wave_clus (see www2.le.ac.uk/centres/csn/software/wave-clus) for spike detection and sorting of this data. Wave_clus is a fast, unsupervised algorithm for spike detection and sorting that is compatible with Windows, Mac, and Linux operating systems.
Data and code for: Generation and applications of simulated datasets to...
data.niaid.nih.gov
datadryad.org
zip
Updated Mar 10, 2023
Cite
Matthew Silk; Olivier Gimenez (2023). Data and code for: Generation and applications of simulated datasets to integrate social network and demographic analyses [Dataset]. http://doi.org/10.5061/dryad.m0cfxpp7s
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.m0cfxpp7s
Dataset updated
Mar 10, 2023
Dataset provided by
Centre d'Écologie Fonctionnelle et Évolutive
Authors
Matthew Silk; Olivier Gimenez
License
https://spdx.org/licenses/CC0-1.0.html
Description
Social networks are tied to population dynamics; interactions are driven by population density and demographic structure, while social relationships can be key determinants of survival and reproductive success. However, difficulties integrating models used in demography and network analysis have limited research at this interface. We introduce the R package genNetDem for simulating integrated network-demographic datasets. It can be used to create longitudinal social networks and/or capture-recapture datasets with known properties. It incorporates the ability to generate populations and their social networks, generate grouping events using these networks, simulate social network effects on individual survival, and flexibly sample these longitudinal datasets of social associations. By generating co-capture data with known statistical relationships it provides functionality for methodological research. We demonstrate its use with case studies testing how imputation and sampling design influence the success of adding network traits to conventional Cormack-Jolly-Seber (CJS) models. We show that incorporating social network effects in CJS models generates qualitatively accurate results, but with downward-biased parameter estimates when network position influences survival. Biases are greater when fewer interactions are sampled or fewer individuals are observed in each interaction. While our results indicate the potential of incorporating social effects within demographic models, they show that imputing missing network measures alone is insufficient to accurately estimate social effects on survival, pointing to the importance of incorporating network imputation approaches. genNetDem provides a flexible tool to aid these methodological advancements and help researchers test other sampling considerations in social network studies.

Methods
The dataset and code stored here are for Case Studies 1 and 2 in the paper. Datasets were generated using simulations in R. Here we provide 1) the R code used for the simulations; 2) the simulation outputs (as .RDS files); and 3) the R code to analyse simulation outputs and generate the tables and figures in the paper.
R Code of Simulations
catalog.data.gov
cloud.csiss.gmu.edu
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). R Code of Simulations [Dataset]. https://catalog.data.gov/dataset/r-code-of-simulations
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
The sims zip file contains R code and accompanying files needed to run the R code. Overall this code demonstrates the R code used in the study is fully functional, documented, and reproducible and that this code could reproduce the simulation results from the study with sufficient computing time. The code as presented is for a single simulated dataset and will produce estimates and confidence intervals produced by all the methods used within the study when run on that one dataset. This dataset is associated with the following publication: Nethery, R., F. Mealli, J. Sacks, and F. Dominici. Evaluation of the Health Impacts of the 1990 Clean Air Act Amendments Using Causal Inference and Machine Learning. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION. Taylor & Francis Group, London, UK, 1-12, (2020).
Exponential Distribution Simulated Dataset
ieee-dataport.org
Updated Jul 17, 2024
Cite
Gabiriele Bulivou (2024). Exponential Distribution Simulated Dataset [Dataset]. https://ieee-dataport.org/documents/exponential-distribution-simulated-dataset
Explore at:
Dataset updated
Jul 17, 2024
Authors
Gabiriele Bulivou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
featuring n = 5000 data points
Simulated data
figshare.com
txt
Updated Feb 5, 2018
Cite
Vincent Guillemot (2018). Simulated data [Dataset]. http://doi.org/10.6084/m9.figshare.5854659.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5854659.v1
Dataset updated
Feb 5, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vincent Guillemot
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a simulated dataset used for testing.
Simulated dataset for I = 2.26 % and RR = 3
figshare.com
zip
Updated Jan 19, 2016
Cite
Aline Guttmann (2016). Simulated dataset for I = 2.26 % and RR = 3 [Dataset]. http://doi.org/10.6084/m9.figshare.1308494.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.1308494.v1
Dataset updated
Jan 19, 2016
Dataset provided by
figshare
Authors
Aline Guttmann
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of 221 datasets in R format (rda), each corresponding to 1000 simulations of one cluster with a relative risk of 3 for a base incidence of 2.26 % births per year. Each dataset is a table of 221 000 rows and 6 columns. Each row contains:
- the coordinates (longitude and latitude) of a SU,
- the observed number of cases,
- the size of the at-risk population (i.e., the number of live births),
- the expected number of cases in the specified SU, assuming an inhomogeneous Poisson process for the distribution of cases, and
- an indicator for the simulation, ranging from 1 to 1000.
PotSim: A Large-Scale Simulated Dataset for Benchmarking AI Techniques on...
dataverse.harvard.edu
Updated May 8, 2025
Cite
Satya Krishna Pothapragada; Rishabh Gupta; Kumar k Kumar Goel; Alina Zare; Joel Harley; Lincoln Zotarelli (2025). PotSim: A Large-Scale Simulated Dataset for Benchmarking AI Techniques on Potato Crop [Dataset]. http://doi.org/10.7910/DVN/GQMDOV
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/GQMDOV
Dataset updated
May 8, 2025
Dataset provided by
Harvard Dataverse
Authors
Satya Krishna Pothapragada; Rishabh Gupta; Kumar k Kumar Goel; Alina Zare; Joel Harley; Lincoln Zotarelli
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset funded by
FDACS (Florida Department of Agriculture and Consumer Services)
Description
PotSim is a large-scale simulated agricultural dataset specifically designed for AI-driven research on potato cultivation. This dataset is grounded in real-world crop management scenarios and extrapolated to approximately 4.9 million hypothetical crop management scenarios. It encompasses diverse factors including varying planting dates, fertilizer application rates and timings, irrigation strategies, and 24 years of weather data. The resulting dataset comprises over 675 million daily simulation records, offering an extensive and realistic framework for agricultural AI research.
simulated datasets for evaluating polygenic detection methods
data.niaid.nih.gov
zenodo.org
Updated Oct 28, 2024
Cite
Tripathi, Devashish (2024). simulated datasets for evaluating polygenic detection methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12752104
Explore at:
Dataset updated
Oct 28, 2024
Dataset authored and provided by
Tripathi, Devashish
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains simulation files corresponding to a combination of each demographic model (1/2/3), environment (linear/quadratic), selection duration (200/400/600/800/1000), and simulation replicate (1-20). This resulted in 600 simulation files, one for each unique combination of demographic model, environment, selection duration, and simulation replicate. For each individual in the genotype data file, the metadata folder contains files with the values of selective pressure (linear and quadratic environments).
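The count of 600 files follows from the four factors listed (3 × 2 × 5 × 20). A short Python sketch of that enumeration is given below; the variable names are illustrative assumptions and do not reflect the actual file naming in the archive.

```python
from itertools import product

# Factor levels as stated in the dataset description.
demographic_models = [1, 2, 3]
environments = ["linear", "quadratic"]
selection_durations = [200, 400, 600, 800, 1000]
replicates = range(1, 21)  # replicates 1 through 20

# One tuple per simulation file: (model, environment, duration, replicate).
combinations = list(
    product(demographic_models, environments, selection_durations, replicates)
)
n_files = len(combinations)
```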

The variant positions are 1-based, which is the default SLiM output. To compare the results with the causal loci, the user must make the positions 0-based (i.e., POS-1). The details are provided in a GitHub tutorial.
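The POS-1 adjustment can be sketched in a few lines of Python. The positions and causal loci below are made-up values for illustration only, and the helper name is hypothetical; the GitHub tutorial mentioned in the description remains the authoritative procedure.

```python
def to_zero_based(positions):
    """Convert 1-based variant positions to 0-based (POS - 1)."""
    return [pos - 1 for pos in positions]


# Hypothetical 1-based variant positions, as reported in the genotype data.
variant_positions = [1, 150, 2048]

# Hypothetical causal loci, already 0-based.
causal_loci = {0, 2047}

# Variants that coincide with a causal locus after the shift.
matches = [pos for pos in to_zero_based(variant_positions) if pos in causal_loci]
```

Comparing without the shift would silently miss every match or pair each variant with the neighbouring site, so the conversion is worth doing once, immediately after loading.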

Please refer to the documentation for a detailed description of the files and folder structure.

The article describing the simulated data and its application is accepted for publication in Nucleic Acids Research (https://doi.org/10.1093/nar/gkae1027).
Simulation dataset for research project Metropolis 2
data.4tu.nl
zip
Updated Mar 31, 2022
Cite
Andrei Badea; Andres Morfin Veytia; Joost Ellerbroek (2022). Simulation dataset for research project Metropolis 2 [Dataset]. http://doi.org/10.4121/19323263.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/19323263.v1
Dataset updated
Mar 31, 2022
Dataset provided by
4TU.ResearchData
Authors
Andrei Badea; Andres Morfin Veytia; Joost Ellerbroek
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
European Commission
Description
Data produced by simulating traffic scenarios using the BlueSky Open Air Traffic Simulator. The dataset was generated by applying three ATM operational concepts to urban airspace traffic scenarios: decentralised, hybrid and centralised.

The dataset consists of logs of information gathered during the simulations.
Data from: Simulated Dataset
data.mendeley.com
Updated Jul 22, 2024
Cite
Yang Pan (2024). Simulated Dataset [Dataset]. http://doi.org/10.17632/ts6cbgw9fg.1
Explore at:
Unique identifier
https://doi.org/10.17632/ts6cbgw9fg.1
Dataset updated
Jul 22, 2024
Authors
Yang Pan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Simulated Dataset
NODE simulated data
figshare.com
zip
Updated Jan 22, 2025
Cite
Zedong Wang (2025). NODE simulated data [Dataset]. http://doi.org/10.6084/m9.figshare.28252061.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28252061.v1
Dataset updated
Jan 22, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Zedong Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NODE simulated dataset.
Data from: Simulated Well Production Data using a Transient Well Model and a...
data.niaid.nih.gov
zenodo.org
Updated Nov 17, 2023
Cite
AlHammad, Yousef K. (2023). Simulated Well Production Data using a Transient Well Model and a Developed Simulator [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8128888
Explore at:
Dataset updated
Nov 17, 2023
Dataset authored and provided by
AlHammad, Yousef K.
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a simulated dataset of transient well production data. This dataset was used in my Master's thesis at King Abdullah University of Science and Technology (KAUST), and it is shared for academic use and research work.

The dataset has 100 wells simulated at time steps of 0.2 hours for an entire year. This gives 43,800 observations per well, and a grand total of 4,380,000 observations in the entire dataset. The resulting production data is then perturbed with systemic and random gauge errors to better simulate real-world gauge readings.
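The stated observation counts can be checked with simple arithmetic. The sketch below assumes a 365-day (non-leap) year, which is what reproduces the figure of 43,800 observations per well; the constant names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24   # assumption: a 365-day year, i.e. 8,760 hours
TIME_STEP_HOURS = 0.2       # time step stated in the description
N_WELLS = 100               # number of simulated wells

# round() guards against floating-point error in the division by 0.2.
observations_per_well = round(HOURS_PER_YEAR / TIME_STEP_HOURS)
total_observations = observations_per_well * N_WELLS
```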

The simulator code used to generate this dataset can be found at: https://github.com/ykh-1992/TransientNodalAnalysis.jl

The data consists of three files:
- "wells.csv": This file details the input parameters for each simulated well.
- "data.zip": This file houses an 850 MB "data.csv" that includes the simulated well production data.
- "auxiliary.csv": This file includes information related to the simulation run.
Simulated dataset from 'Quantifying the causal pathways contributing to...
datadryad.org
search.datacite.org
zip
Updated Sep 8, 2020
Cite
Jonathan Henshaw (2020). Simulated dataset from 'Quantifying the causal pathways contributing to natural selection' [Dataset]. http://doi.org/10.5061/dryad.j0zpc86c8
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.j0zpc86c8
Dataset updated
Sep 8, 2020
Dataset provided by
Dryad
Authors
Jonathan Henshaw
Time period covered
2020
Description
The following files are included:

The code used to generate the dataset (written for Wolfram Mathematica version 12.1.0.0)

The code used to analyse the causal structure of selection in the dataset (written for R version 1.1.456).
Data from: Simulated dataset
dataone.org
Updated Sep 24, 2024
Cite
Tellaroli, Paola (2024). Simulated dataset [Dataset]. http://doi.org/10.7910/DVN/OLGPT6
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/OLGPT6
Dataset updated
Sep 24, 2024
Dataset provided by
Harvard Dataverse
Authors
Tellaroli, Paola
Description
Simulated data with max variability cited in 'Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters'
Simulated Dataset for Testing - Dataset - LDM
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Simulated Dataset for Testing - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/simulated-dataset-for-testing
Explore at:
Dataset updated
Dec 16, 2024
Description
The dataset used in the paper is a simulated dataset for testing the proposed algorithms.
Normal Distribution Simulated Dataset 1
ieee-dataport.org
Updated Apr 25, 2022
Cite
Gabiriele Bulivou (2022). Normal Distribution Simulated Dataset 1 [Dataset]. https://ieee-dataport.org/documents/normal-distribution-simulated-dataset-1
Explore at:
Dataset updated
Apr 25, 2022
Authors
Gabiriele Bulivou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset consists of simulated normally distributed data with n = 500 data points, mean = 80, and standard deviation = 2.
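A comparable sample can be drawn with Python's standard library, as sketched below. This does not reproduce the published file: the generator, the seed, and the function name are assumptions, and the listing does not say how the original data were produced.

```python
import random
from statistics import mean, stdev


def simulate_normal(n=500, mu=80.0, sigma=2.0, seed=0):
    """Draw n values from a normal distribution with mean mu and SD sigma."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return [rng.gauss(mu, sigma) for _ in range(n)]


sample = simulate_normal()
sample_mean = mean(sample)  # should be close to 80
sample_sd = stdev(sample)   # should be close to 2
```

With n = 500 the sample mean and standard deviation will sit near, but not exactly at, the nominal parameters, which is expected sampling variation.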
CMAPSS Jet Engine Simulated Data - Dataset - NASA Open Data Portal
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Oct 15, 2008
Cite
nasa.gov (2008). CMAPSS Jet Engine Simulated Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data
Explore at:
Dataset updated
Oct 15, 2008
Dataset provided by
NASA (http://nasa.gov/)
Description
Data sets consist of multiple multivariate time series. Each data set is further divided into training and test subsets. Each time series is from a different engine, i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with different degrees of initial wear and manufacturing variation, which are unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data. The data is contaminated with sensor noise. The engine is operating normally at the start of each time series, and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. In the test set, the time series ends some time prior to system failure. The objective of the competition is to predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles after the last cycle that the engine will continue to operate. Also provided is a vector of true Remaining Useful Life (RUL) values for the test data.

The data are provided as a zip-compressed text file with 26 columns of numbers, separated by spaces. Each row is a snapshot of data taken during a single operational cycle; each column is a different variable. The columns correspond to:
1) unit number
2) time, in cycles
3) operational setting 1
4) operational setting 2
5) operational setting 3
6) sensor measurement 1
7) sensor measurement 2
...
26) sensor measurement 21

Data Set: FD001
Train trajectories: 100
Test trajectories: 100
Conditions: ONE (Sea Level)
Fault Modes: ONE (HPC Degradation)

Data Set: FD002
Train trajectories: 260
Test trajectories: 259
Conditions: SIX
Fault Modes: ONE (HPC Degradation)

Data Set: FD003
Train trajectories: 100
Test trajectories: 100
Conditions: ONE (Sea Level)
Fault Modes: TWO (HPC Degradation, Fan Degradation)

Data Set: FD004
Train trajectories: 248
Test trajectories: 249
Conditions: SIX
Fault Modes: TWO (HPC Degradation, Fan Degradation)

Reference: A. Saxena, K. Goebel, D. Simon, and N. Eklund, ‘Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation’, in the Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver CO, Oct 2008.
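The row layout described for the CMAPSS files (26 space-separated columns: unit number, cycle, three operational settings, then sensor readings) can be parsed as sketched below. The column names are hypothetical labels, and 26 columns with five leading fields implies 21 sensor columns. The RUL helper assumes the common convention for run-to-failure training data, where remaining life is the unit's last recorded cycle minus the current cycle; the description does not prescribe this.

```python
# Hypothetical column labels; the files themselves carry no header row.
COLUMN_NAMES = (
    ["unit", "cycle", "setting_1", "setting_2", "setting_3"]
    + [f"sensor_{i}" for i in range(1, 22)]
)


def parse_cmapss_line(line):
    """Parse one space-separated row into a dict keyed by column label."""
    values = line.split()
    if len(values) != len(COLUMN_NAMES):
        raise ValueError(
            f"expected {len(COLUMN_NAMES)} columns, got {len(values)}"
        )
    row = dict(zip(COLUMN_NAMES, (float(v) for v in values)))
    row["unit"] = int(row["unit"])
    row["cycle"] = int(row["cycle"])
    return row


def training_rul(rows):
    """Remaining cycles per row, assuming each unit's last cycle is failure."""
    last_cycle = {}
    for row in rows:
        unit = row["unit"]
        last_cycle[unit] = max(last_cycle.get(unit, 0), row["cycle"])
    return [last_cycle[row["unit"]] - row["cycle"] for row in rows]


# Made-up rows for one engine over three cycles (all readings set to 0.5).
example_lines = [
    " ".join(["1", str(cycle)] + ["0.5"] * 24) for cycle in (1, 2, 3)
]
example_rows = [parse_cmapss_line(line) for line in example_lines]
example_rul = training_rul(example_rows)
```

For the test subsets this helper does not apply, since those series stop before failure; the separately provided vector of true RUL values is needed instead.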
Simulated dataset for Olley&Pakes
kaggle.com
Updated Oct 16, 2023
Cite
Anton Morozov (2023). Simulated dataset for Olley&Pakes [Dataset]. https://www.kaggle.com/datasets/antmorozov/simulated-dataset-for-olley-and-pakes
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 16, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Anton Morozov
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Anton Morozov

Released under Apache 2.0

Contents
