Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parameter Settings of Synthetic Data Generation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifacts for the paper "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?".
This artifact repository contains 9 compressed folders, as follows:
| ID | File Name | Description |
|---|---|---|
| 1 | syn_circa.zip | CIRCA10 and CIRCA50 datasets for Causal Discovery |
| 2 | syn_rcd.zip | RCD10 and RCD50 datasets for Causal Discovery |
| 3 | syn_causil.zip | CausIL10 and CausIL50 datasets for Causal Discovery |
| 4 | rca_circa.zip | CIRCA10 and CIRCA50 datasets for RCA |
| 5 | rca_rcd.zip | RCD10 and RCD50 datasets for RCA |
| 6 | online-boutique.zip | Online Boutique dataset for RCA |
| 7 | sock-shop-1.zip | Sock Shop 1 dataset for RCA |
| 8 | sock-shop-2.zip | Sock Shop 2 dataset for RCA |
| 9 | train-ticket.zip | Train Ticket dataset for RCA |
Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).
Details about the generation of our datasets
We use three synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: the CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:

1. The CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node is generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.
2. The RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.
3. The CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator cannot inject faults.

To create our synthetic datasets, we first generate 10 DAGs whose nodes range from 10 to 50 for each synthetic data generator. Next, we generate fault-free datasets from these DAGs with different random seeds, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., syn_rcd, syn_circa) are used to evaluate causal discovery methods, while the faulty datasets (e.g., rca_rcd, rca_circa) are used to assess RCA methods.
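To make the CIRCA-style mechanism concrete, the following is a minimal, illustrative Python sketch, not the authors' generator: it samples a random DAG, simulates VAR-style data in which each node depends on its parents' previous values, and injects a fault by inflating one node's noise term for two timestamps. All function names, parameters, and constants below are hypothetical choices for illustration.

```python
# Illustrative sketch only; not the CIRCA authors' implementation.
import numpy as np
import networkx as nx

def random_dag(n_nodes, n_edges, rng):
    """Sample a random DAG by always orienting edges from lower to higher node index."""
    g = nx.DiGraph()
    g.add_nodes_from(range(n_nodes))
    while g.number_of_edges() < n_edges:
        u, v = rng.choice(n_nodes, size=2, replace=False)
        g.add_edge(min(u, v), max(u, v))  # edges point "forward", so the graph stays acyclic
    return g

def simulate_var(dag, n_steps, fault_node, fault_times, rng, fault_scale=10.0):
    """Each node is a weighted sum of its parents at t-1 plus Gaussian noise.
    The fault inflates the noise of `fault_node` at the given timestamps."""
    n = dag.number_of_nodes()
    weights = {e: rng.uniform(0.5, 2.0) for e in dag.edges}
    x = np.zeros((n_steps, n))
    for t in range(1, n_steps):
        for v in dag.nodes:
            noise_sd = fault_scale if (v == fault_node and t in fault_times) else 1.0
            parents = list(dag.predecessors(v))
            x[t, v] = sum(weights[(p, v)] * x[t - 1, p] for p in parents) \
                      + rng.normal(0.0, noise_sd)
    return x

rng = np.random.default_rng(0)
dag = random_dag(n_nodes=10, n_edges=20, rng=rng)
data = simulate_var(dag, n_steps=1000, fault_node=3, fault_times={500, 501}, rng=rng)
```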
We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted by AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to explore all services with 100 to 200 users concurrently. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the Figure below.
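As an illustration of the metric-collection step, the sketch below pulls a per-service metric over a time window from Prometheus' standard /api/v1/query_range HTTP endpoint. The Prometheus URL, namespace label, and PromQL expression are assumptions for illustration, not the exact queries used for these datasets.

```python
# Illustrative sketch of pulling service-level metrics from Prometheus;
# the URL and PromQL query are placeholders, not this artifact's exact setup.
import requests

PROMETHEUS_URL = "http://prometheus.example:9090"  # hypothetical endpoint

def query_range(promql, start, end, step="15s"):
    """Query Prometheus' range API and return the list of time series."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example: per-container CPU usage exported by cAdvisor (metric name is standard;
# the label filters depend on the cluster configuration).
series = query_range(
    'rate(container_cpu_usage_seconds_total{namespace="sock-shop"}[1m])',
    start=1700000000, end=1700003600,
)
```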
Code
The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.
References
As in our paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
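The Causal Datasheet tool itself is not included here. As a rough illustration of the underlying workflow (simulate data from a known Bayesian network at several sample sizes, re-learn the structure, and score how well the true edges are recovered), the following sketch uses the pgmpy library with a toy three-node network. The network, sample sizes, and the use of hill-climbing with a BIC score are assumptions standing in for the OrderMCMC/qNML procedure described in the abstract.

```python
# Rough illustration with pgmpy; the toy network and sample sizes are assumptions.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.sampling import BayesianModelSampling
from pgmpy.estimators import HillClimbSearch, BicScore

# Ground-truth DAG: A -> B -> C
truth = BayesianNetwork([("A", "B"), ("B", "C")])
truth.add_cpds(
    TabularCPD("A", 2, [[0.6], [0.4]]),
    TabularCPD("B", 2, [[0.7, 0.2], [0.3, 0.8]], evidence=["A"], evidence_card=[2]),
    TabularCPD("C", 2, [[0.9, 0.1], [0.1, 0.9]], evidence=["B"], evidence_card=[2]),
)

for n in (100, 500, 2000):  # candidate sample sizes for a "datasheet"-style table
    data = BayesianModelSampling(truth).forward_sample(size=n)
    learned = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
    recovered = set(learned.edges()) & set(truth.edges())
    print(n, f"recovered {len(recovered)}/{len(truth.edges())} true edges")
```

A table of such recovery rates across sample sizes is, in spirit, what a Causal Datasheet would report for a planned study.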
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models
A comprehensive benchmark framework designed to rigorously evaluate state-of-the-art causal discovery algorithms for dynamical systems.
Key Features
1️⃣ Large-Scale Benchmark. Systematically evaluate state-of-the-art causal discovery algorithms on thousands of graph challenges with increasing difficulty. 2️⃣ Customizable Data Generation. Scalable, user-friendly… See the full description on the dataset page: https://huggingface.co/datasets/kausable/CausalDynamics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains the Multimodal3DIdent dataset introduced in the paper Identifiability Results for Multimodal Contrastive Learning presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between image and text modalities. The training, validation, and test sets contain 125000, 10000, and 10000 image/text pairs and ground truth factors, respectively. The code for the data generation is publicly available: https://github.com/imantdaunhawer/Multimodal3DIdent.
Description
------------------
The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:
.
├── images
│ ├── 000000.png
│ ├── 000001.png
│ └── etc.
├── text
│ └── text_raw.txt
├── latents_image.csv
└── latents_text.csv
The directories images and text contain the generated image and text data, whereas the CSV files latents_image.csv and latents_text.csv contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file text_raw.txt is the sentence that corresponds to the first image in the images directory.
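A minimal loading sketch for one split, assuming the directory layout above; the split directory name ("train") is an assumption, and the index-wise pairing follows the description of the files:

```python
# Minimal loading sketch for one split; paths follow the layout described above.
import os
import pandas as pd

split_dir = "train"  # assumed split directory name
latents_image = pd.read_csv(os.path.join(split_dir, "latents_image.csv"))
latents_text = pd.read_csv(os.path.join(split_dir, "latents_text.csv"))
with open(os.path.join(split_dir, "text", "text_raw.txt")) as f:
    sentences = [line.rstrip("\n") for line in f]

# Index-wise correspondence: row i of the latents, sentence i, and image i belong together.
i = 0
image_path = os.path.join(split_dir, "images", f"{i:06d}.png")
print(image_path, sentences[i])
print(latents_image.iloc[i])
```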
Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor.
| Modality | Latent Factor | Values | Details |
|---|---|---|---|
| Image | Object shape | {0, 1, ..., 6} | Mapped to Blender shapes like "Teapot", "Hare", etc. |
| Image | Object x-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object y-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object z-position | {0} | Constant |
| Image | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object beta-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Spotlight position | [0, 1]-interval | Transformed to a unique position on a semicircle |
| Image | Spotlight color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Background color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Text | Object shape | {0, 1, ..., 6} | Mapped to strings like "teapot", "hare", etc. |
| Text | Object x-position | {0, 1, 2} | Mapped to strings "left", "center", "right" |
| Text | Object y-position | {0, 1, 2} | Mapped to strings "top", "mid", "bottom" |
| Text | Object color | string values | Color names from 3 different color palettes |
| Text | Text phrasing | {0, 1, ..., 4} | Mapped to 5 different English sentences |
Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.
Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.
Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is “left”, we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.
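As a compact illustration of the dependence structure described above, the sketch below samples the shared factors and draws the object hue from the x-position-specific third of [0, 1]. It is a simplified re-implementation of the sampling scheme, not the released generation code, and factor names are chosen for readability.

```python
# Simplified sketch of the latent sampling scheme; not the released generation code.
import numpy as np

rng = np.random.default_rng(0)

def sample_latents():
    shape = rng.integers(0, 7)      # object shape/class, shared between modalities
    x_pos = rng.integers(0, 3)      # 0="left", 1="center", 2="right"; shared
    y_pos = rng.integers(0, 3)      # shared
    # Object hue depends causally on x-position: hue is drawn from [x/3, (x+1)/3)
    hue = rng.uniform(x_pos / 3, (x_pos + 1) / 3)
    rotations = rng.uniform(0, 1, size=3)            # alpha/beta/gamma, image-only
    spot_pos, spot_hue, bg_hue = rng.uniform(0, 1, size=3)
    return dict(shape=shape, x_pos=x_pos, y_pos=y_pos, hue=hue,
                rotations=rotations, spotlight=(spot_pos, spot_hue),
                background_hue=bg_hue)

print(sample_latents())
```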
Acknowledgements
-------------------------------
The Multimodal3DIdent dataset builds on the following resources:
- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite
https://www.verifiedmarketresearch.com/privacy-policy/
Causal AI Market size was valued at USD 11.77 Million in 2024 and is projected to reach USD 256.73 Million by 2031, growing at a CAGR of 47.1% during the forecast period 2024-2031.
Causal AI, also known as causal artificial intelligence, is a significant innovation in the fields of artificial intelligence and machine learning that focuses on identifying and harnessing cause-and-effect relationships in data. Traditional AI models generally use correlation-based methods to detect patterns and generate predictions. While these methods can be quite useful in specific applications, they frequently fall short in situations where understanding the underlying causal mechanisms is critical. Causal AI addresses this limitation by incorporating principles from causal inference, a branch of statistics and philosophy that investigates how to infer causal relationships from data.
Causal AI is a major advance in the field of artificial intelligence, allowing us to go beyond correlation to discover the true drivers of observed outcomes. Its applications are broad and diverse, including healthcare, finance, marketing, policymaking, operations, education, the environment, and social sciences. Causal AI improves decision-making and enables the development of focused solutions to complex problems by offering a richer grasp of causality.
Causal AI has the potential to transform a wide range of domains by providing more precise and actionable insights than typical machine learning models. Causal AI differs from traditional AI in that it focuses on understanding the cause-and-effect relationships underlying data rather than correlations and patterns alone. This shift from correlation to causation is a major step forward, with the potential to improve decision-making processes, produce better forecasts, and optimize outcomes across a variety of industries, including healthcare, finance, and marketing.
We present a synthetic Medicare claims dataset linked to environmental exposures and potential confounders. In most environmental health studies relying on claims data, data restrictions exist and the data cannot be shared publicly. The Centers for Medicare and Medicaid Services (CMS) has generated synthetic, publicly available Medicare claims data for 2008-2010. In this dataset, we link the 2010 synthetic Medicare claims data to environmental exposures and potential confounders. We aggregated the 2010 synthetic Medicare claims data to the county level. Data is compiled for the contiguous United States, which in 2010 included 3,109 counties. We merged the synthetic Medicare claims data with air pollution exposure data, specifically with estimates of PM2.5 exposure obtained from Di et al., 2019, 2021, which provide daily and annual estimates of PM2.5 exposure at 1 km x 1 km grid cells in the contiguous United States. We use the Census Bureau (United States Census Bureau, 2021), the Centers for Disease Control and Prevention (CDC, 2021), and GridMET (Abatzoglou, 2013) to obtain data on potential confounders. The mortality rate, as the outcome, was computed using the synthetic Medicare data (CMS, 2021). We use the average of surrounding counties to impute missing observations, except in the case of the CDC confounders, where we imputed missing values by generating a normal distribution for each state and randomly drawing from this distribution. The steps for generating the merged dataset are provided in the NSAPH Synthetic Data GitHub repository (https://github.com/NSAPH/synthetic_data). Analytic inferences based on this synthetic dataset should not be made. The aggregated dataset is composed of 46 columns and 3,109 rows.
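A schematic of the county-level merge and the neighbor-average imputation described above. All file paths, column names, and the adjacency structure are hypothetical; the generation steps themselves are documented in the linked repository.

```python
# Schematic of the county-level merge and imputation; paths and column names are hypothetical.
import pandas as pd

claims = pd.read_csv("medicare_synthetic_2010_county.csv")   # one row per county (FIPS code)
pm25 = pd.read_csv("pm25_2010_county.csv")                    # annual PM2.5 per county
confounders = pd.read_csv("census_cdc_gridmet_county.csv")

merged = (claims.merge(pm25, on="fips", how="left")
                .merge(confounders, on="fips", how="left"))

# Neighbor-average imputation: fill a county's missing value with the mean of
# its surrounding counties. The adjacency mapping is assumed to be available.
adjacency = {}  # {fips: [neighbor_fips, ...]}, not constructed here

def impute_neighbor_mean(df, col, adjacency):
    for idx, row in df[df[col].isna()].iterrows():
        neighbors = df[df["fips"].isin(adjacency.get(row["fips"], []))][col]
        df.loc[idx, col] = neighbors.mean()
    return df
```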
Pedigree of all data and processing included in the manuscript. Open the zip file, then access the pedigree folder for the file describing all other folders, links, and the data dictionary. Items:
- NOTES: Description of work and other worksheets.
- Pedigree: Summary of source files used to create figures and tables.
- DataFiles: Data files used in the R code for creating the figures and tables.
- DataDictionary: Data file titles in all data files.
- Data: Data file uploaded to Science Hub.
- Output: Files generated from R scripts.
- Plot: Plots generated from R scripts and other software.
- R_Scripts: Clean R scripts used to analyze the data and generate figures and tables.
- Result: Tables generated from R scripts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigated the effects of AI-powered online shopping service attributes and Generation Z consumer characteristics on the attitudes and behaviors of Generation Z consumers. Using focus groups and inductive analysis, six key attributes of AI-powered digital assistance were identified from the perspective of Generation Z consumers. Rough Set Analysis was applied to establish causal relationships among the conditional attributes of AI-powered digital assistance, consumer characteristics, and Generation Z consumers’ attitudes and behaviors, resulting in ten decision-making rules. The findings extend academic research on the response mechanisms of Generation Z consumers to novel technology services and provide managers strategic insights for improving online services.
Complex interactions among multiple abiotic and biotic drivers result in rapid changes in ecosystems worldwide. Predicting how specific interactions can cause ripple effects, potentially resulting in abrupt shifts in ecosystems, is of high relevance to policymakers but difficult to quantify using data from singular cases. We present causalizeR (https://github.com/fjmurguzur/causalizeR), a text-processing algorithm that extracts causal relations from literature based on simple grammatical rules, allowing evidence in unstructured texts to be synthesized in a structured manner. The algorithm extracts causal links using the position of nouns relative to the keyword of choice to identify the cause and effect of interest. The resulting database can be combined with network analysis tools to estimate the direct and indirect effects of multiple drivers at the network level, which is useful for synthesizing available knowledge and for hypothesis creation and testing. We illustrate the use of the algorithm by detecting causal relationships in the scientific literature on the tundra ecosystem.
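causalizeR itself is an R package; as a rough, language-independent illustration of the positional rule described above, the Python sketch below takes the nearest noun before a keyword as the cause and the nearest noun after it as the effect, using spaCy for part-of-speech tags. The specific nearest-noun heuristic is an assumption approximating the idea, not a port of the package.

```python
# Rough illustration of a positional cause-effect extraction rule; not a port of causalizeR.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed

def extract_causal_link(sentence, keyword="increase"):
    """Return (cause, effect) as the nearest nouns before/after the keyword, if any."""
    doc = nlp(sentence)
    key_idx = next((t.i for t in doc if t.lemma_ == keyword or t.text == keyword), None)
    if key_idx is None:
        return None
    nouns_before = [t for t in doc[:key_idx] if t.pos_ in ("NOUN", "PROPN")]
    nouns_after = [t for t in doc[key_idx + 1:] if t.pos_ in ("NOUN", "PROPN")]
    if not nouns_before or not nouns_after:
        return None
    return nouns_before[-1].text, nouns_after[0].text

print(extract_causal_link("Warming increases shrub cover in the tundra."))
```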
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If a medium has a monopoly in covering political news and daily distorts the news in favor of the ruling autocrat, how large will the persuasion effect be? Through which channels will such persuasion operate most? Working with a representative sample of the Russian population, I use a causal mediation analysis to figure out whether (1) frequency of exposure and/or (2) reliance on biased reporting mediate the link between how people voted for incumbent elites and how they evaluate these elites in the present. Perceiving explicitly biased information as credible transmits a large and robust effect from voting to evaluation, while frequent exposure to this information produces an insignificant mediating effect. Another important finding is that the effect of perceived news credibility overrides the effect of electoral support: accepting state propaganda as credible information converts people into regime supporters regardless of their previous voting.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Propensity score methods are a widely recommended approach to adjust for confounding and to recover treatment effects with non-experimental, single-level data. This article reviews propensity score weighting estimators for multilevel data in which individuals (level 1) are nested in clusters (level 2) and nonrandomly assigned to either a treatment or control condition at level 1. We address the choice of a weighting strategy (inverse probability weights, trimming, overlap weights, calibration weights) and discuss key issues related to the specification of the propensity score model (fixed-effects model, multilevel random-effects model) in the context of multilevel data. In three simulation studies, we show that estimates based on calibration weights, which prioritize balancing the sample distribution of level-1 and (unmeasured) level-2 covariates, should be preferred under many scenarios (i.e., treatment effect heterogeneity, presence of strong level-2 confounding) and can accommodate covariate-by-cluster interactions. However, when level-1 covariate effects vary strongly across clusters (i.e., under random slopes), and this variation is present in both the treatment and outcome data-generating mechanisms, large cluster sizes are needed to obtain accurate estimates of the treatment effect. We also discuss the implementation of survey weights and present a real-data example that illustrates the different methods.
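To make the weighting strategies concrete, the sketch below computes inverse probability weights, a simple trimming rule, and overlap weights from estimated propensity scores for a level-1 treatment. The single-level logistic regression and the toy data are stand-ins for the fixed-effects or multilevel propensity models discussed in the article, and the calibration-weighting approach it recommends is not shown here.

```python
# Illustrative computation of IPW, trimming, and overlap weights from propensity scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                        # level-1 covariates (toy data)
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # treatment assignment

ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]  # estimated propensity scores

# Inverse probability weights (ATE): 1/e(x) for treated units, 1/(1-e(x)) for controls
ipw = np.where(T == 1, 1 / ps, 1 / (1 - ps))

# Trimming: drop units with extreme propensity scores before weighting
keep = (ps > 0.05) & (ps < 0.95)

# Overlap weights: 1-e(x) for treated units, e(x) for controls; bounded by construction
overlap = np.where(T == 1, 1 - ps, ps)
```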
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this paper, we introduce the causal forests method (Athey et al., 2019) and illustrate how to apply it in the social sciences to address treatment effect heterogeneity. Compared with existing parametric methods such as the multiplicative interaction model and traditional semi-/non-parametric estimation, causal forests are more flexible for complex data generating processes. Specifically, causal forests allow for nonparametric estimation and inference on heterogeneous treatment effects in the presence of many moderators. To demonstrate their usefulness, we revisit existing studies in political science and economics. We uncover new information hidden by the original estimation strategies while producing findings that are consistent with conventional methods. Through these replication efforts, we provide a step-by-step practical guide for applying causal forests to evaluate treatment effect heterogeneity.
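As a brief illustration of the kind of estimator discussed (not the authors' replication code), the sketch below fits a causal forest on simulated data with the econml package and recovers effects that vary with a moderator; the data-generating process and settings are assumptions for illustration.

```python
# Illustrative causal forest fit with econml on toy data; not the paper's replication code.
import numpy as np
from econml.dml import CausalForestDML

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))              # moderators
W = rng.normal(size=(n, 3))              # additional controls
T = rng.binomial(1, 0.5, size=n)         # randomized binary treatment
tau = 1.0 + X[:, 0]                      # true heterogeneous effect depends on X[:, 0]
Y = tau * T + W[:, 0] + rng.normal(size=n)

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X, W=W)

cate = est.effect(X)                     # conditional average treatment effect estimates
print(cate[:5], tau[:5])                 # estimates should track the true effects
```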
Simulation Files
Sims900inds_L3Thresh250i.csv is the CDPOP input file. C2C2_900Inds64PixelsIDXY.csv is a file of individual locations used in CDPOP. CD3_900Inds64Pixels_R20.csv is an input file of cost distances with b1=1 and b2=20. GDmatrix.csv is an output file from the 100th generation simulated by CDPOP.
Archive.zip
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
With an unrepresentative sample, the estimate of a causal effect may fail to characterize how effects operate in the population of interest. What is less well understood is that conventional estimation practices for observational studies may produce the same problem even with a representative sample. Causal effects estimated via multiple regression differentially weight each unit's contribution. The "effective sample" that regression uses to generate the estimate may bear little resemblance to the population of interest, and the results may be nonrepresentative in a manner similar to what quasi-experimental methods or experiments with convenience samples produce. There is no general external validity basis for preferring multiple regression on representative samples over quasi-experimental or experimental methods. We show how to estimate the "multiple regression weights" that allow one to study the effective sample. We discuss alternative approaches that, under certain conditions, recover representative average causal effects. The requisite conditions cannot always be met.
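The abstract does not spell out the estimator, but the usual construction of such weights in this literature takes each unit's weight to be the squared residual from regressing the treatment on the other covariates. A minimal sketch under that assumption:

```python
# Minimal sketch of effective-sample (regression) weights under the usual construction:
# w_i is the squared residual from regressing the treatment on the other covariates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                 # covariates
D = 0.8 * X[:, 0] + rng.normal(size=n)      # treatment, partly explained by X

resid = sm.OLS(D, sm.add_constant(X)).fit().resid
w = resid ** 2
w_norm = w / w.sum()                        # each unit's share of the effective sample

# Compare the nominal sample with the effective sample on a covariate of interest
print("nominal mean of X1:", X[:, 0].mean())
print("effective-sample mean of X1:", np.average(X[:, 0], weights=w_norm))
```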
Data and Matlab code used to produce figures in "A causal role for right frontopolar cortex in directed, but not random, exploration". Raw data is in TMS_horizonTask.csv. Each row corresponds to a single game; each column corresponds to a separate variable:
* expt_name - stimulation condition, "vertex" or "RFPC"
* replicationFlag - 0 for the first set of subjects, 1 for the second set
* subjectID - subject number
* order - stimulation order
* age - participant age in years
* iswoman - participant gender, 1 for female, 0 for male
* sessionNumber - 1 or 2
* game - game number in the experiment
* gameLength - number of trials in this game, including four forced trials
* uc - uncertainty condition, number of times option 2 is played in the forced trials
* m1 - true mean of option 1
* m2 - true mean of option 2
* r1, r2, etc. - reward outcome on each trial; nan if no outcome (e.g. on trial 6 in horizon 1 games)
* c1, c2, etc. - choice on trial t, 1 for left, 2 for right
* rt1, rt2, etc. - reaction time on trial t in seconds
To generate the figures from the paper, run main_TMSanalysis_v3.m
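The analysis code is in Matlab; for readers working in Python, a minimal loading sketch for the per-game table described above (only columns named in the data dictionary are used):

```python
# Minimal loading sketch for the per-game table; column names follow the data dictionary above.
import pandas as pd

df = pd.read_csv("TMS_horizonTask.csv")
print(df.groupby(["expt_name", "gameLength"]).size())   # games per stimulation condition and horizon
print(df[["subjectID", "uc", "m1", "m2"]].head())       # uncertainty condition and true option means
```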
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this paper we examine how children's time allocation affects their accumulation of cognitive skill. Children's time allocation is endogenous in a model of skill production since it is chosen by parents and children. We apply a recently developed test of exogeneity to search for specifications that yield causal estimates of the impact time inputs have on child skills. The test exploits bunching in time inputs induced by a nonnegativity time constraint and it has power to detect a variety of sources of endogeneity. We find that with a sufficiently rich set of controls we are unable to reject exogeneity in our most detailed production function specifications. The estimates from these specifications indicate that active time with adult family members, such as parents and grandparents, are the most productive in generating cognitive skill.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Direct, indirect, total and marginal effects for Fig. 4. (XLSX 94 kb)
Traditionally, a pedigree-based individual-tree mixed model (ABLUP) has been used in forest genetic evaluations to identify individuals with the highest breeding values (BVs). ABLUP is a Markovian causal model, as any individual BV can be expressed as a linear regression on its parental BVs. The regression coefficients are based on the genealogical parent-offspring relationship and are equal to one-half. This study aimed to develop and apply two new causal models that replace these fixed coefficients with ones calculated using genomic information, specifically derived from the genomic-based relationship matrix. We compared the performance of these genomic-based causal models with ABLUP and non-causal GBLUP models. To do so, we evaluated a four-generation population of Eucalyptus grandis, consisting of 3,082 genotyped trees with 14,033 single nucleotide polymorphism markers. Six traits were assessed in 1,219 trees across the first three breeding cycles. The heritability and genetic means...

# Forest tree breeding using genomic Markov causal models: A new approach to genomic tree breeding improvement
https://doi.org/10.5061/dryad.pzgmsbczh
GENERAL INFORMATION
1. Title of Dataset: Forest tree breeding using genomic Markov causal models: A new approach to genomic tree breeding improvement
2. Author Information
A. Principal Investigator Contact Information
Name: Esteban Javier Jurcic
Institution: Instituto Nacional de Tecnología Agropecuaria (INTA)
Address: De Los Reseros y Dr. Nicolás Repetto s/n, 1686, Hurlingham, Buenos Aires, Argentina.
Email: jurcic.esteban@inta.gob.ar
B. Associate or Co-investigator Contact Information
Name: Eduardo Pablo Cappa
Institution: Instituto Nacional de Tecnología Agropecuaria (INTA) - CONICET
Address: De Los Reseros y Dr. Nicolás Repetto s/n, 1686, Hurlingham, Buenos Aires, Argentina.
Email: [cappa.eduar...,
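Referring back to the abstract above: the causal models described there replace the fixed pedigree-based parent-offspring coefficients (one-half) with coefficients derived from a genomic relationship matrix. As a minimal sketch of the kind of matrix involved, the following computes a VanRaden-type G matrix from 0/1/2 genotype calls on toy data; it is an illustration of the standard construction, not the study's pipeline.

```python
# Minimal sketch of a VanRaden-type genomic relationship matrix (G) from 0/1/2 genotypes.
# Toy data only; not the study's pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_snps = 100, 500
M = rng.integers(0, 3, size=(n_trees, n_snps)).astype(float)  # genotype calls coded 0/1/2

p = M.mean(axis=0) / 2.0                      # allele frequency per SNP
Z = M - 2.0 * p                               # center by twice the allele frequency
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))   # VanRaden (2008) genomic relationship matrix

print(G.shape, G[0, :5])
```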
https://spdx.org/licenses/CC0-1.0.html
Aim: Identifying the mechanisms influencing species' distributions is critical for accurate climate change forecasts. However, current approaches are limited by correlative models that cannot distinguish between direct and indirect effects.
Location: New Hampshire and Vermont, USA.
Methods: Using causal and correlational models and new theory on range limits, we compared current (2014–2019) and future (2080s) distributions of ecologically important mammalian carnivores and competitors along range limits in the northeastern US under two global climate models (GCMs) and a high-emissions scenario (RCP8.5) of projected snow and forest biomass change.
Results: Our hypothesis that causal models of climate-mediated competition would result in different distribution predictions than correlational models, both in the current and future periods, was well-supported by our results; however, these patterns were prominent only for species pairs that exhibited strong interactions. The causal model predicted the current distribution of Canada lynx (Lynx canadensis) more accurately, likely because it incorporated the influence of competitive interactions mediated by snow with the closely related bobcat (Lynx rufus). Both modeling frameworks predicted an overall decline in lynx occurrence in the central high elevation regions and increased occurrence in the northeastern region in the 2080s due to changes in land use that provided optimal habitat. However, these losses and gains were less substantial in the causal model due to the inclusion of an indirect buffering effect of snow on lynx.
Main conclusions: Our comparative analysis indicates that a causal framework, steeped in ecological theory, can be used to generate spatially-explicit predictions of species distributions. This approach can be used to disentangle correlated predictors that have previously hampered understanding of range limits and species' response to climate change.
Methods
We used data from 257 camera-trap sites spaced in non-overlapping grids based on the home range size of the smallest carnivore species (Martes americana = 2x2 km). Each site included a remote camera positioned facing north on a tree, 1–2 m above the snow surface, and pointed at a slight downward angle towards a stake positioned 3–5 m from the camera. Commercial skunk lure and turkey feathers were used as attractants and placed directly on the snow stakes. Cameras were set to take 1–3 consecutive pictures every 1–10 sec when triggered, depending on the brand and model, and checked on average 3 (range = 1–9) times each season to download data, refresh attractants, and to ensure cameras were working properly.
We used camera data from autumn to spring (16 October–15 May) for each year (2014–2019). This seasonal range was chosen as it approximates demographic (i.e., births and deaths) and geographic closure (i.e., dispersal) and is based on species’ ecological responses to snowpack and leaf phenology of the region. We identified species in photographs by their unique morphology and field marks and used consensus from multiple observers when identification was uncertain. We organized camera data into weekly occasions using CPW Photo Warehouse and recorded whether or not each species was detected during the occasion.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parameter Settings of Synthetic Data Generation.