Title: | Causal Discovery under a Confounder Blanket |
---|---|
Description: | Methods for learning causal relationships among a set of foreground variables X based on signals from a (potentially much larger) set of background variables Z, which are known non-descendants of X. The confounder blanket learner (CBL) uses sparse regression techniques to simultaneously perform many conditional independence tests, with complementary pairs stability selection to guarantee finite sample error control. CBL is sound and complete with respect to a so-called "lazy oracle", and works with both linear and nonlinear systems. For details, see Watson & Silva (2022) <arXiv:2205.05715>. |
Authors: | David Watson [aut, cre] |
Maintainer: | David Watson <[email protected]> |
License: | GPL (>=3) |
Version: | 0.1.2 |
Built: | 2025-03-10 03:17:55 UTC |
Source: | https://github.com/dswatson/cbl |
Simulated dataset with 2 foreground variables and 10 background variables. The design follows that of Watson & Silva (2022), with background variables drawn from a multivariate Gaussian distribution with a Toeplitz covariance matrix. Expected sparsity is 0.5, the signal-to-noise ratio is 2, and structural equations are linear. The ground truth causal structure for the foreground variables follows the bipartite design of Watson & Silva (2022).
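The Toeplitz design described above can be re-created in a few lines of base R. The sample size `n` and autocorrelation `rho` below are illustrative placeholders, not necessarily the values used to generate `bipartite`:

```r
# Hypothetical re-simulation of background variables: n samples from a
# multivariate Gaussian with a Toeplitz covariance matrix whose entries
# decay geometrically with autocorrelation rho. Values are illustrative.
set.seed(1)
n <- 100
d <- 10
rho <- 0.25
Sigma <- toeplitz(rho^(0:(d - 1)))  # d x d Toeplitz covariance matrix

# Sample via the Cholesky factor: rows of Z have covariance Sigma
Z <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)
colnames(Z) <- paste0("z", seq_len(d))
dim(Z)  # n x d matrix of background variables
```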
data(bipartite)
A list with two elements: x (foreground variables) and z (background variables).
Watson, D.S. & Silva, R. (2022). Causal discovery under a confounder blanket. To appear in Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence. arXiv preprint, 2205.05715.
# Load data
data(bipartite)
x <- bipartite$x
z <- bipartite$z

# Set seed
set.seed(42)

# Run CBL
cbl(x, z)
This function performs the confounder blanket learner (CBL) algorithm for causal discovery.
cbl(
  x,
  z,
  s = "lasso",
  B = 50,
  gamma = 0.5,
  maxiter = NULL,
  params = NULL,
  parallel = FALSE,
  ...
)
x | Matrix or data frame of foreground variables.
z | Matrix or data frame of background variables.
s | Feature selection method. Includes native support for sparse linear regression (s = "lasso") and gradient boosting (s = "boost"). Alternatively, a user-supplied feature selection function may be passed (see Examples).
B | Number of complementary pairs to draw for stability selection. Following Shah & Samworth (2013), we recommend leaving this fixed at 50.
gamma | Omission threshold. If either of two foreground variables is omitted from the model for the other with frequency at least gamma, the pair is inferred to be causally disconnected.
maxiter | Maximum number of iterations to loop through if convergence is elusive.
params | Optional list of parameters to pass to the feature selection subroutine.
parallel | Compute stability selection subroutine in parallel? Must register a backend beforehand, e.g. via doParallel.
... | Extra parameters to be passed to the feature selection subroutine.
The CBL algorithm (Watson & Silva, 2022) learns a partial order over foreground variables x via relations of minimal conditional (in)dependence with respect to a set of background variables z. The method is sound and complete with respect to a so-called "lazy oracle", who only answers independence queries about variable pairs conditioned on the intersection of their respective non-descendants.
For computational tractability, CBL performs conditional independence tests via supervised learning with feature selection. The current implementation includes support for sparse linear models (s = "lasso") and gradient boosting machines (s = "boost"). For statistical inference, CBL uses complementary pairs stability selection (Shah & Samworth, 2013), which bounds the probability of errors of commission.
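The complementary-pairs subsampling scheme can be sketched directly. This is an illustrative toy, not the package's internal implementation: the data, the value of B, and the simple lm/step selector (standing in for the lasso) are all assumptions.

```r
# Minimal sketch of complementary pairs stability selection (Shah &
# Samworth, 2013): draw B disjoint half-sample pairs, run a feature
# selector on each half, and record empirical selection frequencies.
set.seed(2)
n <- 100; d <- 5; B <- 10
x <- matrix(rnorm(n * d), n, d, dimnames = list(NULL, paste0("x", 1:d)))
y <- x[, 1] + rnorm(n)  # only x1 is truly active

# Stand-in selector: stepwise-reduced linear model, returns a bit vector
select_fn <- function(x, y) {
  df <- data.frame(x, y)
  f <- step(lm(y ~ 0 + ., data = df), trace = 0)
  as.integer(colnames(x) %in% names(coef(f)))
}

freq <- matrix(0, 2 * B, d)
for (b in seq_len(B)) {
  idx <- sample(n, floor(n / 2))                      # first half
  freq[2 * b - 1, ] <- select_fn(x[idx, ], y[idx])
  freq[2 * b, ]     <- select_fn(x[-idx, ], y[-idx])  # complementary half
}
colMeans(freq)  # empirical selection frequency per feature
```

Features selected with high frequency across both halves of many pairs are the ones stability selection retains; the complementary-pairs construction is what yields the finite-sample error bound.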
A square, lower triangular ancestrality matrix. Call this matrix m. If CBL infers that X_i is an ancestor of X_j, then m[j, i] = 1. If CBL infers that X_i is a non-descendant (though not necessarily an ancestor) of X_j, then m[j, i] = 0.5. If CBL infers that X_i and X_j are causally disconnected, then m[j, i] = 0. Otherwise, m[j, i] = NA.
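As an illustration, a small hypothetical ancestrality matrix with the entry coding used above (1, 0.5, 0, NA) can be built and queried by hand; the matrix below is made up, not the output of cbl():

```r
# Hypothetical 3 x 3 ancestrality matrix m, lower triangular, with
# entries 1 (ancestor), 0.5 (non-descendant), 0 (disconnected), NA (undecided)
m <- matrix(NA_real_, 3, 3,
            dimnames = list(paste0("x", 1:3), paste0("x", 1:3)))
m[2, 1] <- 1    # x1 inferred to be an ancestor of x2
m[3, 1] <- 0.5  # x1 inferred to be a non-descendant of x3
m[3, 2] <- 0    # x2 and x3 inferred to be causally disconnected

# Extract the definite ancestral pairs (entries equal to 1)
which(m == 1, arr.ind = TRUE)
```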
Watson, D.S. & Silva, R. (2022). Causal discovery under a confounder blanket. To appear in Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence. arXiv preprint, 2205.05715.
Shah, R. & Samworth, R. (2013). Variable selection with error control: Another look at stability selection. J. R. Statist. Soc. B, 75(1):55–80.
# Load data
data(bipartite)
x <- bipartite$x
z <- bipartite$z

# Set seed
set.seed(123)

# Run CBL
cbl(x, z)

# With user-supplied feature selection subroutine
s_new <- function(x, y) {
  # Fit model, extract coefficients
  df <- data.frame(x, y)
  f_full <- lm(y ~ 0 + ., data = df)
  f_reduced <- step(f_full, trace = 0)
  keep <- names(coef(f_reduced))
  # Return bit vector
  out <- ifelse(colnames(x) %in% keep, 1, 0)
  return(out)
}
cbl(x, z, s = s_new)
Compute the consistency lower bound
epsilon_fn(df, B)
df | Table of (de)activation rates.
B | Number of complementary pairs to draw for stability selection.
This function fits a potentially sparse supervised learning model and returns a bit vector indicating which features were selected.
l0(x, y, s, params, ...)
x | Design matrix.
y | Outcome vector.
s | Regression method. Current options are "lasso" and "boost".
params | Optional list of parameters to pass to the regression method.
... | Extra parameters to be passed to the feature selection subroutine.
Compute the min-D factor of Eq. 8 in Shah & Samworth (2013). Code taken verbatim from Rajen Shah's personal website: http://www.statslab.cam.ac.uk/~rds37/papers/r_concave_tail.R.
minD(theta, B, r = c(-1/2, -1/4))
theta | Low rate threshold.
B | Number of complementary pairs for stability selection.
r | Of r-concavity fame.
Compute the tail probability of an r-concave random variable. Code taken verbatim from Rajen Shah's personal website: http://www.statslab.cam.ac.uk/~rds37/papers/r_concave_tail.R.
r.TailProbs(eta, B, r)
eta | Upper bound on the expectation of the r-concave random variable.
B | Number of complementary pairs for stability selection.
r | Of r-concavity fame.
Infer causal direction using stability selection
ss_fn(df, epsilon, order, rule, B)
df | Table of (de)activation rates.
epsilon | Consistency lower bound, as computed by epsilon_fn.
order | Causal order of interest.
rule | Inference rule.
B | Number of complementary pairs to draw for stability selection.
This function executes one loop of the model quartet for a given pair of foreground variables and records any disconnections and/or (de)activations.
sub_loop(b, i, j, x, z_t, s, params, ...)
b | Subsample index.
i | First foreground variable index.
j | Second foreground variable index.
x | Matrix of foreground variables.
z_t | Intersection of iteration-t known non-descendants for foreground variables i and j.
s | Regression method. Current options are "lasso" and "boost".
params | Optional list of parameters to pass to the regression method.
... | Extra parameters to be passed to the feature selection subroutine.
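One way to picture the model quartet is as four feature selections for the pair (i, j): each foreground variable regressed on z_t, with and without the other foreground variable included. The selector below (lm plus step) and the exact bookkeeping are illustrative assumptions, not the package's internal code.

```r
# Hedged sketch of one quartet loop for a pair of foreground variables,
# assuming the quartet means: x_i ~ z_t, x_i ~ (z_t, x_j), and the two
# symmetric regressions for x_j. All data and the selector are toy choices.
set.seed(3)
n <- 100
z_t <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("z", 1:4)))
x <- cbind(x1 = z_t[, 1] + rnorm(n), x2 = z_t[, 1] + rnorm(n))

select_fn <- function(x, y) {
  df <- data.frame(x, y)
  f <- step(lm(y ~ 0 + ., data = df), trace = 0)
  as.integer(colnames(x) %in% names(coef(f)))
}

# The quartet: selections with and without the other foreground variable
s_i      <- select_fn(z_t, x[, 1])
s_i_full <- select_fn(cbind(z_t, xj = x[, 2]), x[, 1])
s_j      <- select_fn(z_t, x[, 2])
s_j_full <- select_fn(cbind(z_t, xi = x[, 1]), x[, 2])

# "Deactivations" (under these assumptions): background features selected
# without, but not with, the other foreground variable in the model
deact_i <- which(s_i == 1 & s_i_full[seq_len(ncol(z_t))] == 0)
```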