MRC Knockoffs Primer

Given a set of \(p\) features \(X = (X_1, \dots, X_p)\), how should one generate knockoffs \(\tilde{X}\)? A “conventional” method is to minimize the mean absolute correlation, defined as

\[\mathrm{MAC} = \sum_{j=1}^p |\text{Cor}(X_j, \tilde{X}_j)|.\]
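
For intuition, the MAC can be estimated directly from data. The helper below is purely illustrative (it is not part of knockpy) and assumes X and Xk are \(n \times p\) arrays containing the features and their knockoffs:

import numpy as np

def empirical_mac(X, Xk):
    # Sum of absolute sample correlations between each feature and its knockoff
    return sum(
        abs(np.corrcoef(X[:, j], Xk[:, j])[0, 1]) for j in range(X.shape[1])
    )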

As discussed in Spector and Janson (2020), MAC-minimizing knockoffs (known as “SDP knockoffs”) perform poorly when the features are correlated. In particular, minimizing the MAC often makes it possible to reconstruct each feature \(X_j\) using the other features \(X_{-j}\) and the knockoffs \(\tilde{X}\). In other words, SDP knockoffs often ensure that no feature contains any unique information, which makes it very hard to tell the important features from the unimportant ones.
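
One way to see this empirically is to regress each feature on the remaining features together with all of the knockoffs and inspect the resulting \(R^2\). The sketch below is illustrative only (the function is not part of knockpy):

import numpy as np

def reconstruction_r2(X, Xk, j):
    # R^2 from regressing feature j on the other features and all knockoffs
    target = X[:, j]
    design = np.column_stack([np.delete(X, j, axis=1), Xk, np.ones(X.shape[0])])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1 - resid.var() / target.var()

# An R^2 near 1 means X_j carries essentially no unique information.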

A better approach is to construct knockoffs that minimize the reconstructability (MRC) of each feature \(X_j\). Intuitively, this framework maximizes the amount of unique information each feature contains, making it easier to determine which features are important. Minimum variance-based reconstructability (MVR) knockoffs are one instantiation of this framework, and as illustrated below, they have much higher power than SDP knockoffs in a variety of settings.
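
For Gaussian knockoffs, both constructions are parameterized by a matrix \(S\) with \(\text{Cov}(X_j, \tilde{X}_j) = \Sigma_{jj} - S_{jj}\): SDP chooses \(S\) to minimize the MAC, while MVR chooses it to minimize reconstructability. The sketch below compares the two \(S\)-matrices on an equicorrelated covariance matrix; it assumes knockpy exposes the unified compute_smatrix interface in knockpy.smatrix.

import numpy as np
from knockpy import smatrix  # assumed interface; see the knockpy documentation

p = 100
rho = 0.5
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

# Larger S_jj means lower correlation between X_j and its knockoff,
# but not necessarily lower reconstructability from the other features.
S_mvr = smatrix.compute_smatrix(Sigma, method='mvr')
S_sdp = smatrix.compute_smatrix(Sigma, method='sdp')
print("Average S_jj (MVR):", np.diag(S_mvr).mean().round(3))
print("Average S_jj (SDP):", np.diag(S_sdp).mean().round(3))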

Example 1: We consider a setting where \(X\) is Gaussian and every pair of features has correlation \(50\%\). In this setting, SDP knockoffs are provably nearly powerless to select any features.

[1]:
import numpy as np
import knockpy as kpy
from knockpy.knockoff_filter import KnockoffFilter

p = 300
n = 600
np.random.seed(110)

# Covariance matrix of X
rho = 0.5
Sigma = (1-rho) * np.eye(p) + rho * np.ones((p, p))
X = np.random.multivariate_normal(np.zeros(p), cov=Sigma, size=(n,))

# Sample y given X
beta = kpy.dgp.create_sparse_coefficients(
    p=p, sparsity=0.2, coeff_size=1, coeff_dist="uniform"
)
y = X @ beta + np.random.randn(n)

# SDP knockoff filter
kfilter_sdp = KnockoffFilter(
    fstat='lasso', ksampler='gaussian', knockoff_kwargs={"method":"sdp"}
)
selections_sdp = kfilter_sdp.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)
print(f"SDP knockoffs made {selections_sdp.sum()} discoveries!")
SDP knockoffs made 0.0 discoveries!
[2]:
# Run MVR knockoff filter (the default construction)
kfilter_mvr = KnockoffFilter(fstat='lasso', ksampler='gaussian')
selections_mvr = kfilter_mvr.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)

# Calculate power and false discovery proportion
power = np.around(100*np.dot(selections_mvr, beta != 0) / max(1, (beta != 0).sum()), 2)
fdp = np.around(100*np.dot(selections_mvr, beta == 0) / max(1, selections_mvr.sum()), 2)
print(f"MVR knockoffs made {selections_mvr.sum()} discoveries.")
print(f"MVR knockoffs had a power of {power}% and false discovery proportion of {fdp}%.")
MVR knockoffs made 67.0 discoveries.
MVR knockoffs had a power of 100.0% and false discovery proportion of 10.45%.

Example 2: As shown below, SDP knockoffs also have low power when \(X\) has an AR(1) covariance matrix. See Spector and Janson (2020) for more discussion of this setting.
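
For reference, with a constant correlation \(\rho\), an AR(1) covariance matrix has entries \(\Sigma_{ij} = \rho^{|i-j|}\); a minimal construction is sketched below. (The knockpy DGP used in the next cell instead samples the adjacent correlations from a Beta(a, b) distribution.)

import numpy as np

rho = 0.7  # illustrative constant correlation
p = 5
indices = np.arange(p)
Sigma_ar1 = rho ** np.abs(np.subtract.outer(indices, indices))
print(Sigma_ar1.round(3))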

[3]:
# Generate data where X is Gaussian with an AR(1) covariance matrix,
# where the correlation between adjacent features X_j and X_{j+1} is sampled from a Beta(a, b) distribution,
# and y | X follows a sparse Gaussian linear model
np.random.seed(111)
data_gen_process = kpy.dgp.DGP()
data_gen_process.sample_data(
    method='ar1',
    a=3,
    b=1,
    n=650, # number of data-points
    p=500, # dimensionality
    sparsity=0.1, # proportion of non-null coefficients
    coeff_dist='uniform', # distribution of size of non-null coefficients
    coeff_size=1, # size of non-null coefficients
    corr_signals=True # non-null features are correlated with each other
)
X = data_gen_process.X
y = data_gen_process.y
beta = data_gen_process.beta
Sigma = data_gen_process.Sigma
[4]:
# Run SDP knockoff filter
kfilter_sdp = KnockoffFilter(
    fstat='lasso', ksampler='gaussian', knockoff_kwargs={"method":"sdp"}
)
selections_sdp = kfilter_sdp.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)

# Calculate power and false discovery proportion
power = np.around(100*np.dot(selections_sdp, beta != 0) / max(1, (beta != 0).sum()), 2)
fdp = np.around(100*np.dot(selections_sdp, beta == 0) / max(1, selections_sdp.sum()), 2)
print(f"SDP knockoffs made {selections_sdp.sum()} discoveries!")
print(f"SDP knockoffs had a power of {power}% and false discovery proportion of {fdp}%.")
SDP knockoffs made 0.0 discoveries!
SDP knockoffs had a power of 0.0% and false discovery proportion of 0.0%.
[5]:
# Run MVR knockoff filter (the default construction)
kfilter_mvr = KnockoffFilter(fstat='lasso', ksampler='gaussian')
selections_mvr = kfilter_mvr.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)

# Calculate power and false discovery proportion
power = np.around(100*np.dot(selections_mvr, beta != 0) / max(1, (beta != 0).sum()), 2)
fdp = np.around(100*np.dot(selections_mvr, beta == 0) / max(1, selections_mvr.sum()), 2)
print(f"MVR knockoffs made {selections_mvr.sum()} discoveries.")
print(f"MVR knockoffs had a power of {power}% and false discovery proportion of {fdp}%.")
MVR knockoffs made 55.0 discoveries.
MVR knockoffs had a power of 94.0% and false discovery proportion of 14.55%.