API Reference

The KnockoffFilter

class knockpy.knockoff_filter.KnockoffFilter(fstat='lasso', ksampler='gaussian', fstat_kwargs={}, knockoff_kwargs={})[source]

Performs knockoff-based inference, from start to finish.

This wraps both the knockoffs.KnockoffSampler and knockoff_stats.FeatureStatistic classes.

Parameters
fstat : str or knockpy.knockoff_stats.FeatureStatistic

The feature statistic to use in the knockoff filter. Defaults to a lasso statistic. This may also be a string identifier, including:

  • ‘lasso’ or ‘lcd’: cross-validated lasso coefficient differences

  • ‘mlr’: the masked likelihood ratio (MLR) statistic, which has provable optimality properties for knockoffs

  • ‘lsm’: signed maximum of the lasso path statistic as in Barber and Candes 2015

  • ‘dlasso’: Cross-validated debiased lasso coefficients

  • ‘ridge’: Cross-validated ridge coefficients

  • ‘ols’: Ordinary least squares coefficients

  • ‘margcorr’: marginal correlations between features and response

  • ‘deeppink’: The deepPINK statistic as in Lu et al. 2018

  • ‘mlrspline’ or ‘spline’: The MLR statistic applied to regression spline basis functions.

  • ‘randomforest’: A random forest with swap importances

Note that when using FX knockoffs, the lcd statistic does not use cross-validation and instead sets alphas = sigma2 * sqrt(log(p) / n).

ksampler : str or knockpy.knockoffs.KnockoffSampler

The knockoff sampler to use in the knockoff filter. This may also be a string identifier, including:

  • ‘gaussian’: Gaussian Model-X knockoffs

  • ‘fx’: Fixed-X knockoffs.

  • ‘metro’: Generic metropolized knockoff sampler

  • ‘artk’: t-tailed Markov chain.

  • ‘blockt’: Blocks of t-distributed variables

  • ‘gibbs_grid’: Discrete Gibbs grid

An alternative to specifying the ksampler is to simply pass in a knockoff matrix during the forward call.

fstat_kwargs : dict

Kwargs to pass to the feature statistic fit function, excluding the required arguments. Defaults to {}.

knockoff_kwargs : dict

Kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. This can be the empty dict for some identifiers such as “gaussian” or “fx”, but additional keyword arguments are required for complex samplers such as the “metro” identifier. Defaults to {}.

Examples

Here we fit the KnockoffFilter on fake data from a Gaussian linear model:

# Fake data-generating process for Gaussian linear model
import knockpy as kpy
dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=500, p=500, sparsity=0.1)

# LCD statistic with Gaussian MX knockoffs
# This uses LedoitWolf covariance estimation by default.
from knockpy.knockoff_filter import KnockoffFilter 
kfilter = KnockoffFilter( 
    fstat='lcd', 
    ksampler='gaussian', 
    knockoff_kwargs={"method":"mvr"}, 
)
rejections = kfilter.forward(X=dgprocess.X, y=dgprocess.y)

Attributes
fstat : knockpy.knockoff_stats.FeatureStatistic

The feature statistic to use for inference. This inherits from knockoff_stats.FeatureStatistic.

ksampler : knockpy.knockoffs.KnockoffSampler

The knockoff sampler to use during inference. This eventually inherits from knockoffs.KnockoffSampler.

fstat_kwargs : dict

Dictionary of kwargs to pass to the fit call of self.fstat.

knockoff_kwargs : dict

If ksampler is not yet initialized, kwargs to pass to ksampler.

Z : np.ndarray

a (2p,)-dimensional array of feature and knockoff importances. The first p coordinates correspond to features, the last p correspond to knockoffs.

W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate knockoffs.

X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

rejections : np.ndarray

a (p,)-shaped boolean array where rejections[j] == 1 iff the knockoff filter rejects the jth feature.

G : np.ndarray

the (2p, 2p)-shaped feature-knockoff covariance matrix

threshold : float

the knockoff data-dependent threshold used to select variables

Methods

forward(X, y[, Xk, mu, Sigma, groups, fdr, ...])

Runs the knockoff filter; returns whether each feature was rejected.

make_selections(W, fdr)

" Calculate data dependent threshhold and selections

sample_knockoffs()

Samples knockoffs during forward.

seqstep_plot([max_rank])

Visualizes the knockoff filter.

forward(X, y, Xk=None, mu=None, Sigma=None, groups=None, fdr=0.1, fstat_kwargs={}, knockoff_kwargs={}, shrinkage='ledoitwolf', num_factors=None, recycle_up_to=None)[source]

Runs the knockoff filter; returns whether each feature was rejected.

Parameters
X : np.ndarray

(n, p)-shaped design matrix

y : np.ndarray

(n,)-shaped response vector

Xk : np.ndarray

(n, p)-shaped knockoff matrix. If None, this will construct knockoffs using self.ksampler.

mu : np.ndarray

(p,)-shaped mean of the features. If None, this defaults to the empirical mean of the features.

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the shrinkage option. This is ignored for fixed-X knockoffs.

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

fdr : float

The desired level of false discovery rate control.

fstat_kwargs : dict

Extra kwargs to pass to the feature statistic fit function, excluding the required arguments.

knockoff_kwargs : dict

Extra kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. This can be the empty dict for some identifiers such as “gaussian” or “fx”, but additional keyword arguments are required for complex samplers such as the “metro” identifier. Defaults to {}.

shrinkage : str

Shrinkage method if estimating the covariance matrix. Defaults to “LedoitWolf.” Other options are “MLE” and “glasso” (graphical lasso).

num_factors : int

If num_factors is not None and Sigma is estimated, assumes that the ground-truth Sigma is a factor model with rank num_factors to speed up computation.

recycle_up_to : int or float

Three options:

  • if None, does nothing.

  • if an integer > 1, uses the first recycle_up_to rows of X as the first recycle_up_to rows of knockoffs.

  • if a float between 0 and 1 (inclusive), interpreted as the proportion of rows to recycle.

For more on recycling, see https://arxiv.org/abs/1602.03574
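
For example, a minimal sketch of a customized forward call (assuming, as in the example above, that the DGP object stores the sampled X, y, and the true covariance matrix Sigma; the specific argument values here are illustrative):

# Illustrative sketch: run the filter at FDR level 20%, pass the true
# covariance matrix to skip estimation, and recycle half the rows.
import knockpy as kpy
from knockpy.knockoff_filter import KnockoffFilter

dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=500, p=200, sparsity=0.1)
kfilter = KnockoffFilter(fstat='lcd', ksampler='gaussian')
rejections = kfilter.forward(
    X=dgprocess.X,
    y=dgprocess.y,
    Sigma=dgprocess.Sigma,  # assumed attribute of the DGP object
    fdr=0.2,
    recycle_up_to=0.5,      # recycle the first 50% of rows
)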

make_selections(W, fdr)[source]

Calculates the data-dependent threshold and selections.

sample_knockoffs()[source]

Samples knockoffs during forward.

seqstep_plot(max_rank=None)[source]

Visualizes the knockoff filter.

Parameters
max_rank : int

Only plot the top max_rank feature statistics. Defaults to plotting all feature statistics.

Feature Statistics

class knockpy.knockoff_stats.DeepPinkStatistic(model=None)[source]

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, feature_importance, ...])

Wraps the FeatureStatistic class using DeepPINK to generate variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, feature_importance='deeppink', antisym='cd', group_agg='sum', cv_score=False, train_kwargs={'verbose': False}, **kwargs)[source]

Wraps the FeatureStatistic class using DeepPINK to generate variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importance : str

Specifies how to create feature importances from model. Four options:

  • “deeppink”: Use the deeppink feature importance defined in https://arxiv.org/abs/1809.01185

  • “unweighted”: Use the Z weights from the deeppink paper without weighting them using the layers from the MLP. Deeppink usually outperforms this feature importance (but not always).

  • “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

  • “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to deeppink, which is often both the most powerful and the most computationally efficient.

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “cd”: difference of absolute values of coefficients

  • “sm”: signed maximum

  • “scd”: simple difference of coefficients (NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation. This is extremely expensive for deeppink.

kwargs : dict

Extra kwargs to pass to the underlying deeppink class (in kpytorch)

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.FeatureStatistic(model=None)[source]

The base knockoff feature statistic class, which can wrap any generic prediction algorithm.

Parameters
model

An instance of a class with a “train” or “fit” method and a “predict” method.

Attributes
model

A (predictive) model class underlying the variable importance measures.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

score : float

A metric of the model’s performance, as defined by score_type.

score_type : str

One of MSE, CVMSE, accuracy, or cvaccuracy. (cv stands for cross-validated)

Z : np.ndarray

a (2p,)-dimensional array of feature and knockoff importances. The first p coordinates correspond to features, the last p correspond to knockoffs.

W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
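
For illustration, a hedged sketch of wrapping a generic scikit-learn model (KNeighborsRegressor is an arbitrary choice here; any model exposing fit and predict should work, and importances default to swap statistics):

import knockpy as kpy
from knockpy.knockoff_stats import FeatureStatistic
from knockpy.knockoffs import GaussianSampler
from sklearn.neighbors import KNeighborsRegressor

dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=300, p=50, sparsity=0.2)
Xk = GaussianSampler(
    X=dgprocess.X, Sigma=dgprocess.Sigma
).sample_knockoffs()

# Wrap a generic predictive model in the base feature statistic class
fstat = FeatureStatistic(model=KNeighborsRegressor())
W = fstat.fit(dgprocess.X, Xk, dgprocess.y)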

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, feature_importance, ...])

Trains the model and creates feature importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

cv_score_model(features, y, cv_score, logistic_flag=False)[source]

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='avg', **kwargs)[source]

Trains the model and creates feature importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importance : str

Specifies how to create feature importances from model. Two options:

  • “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower.

  • “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to ‘swap’.

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “cd”: difference of absolute values of coefficients

  • “sm”: signed maximum

  • “scd”: simple difference of coefficients (NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs : dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

score_model(features, y, y_dist=None)[source]

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

Returns
loss : float

Either the MSE or one minus the accuracy of the model, depending on whether y is continuous or binary.

swap_feature_importances(features, y)[source]

Given a model of the features and y, calculates feature importances as follows.

For feature i, replace the feature with its knockoff and calculate the relative increase in the loss. Similarly, for knockoff i, replace the knockoff with its feature and calculate the relative increase in the loss.

Parameters
features : np.ndarray

(n, 2p)-shaped array of concatenated features and knockoffs, which must be permuted by self.inds.

y : np.ndarray

(n,)-shaped response vector

Returns
Z_swap : np.ndarray

(2p,)-shaped array of variable importances, in the same permuted order as features.
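
Schematically, the computation looks like the following simplified (hypothetical) sketch, which ignores the permutation bookkeeping handled by self.inds:

import numpy as np

def swap_importances_sketch(model, features, y, loss):
    """Simplified swap importances; features is an (n, 2p) array."""
    p = features.shape[1] // 2
    base_loss = loss(y, model.predict(features))
    Z = np.zeros(2 * p)
    for j in range(2 * p):
        partner = j + p if j < p else j - p
        swapped = features.copy()
        swapped[:, j] = features[:, partner]  # swap feature <-> knockoff
        # relative increase in the loss after the swap
        Z[j] = (loss(y, model.predict(swapped)) - base_loss) / base_loss
    return Z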

swap_path_feature_importances(features, y, step_size=0.5, max_lambda=5)[source]

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

class knockpy.knockoff_stats.LassoStatistic[source]

Lasso Statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, zstat, antisym, ...])

Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, zstat='coef', antisym='cd', group_agg='avg', cv_score=False, debias=False, Ginv=None, **kwargs)[source]

Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

zstat : str

Two options for the variable importance measure:

  • If ‘coef’, uses the cross-validated (group) lasso coefficients.

  • If ‘lars_path’, uses the lambda value at which each feature/knockoff enters the lasso path (i.e., becomes nonzero).

Defaults to ‘coef’.

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “cd”: difference of absolute values of coefficients

  • “sm”: signed maximum

  • “scd”: simple difference of coefficients (NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation.

debias : bool

If true, debias the lasso. See https://arxiv.org/abs/1508.02757

Ginv : np.ndarray

(2p, 2p)-shaped precision matrix for the feature-knockoff covariate distribution. This must be specified if debias=True.

kwargs : dict

Extra kwargs to pass to the underlying Lasso classes.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

Notes

When not using the group lasso, one can specify choice(s) of regularization parameter using the alphas keyword argument. Otherwise, use the reg_vals keyword argument.
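
As a usage sketch (assumptions as in the other examples: the DGP object stores X, y, and Sigma):

import knockpy as kpy
from knockpy.knockoff_stats import LassoStatistic
from knockpy.knockoffs import GaussianSampler

dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=300, p=100, sparsity=0.2)
Xk = GaussianSampler(
    X=dgprocess.X, Sigma=dgprocess.Sigma
).sample_knockoffs()

# Lasso-path statistic with a signed-maximum antisymmetric function
W = LassoStatistic().fit(
    dgprocess.X, Xk, dgprocess.y, zstat='lars_path', antisym='sm'
)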

class knockpy.knockoff_stats.MargCorrStatistic[source]

Marginal correlation statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups])

Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, **kwargs)[source]

Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

kwargs : dict

Extra kwargs to pass to underlying combine_Z_stats

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.OLSStatistic[source]

OLS statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, cv_score])

Wraps the FeatureStatistic class with OLS coefs as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class with OLS coefs as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation.

kwargs : dict

Extra kwargs to pass to combine_Z_stats.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.RandomForestStatistic(model=None)[source]

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, feature_importance, ...])

Wraps the FeatureStatistic class using a Random Forest to generate variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='sum', cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class using a Random Forest to generate variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importance : str

Specifies how to create feature importances from model. Three options:

  • “sklearn”: Use sklearn feature importances. These are very poor measures of feature importance, but very fast.

  • “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower.

  • “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to ‘swap’.

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “cd”: difference of absolute values of coefficients

  • “sm”: signed maximum

  • “scd”: simple difference of coefficients (NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation. This is extremely expensive for random forests.

kwargs : dict

Extra kwargs to pass to the underlying RandomForest class

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.RidgeStatistic[source]

Ridge statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, ...])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, antisym, group_agg, ...])

Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, ...])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, antisym='cd', group_agg='avg', cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “cd”: difference of absolute values of coefficients

  • “sm”: signed maximum

  • “scd”: simple difference of coefficients (NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation.

kwargs : dict

Extra kwargs to pass to the underlying ridge classes

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

knockpy.knockoff_stats.calc_lars_path(X, Xk, y, groups=None, **kwargs)[source]

Calculates locations at which X/knockoffs enter lasso model when regressed on y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

**kwargs

kwargs for sklearn.linear_model.lars_path

Returns
Z : np.ndarray

(2p,)-shaped array indicating the lasso path statistic for each variable. (This means the maximum lambda such that the lasso coefficient on variable j is nonzero.)

knockpy.knockoff_stats.calc_mse(model, X, y)[source]

Gets MSE of a model

knockpy.knockoff_stats.combine_Z_stats(Z, groups=None, antisym='cd', group_agg='sum')[source]

Given Z scores (variable importances), returns (grouped) feature statistics

Parameters
Z : np.ndarray

(2p,)-shaped numpy array of Z-statistics. The first p values correspond to true features, and the last p correspond to knockoffs (in the same order as the true features).

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “cd”: difference of absolute values of coefficients

  • “sm”: signed maximum

  • “scd”: simple difference of coefficients (NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
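
A small sketch of the default ungrouped behavior (the numbers are made up): with antisym=‘cd’, W[j] = |Z[j]| - |Z[j + p]|.

import numpy as np
from knockpy.knockoff_stats import combine_Z_stats

Z = np.array([3.0, 0.5, 2.0, 1.0, 0.4, 0.1])  # p = 3: features, then knockoffs
W = combine_Z_stats(Z, antisym='cd')
print(W)  # expected: [2.0, 0.1, 1.9]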

knockpy.knockoff_stats.compute_residual_variance(X, Xk, y)[source]

Estimates sigma2 using residual variance of y given [X, Xk]. Uses a penalized linear model if n < 2p + 10 or p >= 1500 (for efficiency).

Returns
sigma2 : float

Estimated residual variance

knockpy.knockoff_stats.data_dependent_threshhold(W, fdr=0.1, offset=1)[source]

Calculates the data-dependent threshold given W statistics.

Parameters
W : np.ndarray

p-length numpy array of feature statistics OR (p, batch_length)-shaped array.

fdr : float

desired level of false discovery rate control

offset : int

If offset = 0, controls the modified FDR. If offset = 1 (default), controls the FDR exactly.

Returns
T : float or np.ndarray

The data-dependent threshold. Either a float or a (batch_length,)-dimensional array.
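
As a sketch, this implements the usual knockoff stopping rule: T is the smallest t among the values |W_j| such that (offset + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= fdr. For example:

import numpy as np
from knockpy.knockoff_stats import data_dependent_threshhold

W = np.array([5.0, 4.0, -3.0, 2.0, 1.0, -0.5])
T = data_dependent_threshhold(W, fdr=0.5, offset=1)
# Here T should equal 1.0: at t = 1, (1 + 1) / 4 = 0.5 <= fdr
selected = W >= T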

knockpy.knockoff_stats.fit_lasso(X, Xk, y, y_dist=None, alphas=None, use_lars=False, **kwargs)[source]

Fits cross-validated lasso on [X, Xk] and y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix.

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

y_dist : str

One of “binomial” or “gaussian”

alphas : float

The regularization parameter(s) to try. Selects one alpha using cross-validation by default.

use_lars : bool

If True, uses a LARS-based solver for Gaussian data. If False, uses a gradient-based solver (default).

**kwargs

kwargs for sklearn model.

Returns
gl : sklearn Lasso/LassoCV/LassoLarsCV/LogisticRegressionCV object

The sklearn model fit through cross-validation.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

Notes

To avoid FDR control violations, this randomly permutes the features before fitting the lasso.
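
A hedged usage sketch (the returned inds/rev_inds undo the random permutation described above; the DGP assumptions are as in the earlier examples):

import knockpy as kpy
from knockpy.knockoff_stats import fit_lasso
from knockpy.knockoffs import GaussianSampler

dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=200, p=50, sparsity=0.2)
Xk = GaussianSampler(
    X=dgprocess.X, Sigma=dgprocess.Sigma
).sample_knockoffs()

gl, inds, rev_inds = fit_lasso(dgprocess.X, Xk, dgprocess.y)
beta = gl.coef_[rev_inds]  # coefficients in the original [X, Xk] order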

knockpy.knockoff_stats.fit_ridge(X, Xk, y, y_dist=None, **kwargs)[source]

Fits cross-validated ridge on [X, Xk] and y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

y_dist : str

One of “binomial” or “gaussian”

**kwargs

kwargs for sklearn model.

Returns
gl : sklearn.linear_model.RidgeCV/LogisticRegressionCV

The sklearn model fit through cross-validation.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

Notes

To avoid FDR control violations, this randomly permutes the features before fitting the ridge.

knockpy.knockoff_stats.parse_logistic_flag(kwargs)[source]

Checks whether y_dist is binomial

knockpy.knockoff_stats.parse_y_dist(y)[source]

Checks if y is binary; else assumes it is continuous

Gaussian and Fixed-X Knockoff Samplers

class knockpy.knockoffs.FXSampler(X, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)[source]

Samples FX knockoffs. See the GaussianSampler documentation for description of the arguments.

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Rescales S to the same scale as the initial X input

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

sample_knockoffs([check_psd])

Samples knockoffs.

fetch_S()[source]

Rescales S to the same scale as the initial X input

sample_knockoffs(check_psd=False)[source]

Samples knockoffs; returns an (n, p)-shaped knockoff matrix.

Parameters
check_psd : bool

If True, will check and enforce that S is a valid S-matrix. Defaults to False.

class knockpy.knockoffs.GaussianSampler(X, mu=None, Sigma=None, invSigma=None, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)[source]

Samples MX Gaussian (group) knockoffs.

Parameters
X : np.ndarray

the (n, p)-shaped design

mu : np.ndarray

(p,)-shaped mean of the features. If None, this defaults to the empirical mean of the features.

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the utilities.estimate_covariance function.

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate knockoffs. This is defined such that Cov(X, tilde(X)) = Sigma - S. When None, will be constructed by the knockoff generator. Defaults to None.

method : str

Specifies how to construct the S matrix. This will be ignored if S is not None. There are several options:

  • ‘mvr’: Minimum Variance-Based Reconstructability knockoffs.

  • ‘mmi’: Minimizes the mutual information between X and the knockoffs.

  • ‘ci’: Conditional independence knockoffs.

  • ‘sdp’: Minimizes the mean absolute covariance (MAC) between the features and the knockoffs.

  • ‘equicorrelated’: Minimizes the MAC under the constraint that the correlation between each feature and its knockoff is the same.

The default is to use mvr for non-group knockoffs, and to use the group-SDP for grouped knockoffs (the implementation for group mvr knockoffs is currently fairly slow). In both cases we use a block-diagonal approximation if the number of features is greater than 1000.

objective : str

How to optimize the S matrix if using the SDP for group knockoffs. There are several options:

  • ‘abs’: minimize sum(abs(Sigma - S)) between groups and the group knockoffs.

  • ‘pnorm’: minimize the Lp-th matrix norm. Equivalent to abs when p = 1.

  • ‘norm’: minimize a different type of matrix norm (see norm_type below).

sample_tol : float

Minimum eigenvalue allowed for the feature-knockoff covariance matrix. Keep this small but nonzero (1e-5) to prevent numerical errors.

verbose : bool

If True, prints progress over time

rec_prop : float

The proportion of knockoffs to recycle (see Barber and Candes 2018, https://arxiv.org/abs/1602.03574). If method = ‘mvr’, then S-generation takes this into account and should increase the power of recycled knockoffs, especially in sparsely-correlated, high-dimensional settings.

kwargs : dict

Other kwargs for S-matrix solvers.

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

sample_knockoffs([check_psd])

Samples knockoffs.

fetch_S()[source]

Fetches knockoff S-matrix.

sample_knockoffs(check_psd=False)[source]

Samples knockoffs; returns an (n, p)-shaped knockoff matrix.

Parameters
check_psd : bool

If True, will check and enforce that S is a valid S-matrix. Defaults to False.
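
A minimal usage sketch (assuming, as in the KnockoffFilter example, that the DGP object stores the sampled X and its true covariance matrix Sigma):

import knockpy as kpy
from knockpy.knockoffs import GaussianSampler

dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=300, p=60)
sampler = GaussianSampler(
    X=dgprocess.X,
    Sigma=dgprocess.Sigma,
    method='mvr',   # minimum variance-based reconstructability
)
Xk = sampler.sample_knockoffs()
S = sampler.fetch_S()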

class knockpy.knockoffs.KnockoffSampler[source]

Base class for sampling knockoffs.

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

sample_knockoffs

check_PSD_condition(Sigma, S)[source]

Checks that the feature-knockoff cov matrix is PSD.

Parameters
Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the shrinkage option. This is ignored for fixed-X knockoffs.

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate knockoffs.

Raises
Raises an error if S is not PSD or 2 Sigma - S is not PSD.

check_xk_validity(X, Xk, testname='', alpha=0.001)[source]

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X. Uses the BHQ adjustment for multiple testing.

Parameters
X : np.ndarray

the (n, p)-shaped design

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

testname : str

a testname that shows up in the error

alpha : float

The significance level. Defaults to 0.001.

fetch_S()[source]

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)[source]

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

knockpy.knockoffs.produce_FX_knockoffs(X, invSigma, S, copies=1)[source]

See equation (1.4) of https://arxiv.org/pdf/1404.5609.pdf

Metropolized Samplers

The metropolized knockoff sampler for an arbitrary probability density and graphical structure using covariance-guided proposals.

See https://arxiv.org/abs/1903.00434 for a description of the algorithm and proof of validity and runtime.

This code was based on initial code written by Stephen Bates in October 2019, which was released in combination with https://arxiv.org/abs/1903.00434.

class knockpy.metro.ARTKSampler(X, Sigma, df_t, **kwargs)[source]

Samples knockoffs for autoregressive T-distributed designs. (Hence, ARTK). See https://arxiv.org/pdf/1903.00434.pdf for details.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. The first diagonal should be the pairwise correlations which define the Markov chain.

df_t : float

The degrees of freedom for the t-distributions.

kwargs : dict

kwargs to pass to the constructor method of the generic MetropolizedKnockoffSampler class.

Methods

cache_conditional_proposal_params([verbose, ...])

Caches some of the conditional means for Xjstar | Xtemp.

center(M[, active_inds])

Centers an n x j matrix M.

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

compute_F(x_flags, j)

Computes the F function from page 33 of the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j[, log_q1, ...])

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

create_proposal_params(**kwargs)

Constructs the covariance-guided proposal.

fetch_S()

Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)

Same as above, but uses caching to speed up computation.

fetch_proposal_params(X, prev_proposals)

Returns mean and variance of proposal j given X and previous proposals.

lf(X)

Reordered likelihood function

lf_ratio(X, Xjstar, j)

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X.

log_q12(x_flags, j)

Computes q1 and q2 as specified by page 33 of the paper.

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

q_ll(Xjstar, X, prev_proposals[, cond_mean, ...])

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

sample_knockoffs([clip, cache])

Samples knockoffs using the metropolized knockoff sampler.

sample_proposals(X, prev_proposals[, ...])

Samples a continuous or discrete proposal given the design matrix and the previous proposals.
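
A rough usage sketch; the data simulation below is a crude stand-in for a t-tailed AR(1) chain (see knockpy.dgp for the exact data-generating processes):

import numpy as np
from knockpy.metro import ARTKSampler

rng = np.random.default_rng(0)
n, p, rho, df_t = 200, 20, 0.5, 5

# Crude approximation of a t-tailed AR(1) Markov chain (illustration only)
X = np.zeros((n, p))
X[:, 0] = rng.standard_t(df_t, size=n)
for j in range(1, p):
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_t(df_t, size=n)

# AR(1) correlation matrix; its first off-diagonal defines the chain
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

sampler = ARTKSampler(X=X, Sigma=Sigma, df_t=df_t)
Xk = sampler.sample_knockoffs()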

class knockpy.metro.BlockTSampler(X, Sigma, df_t, **kwargs)[source]

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

sample_knockoffs(**kwargs)

Samples knockoffs sequentially for each block.

fetch_S()[source]

Fetches knockoff S-matrix.

sample_knockoffs(**kwargs)[source]

Samples knockoffs sequentially for each block.

Parameters
kwargs : dict

kwargs for the MetropolizedKnockoffSampler.sample_knockoffs call for each block.

Returns
Xk : np.ndarray

An (n, p)-shaped knockoff matrix in the original order the variables were passed in.

class knockpy.metro.GibbsGridSampler(X, gibbs_graph, Sigma, Q=None, mu=None, max_width=6, **kwargs)[source]

Samples knockoffs for a discrete gibbs grid using the divide-and-conquer algorithm plus metropolized knockoff sampling.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

gibbs_graph : np.ndarray

(p, p)-shaped matrix specifying the distribution of the gibbs grid: see knockpy.dgp.sample_gibbs. This must correspond to a grid-like undirected graphical model.

Sigma : np.ndarray

(p, p)-shaped estimated covariance matrix of the data.

max_width : int

The maximum treewidth to allow in the divide-and-conquer algorithm.

Notes

Unlike the attributes of a MetropolizedKnockoffSampler class, the attributes of a GibbsGridSampler class are stored in the same order that the design matrix is initially passed in. E.g., self.Xk corresponds with self.X.

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Returns None because the divide-and-conquer approach means there is no one S-matrix.

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

sample_knockoffs(**kwargs)

Samples knockoffs using divide-and-conquer approach.

coords2num

num2coords

fetch_S()[source]

Returns None because the divide-and-conquer approach means there is no one S-matrix.

sample_knockoffs(**kwargs)[source]

Samples knockoffs using divide-and-conquer approach.

Parameters
kwargs : dict

Keyword args for MetropolizedKnockoffSampler.sample_knockoffs.

Returns
Xk : np.ndarray

An (n, p)-shaped knockoff matrix in the original order the variables were passed in.

class knockpy.metro.MetropolizedKnockoffSampler(lf, X, mu=None, Sigma=None, undir_graph=None, order=None, active_frontier=None, gamma=0.999, metro_verbose=False, cliques=None, log_potentials=None, buckets=None, **kwargs)[source]

A metropolized knockoff sampler for arbitrary random variables using covariance-guided proposals.

Group knockoffs are not yet supported.

Parameters
lf : function

log-probability density. This function should take an (n, p)-shaped numpy array (n independent samples of a p-dimensional vector) and return an (n,)-shaped array of log-probabilities. This can also be supplied as None if cliques and log_potentials are supplied.

X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

mu : np.ndarray

The (estimated) mean of X. Exact FDR control is maintained even when this vector is incorrect. Defaults to the mean of X, e.g., X.mean(axis=0).

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated from the data using a naive method to ensure compatibility with the proposals. Exact FDR control is maintained even when Sigma is incorrect.

undir_graph : np.ndarray or nx.Graph

An undirected graph specifying the conditional independence structure of the data-generating process. This must be specified if either of the order or active_frontier params are not specified. One of two options:

  • A networkx undirected graph object

  • A (p, p)-shaped numpy array, where nonzero elements represent edges.

order : np.ndarray

A p-length numpy array specifying the ordering to sample the variables. Should be a vector with unique entries 0,…,p-1.

active_frontier : list of lists

A list of lists of length p where entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of the distribution given by lf. See page 34 of the paper.

gamma : float

A tuning parameter to increase / decrease the acceptance ratio. See appendix F.2. Defaults to 0.999.

buckets : np.ndarray or list

A list or array of discrete values that X can take. Covariance-guided proposals will be rounded to these values. If None, Metro assumes the domain of each feature is all real numbers.

kwargs : dict

kwargs to pass to the smatrix.compute_smatrix method for sampling proposals.

Notes

All attributes of the MetropolizedKnockoffSampler are stored in the order that knockoffs are sampled, NOT the order that variables are initially passed in. For example, the X attribute will not necessarily equal the X argument: instead, self.X = X[:, self.order]. To reorder attributes to the initial order of the X argument, use the syntax self.attribute[:, self.inv_order].
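
A rough, hypothetical sketch of direct usage on a t-tailed AR(1) chain, using the t_markov_loglike helper documented below and a path-graph independence structure (in practice the ARTKSampler wraps exactly this pattern):

import numpy as np
from knockpy.metro import MetropolizedKnockoffSampler, t_markov_loglike

rng = np.random.default_rng(0)
n, p, rho, df_t = 100, 10, 0.5, 5

# Crude t-tailed AR(1) data, as in the ARTKSampler example above
X = np.zeros((n, p))
X[:, 0] = rng.standard_t(df_t, size=n)
for j in range(1, p):
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_t(df_t, size=n)

rhos = rho * np.ones(p - 1)
lf = lambda X: t_markov_loglike(X, rhos, df_t=df_t)

# Path-graph adjacency matrix: variable j depends only on its neighbors
G = np.zeros((p, p))
for j in range(p - 1):
    G[j, j + 1] = G[j + 1, j] = 1

sampler = MetropolizedKnockoffSampler(lf=lf, X=X, undir_graph=G)
Xk = sampler.sample_knockoffs()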

Attributes
order : np.ndarray

(p,)-shaped array of indices which reorders X into the order for sampling knockoffs.

inv_order : np.ndarray

(p,)-shaped array of indices which takes a set of variables which have been reordered for metropolized sampling and returns them to their initial order. For example, X == X[:, self.order][:, self.inv_order].

X : np.ndarray

(n, p)-shaped design matrix reordered according to the order for sampling knockoffs

X_prop : np.ndarray

(n, p)-shaped array of knockoff proposals

Xk : np.ndarray

the (n, p)-shaped array of knockoffs

acceptances : np.ndarray

an (n, p)-shaped boolean array where acceptances[i, j] == 1 indicates that X_prop[i, j] was accepted.

final_acc_probs : np.ndarray

an (n, p)-shaped array where final_acc_probs[i, j] is the acceptance probability for X_prop[i, j].

Sigma : np.ndarray

the (p, p)-shaped estimated covariance matrix of X. The class constructor guarantees this is compatible with the conditional independence structure of the data.

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate the covariance-guided proposals.

Methods

cache_conditional_proposal_params([verbose, ...])

Caches some of the conditional means for Xjstar | Xtemp.

center(M[, active_inds])

Centers an n x j matrix M.

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

compute_F(x_flags, j)

Computes the F function from page 33 of the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j[, log_q1, ...])

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

create_proposal_params(**kwargs)

Constructs the covariance-guided proposal.

fetch_S()

Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)

Same as above, but uses caching to speed up computation.

fetch_proposal_params(X, prev_proposals)

Returns mean and variance of proposal j given X and previous proposals.

lf(X)

Reordered likelihood function

lf_ratio(X, Xjstar, j)

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X.

log_q12(x_flags, j)

Computes q1 and q2 as specified by page 33 of the paper.

many_ks_tests(sample1s, sample2s)

Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.

q_ll(Xjstar, X, prev_proposals[, cond_mean, ...])

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

sample_knockoffs([clip, cache])

Samples knockoffs using the metropolized knockoff sampler.

sample_proposals(X, prev_proposals[, ...])

Samples a continuous or discrete proposal given the design matrix and the previous proposals.

cache_conditional_proposal_params(verbose=False, expensive_cache=True)[source]

Caches some of the conditional means for Xjstar | Xtemp. If expensive_cache = True, this will be quite memory-intensive in order to achieve a 2-3x speedup. Otherwise, achieves a 20-30% speedup at a more modest memory cost.

center(M, active_inds=None)[source]

Centers an n x j matrix M. When mu = 0, this computation is skipped, since centering is actually a bottleneck for large n and p.

compute_F(x_flags, j)[source]

Computes the F function from page 33 of the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j, log_q1=None, log_q2=None, Xtemp=None)[source]

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

Mathematically, this is: Pr(tildeXj = Xjstar | Xtemp, Xtilde_{1:j-1}, Xstar_{1:j})

create_proposal_params(**kwargs)[source]

Constructs the covariance-guided proposal.

Parameters
kwargs : dict

kwargs for the smatrix.compute_smatrix function

fetch_S()[source]

Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)[source]

Same as above, but uses caching to speed up computation. This caching can be cheap (if self.cache is False) or extremely expensive (if self.cache is True) in terms of memory.

fetch_proposal_params(X, prev_proposals)[source]

Returns mean and variance of proposal j given X and previous proposals. Both X and prev_proposals must be in the order used to sample knockoff variables.

lf(X)[source]

Reordered likelihood function

lf_ratio(X, Xjstar, j)[source]

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X. This is equivalent to (but often faster than):

>>> ld_obs = self.lf(X)
>>> Xnew = X.copy()
>>> Xnew[:, j] = Xjstar
>>> ld_prop = self.lf(Xnew)
>>> ld_ratio = ld_prop - ld_obs

When node potentials have been passed, this is much faster than calculating the log-likelihood function and subtracting.

Parameters
  • X – an n x p matrix of observations

  • Xjstar – New observations for column j of X

  • j – an int between 0 and p - 1, telling us which column to replace

log_q12(x_flags, j)[source]

Computes q1 and q2 as specified by page 33 of the paper.

q_ll(Xjstar, X, prev_proposals, cond_mean=None, cond_var=None)[source]

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

Parameters
Xjstar : np.ndarray

(n,)-shaped numpy array of values to evaluate the proposal likelihood at.

X : np.ndarray

(n, p)-shaped array of observed data, in the order used to sample knockoff variables.

prev_proposals : np.ndarray

(n, j-1)-shaped array of previous proposals, in the order used to sample knockoff variables. If None, assumes j = 0.

sample_knockoffs(clip=1e-05, cache=None)[source]

Samples knockoffs using the metropolized knockoff sampler.

Parameters
clip : float

To provide numerical stability, we make the minimum acceptance probability clip. If clip=0, some acceptance probabilities may become negative due to floating point errors.

cache : bool

If True, uses a very memory-intensive caching system to get a 2-3x speedup when calculating conditional means for the proposals. Defaults to True if n * (p**2) < 1e9.

Returns
Xk : np.ndarray

An (n, p)-shaped knockoff matrix in the original order the variables were passed in.

sample_proposals(X, prev_proposals, cond_mean=None, cond_var=None)[source]

Samples a continuous or discrete proposal given the design matrix and the previous proposals. Can pass in the conditional mean and variance of the new proposals, if cached, to save computation.

knockpy.metro.gaussian_log_likelihood(X, mu, var)[source]

Computes the Gaussian log-likelihood. (Somehow this is faster than scipy.)

knockpy.metro.get_ordering(T)[source]

Takes a junction tree and returns a variable ordering for the metro knockoff sampler. The code from this function is adapted from the code distributed with https://arxiv.org/abs/1903.00434.

Parameters
T : networkx.Graph

A networkx graph that is a junction tree. Nodes must be sets with elements 0,…,p-1.

Returns
order : np.ndarray

a numpy array with unique elements 0,…,p-1

active_frontier : list of lists

a list of length p where entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of a joint covariate distribution. See page 34 of https://arxiv.org/abs/1903.00434.

knockpy.metro.t_log_likelihood(X, df_t)[source]

UNNORMALIZED t log-likelihood. This is also faster than scipy.

knockpy.metro.t_markov_loglike(X, rhos, df_t=3)[source]

Calculates the log-likelihood for the Markov chain specified in https://arxiv.org/pdf/1903.00434.pdf

knockpy.metro.t_mvn_loglike(X, invScale, mu=None, df_t=3)[source]

Calculates the multivariate t log-likelihood up to a normalizing constant.

Parameters
X : np.ndarray

n x p array of data

invScale : np.ndarray

p x p array, the inverse multivariate t scale matrix

mu : np.ndarray

p-length array, the location parameter

df_t : float

degrees of freedom

S-matrix computation

knockpy.smatrix.compute_smatrix(Sigma, groups=None, method=None, solver=None, how_approx='blockdiag', max_block=1000, num_factors=20, num_processes=1, line_search=True, D=None, U=None, **kwargs)[source]

Wraps a variety of S-matrix generation functions. For mvr, maxent, mmi, and sdp methods, this can use a block-diagonal approximation of Sigma if the dimension of Sigma exceeds max_block or a factor approximation.

Parameters
Sigma : np.ndarray

(p, p)-shaped covariance matrix of X

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

method : str

Method for constructing the S-matrix. One of mvr, maxent, mmi, sdp, equicorrelated, ci.

solver : str

Method for solving for knockoffs. One of ‘cd’ (coordinate descent), ‘sdp’ (default sdp solver), or ‘psgd’ (projected gradient descent). Coordinate descent is recommended for MRC knockoffs.

how_approx : str

How to approximate the covariance matrix to speed up computation.

  • If ‘blockdiag’, approximates Sigma as a block-diagonal matrix.

  • If ‘factor’, approximates Sigma = np.diag(D) + np.dot(U, U.T), a factor model.

max_block : int

The maximum size of a block if how_approx=’blockdiag’. Defaults to 1000.

num_factors : int

The number of factors if how_approx=’factor’. Defaults to 20.

num_processes : int

Number of parallel processes to use if Sigma is approximated as a block-diagonal matrix.

D : np.ndarray

p-shaped array of diagonal elements for the factor model. Only used if how_approx=’factor’. This is optional if Sigma is not None.

U : np.ndarray

(p, k)-shaped matrix for the factor model. Usually k << p. Only used if how_approx=’factor’. This is optional if Sigma is not None.

line_search : bool

If False, MRC methods do not do a line search after the block-diagonal approximation.

kwargs : dict

kwargs to pass to one of the wrapped S-matrix solvers.

Returns
S : np.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs

Notes

mmi stands for minimum mutual information, which is the same as maximizing entropy.
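
A small sketch comparing two S-matrix constructions on an AR(1) covariance matrix (mvr_loss is documented below; a lower MVR loss roughly indicates higher power):

import numpy as np
from knockpy.smatrix import compute_smatrix
from knockpy.mrc import mvr_loss

p, rho = 50, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

S_mvr = compute_smatrix(Sigma, method='mvr')
S_eq = compute_smatrix(Sigma, method='equicorrelated')
print(mvr_loss(Sigma, S_mvr), mvr_loss(Sigma, S_eq))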

knockpy.smatrix.compute_smatrix_factored(Sigma, D=None, U=None, method='mvr', num_factors=20, **kwargs)[source]

Wraps S-matrix generation functions which approximate Sigma = np.diag(D) + np.dot(U, U.T).

Parameters
Sigma : np.ndarray

(p, p)-shaped covariance matrix of X

D : np.ndarray

p-shaped array of diagonal elements for the factor model. This is optional if Sigma is not None.

U : np.ndarray

(p, k)-shaped matrix for the factor model. Usually k << p. This is optional if Sigma is not None.

method : str

Method for constructing the S-matrix. One of mvr, maxent, mmi.

num_factors : int

The number of factors. Defaults to 20.

kwargs : dict

kwargs to pass to one of the wrapped S-matrix solvers.

Returns
S : np.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs

Notes

mmi stands for minimum mutual information, which is the same as maximizing entropy.

knockpy.smatrix.divide_computation(Sigma, max_block)[source]

Approximates a correlation matrix Sigma as a block-diagonal matrix using hierarchical clustering. Roughly follows the R knockoff package.

Parameters
Sigma : np.ndarray

(p, p)-shaped covariance matrix of X

max_block : int

Maximum size of a block in the block-diagonal approximation.

Returns
blocks : np.ndarray

(p,)-shaped numpy array where blocks[i] == j indicates that variable i belongs to block j.

knockpy.smatrix.merge_groups(groups, max_block)[source]

Merges groups of variables together while ensuring all new groups have size less than max_block.

knockpy.smatrix.parse_method(method, groups, p=None)[source]

Decides which method to use to create the knockoff S matrix

Methods for minimum-reconstructability knockoffs.

knockpy.mrc.cholupdate(R, x, add=True)[source]

Performs a rank-one update to a Cholesky factor R in place.

Parameters
Rnp.ndarray

A (p,p)-shaped upper-triangular matrix.

xnp.ndarray

A (p,)-shaped vector.

addbool

If True, performs a rank one update; else performs a rank one downdate.

Returns
Rnp.ndarray

Suppose the parameter R was a Cholesky factor of a matrix V. Upon return, R is the Cholesky factor of V +/- np.outer(x, x).

Notes

  • This function modifies both R and x in place.

  • The choldate package is a much faster and more numerically stable alternative.
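
Examples

A minimal sketch (not from the package docs) of a rank-one update. Note that R must be upper-triangular and both R and x are modified in place:

import numpy as np
from knockpy.mrc import cholupdate

V = 2 * np.eye(3)
R = np.linalg.cholesky(V).T            # upper-triangular Cholesky factor of V
x = np.array([1.0, 0.0, 0.0])
expected = V + np.outer(x, x)          # save the target before the in-place update
cholupdate(R, x, add=True)             # R now factors V + np.outer(x, x)
assert np.allclose(R.T @ R, expected)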

knockpy.mrc.maxent_loss(Sigma, S, smoothing=0)[source]

Computes the log determinant of the feature-knockoff precision matrix, which is proportional to the negative entropy of [X, tilde{X}].

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

Snp.ndarray

(p, p)-shaped S-matrix used to generate knockoffs

smoothingfloat

Add smoothing to all eigenvalues of the feature-knockoff precision matrix before taking the log determinant to avoid numerical instability. Defaults to 0.

Returns
lossfloat

The maxent loss for Sigma and S. This is infinite if S is not feasible.

knockpy.mrc.mmi_loss(*args, **kwargs)[source]

Computes the log determinant of the feature-knockoff precision matrix, which is proportional to the mutual information between X and the knockoffs.

This is identical to maxent_loss and exists only for backwards compatibility.

knockpy.mrc.mvr_loss(Sigma, S, smoothing=0)[source]

Computes the minimum variance-based reconstructability (MVR) loss for knockoffs, i.e., the trace of the feature-knockoff precision matrix.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

Snp.ndarray

(p, p)-shaped S-matrix used to generate knockoffs

smoothingfloat

Add smoothing to all eigenvalues of the feature-knockoff precision matrix before inverting to avoid numerical instability. Defaults to 0.

Returns
lossfloat

The MVR loss for Sigma and S. This is infinite if S is not feasible.
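
Examples

A minimal sketch (not from the package docs) comparing the MVR and maxent losses for an MVR-optimal S-matrix:

from knockpy import dgp, mrc

Sigma = dgp.AR1(p=30, rho=0.5)
S = mrc.solve_mvr(Sigma)
print(mrc.mvr_loss(Sigma, S), mrc.maxent_loss(Sigma, S))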

knockpy.mrc.solve_ciknock(Sigma, tol=1e-05, num_iter=10)[source]

Computes S-matrix used to generate conditional independence knockoffs.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

num_iterint

The number of iterations in the binary search to ensure S is feasible.

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs

Notes

When the S-matrix corresponding to conditional independence knockoffs is not feasible, this computes that S matrix and then does a binary search to find the maximum gamma such that gamma * S is feasible.

knockpy.mrc.solve_maxent(Sigma, tol=1e-05, verbose=False, num_iter=50, converge_tol=0.0001, choldate_warning=True)[source]

Computes S-matrix used to generate maximum entropy knockoffs using coordinate descent.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

verbosebool

If True, prints updates during optimization.

num_iterint

The number of coordinate descent iterations. Defaults to 50.

converge_tolfloat

A parameter specifying the criteria for convergence.

choldate_warningbool

If True, will warn the user if choldate is not installed. Defaults to True

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs
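
Examples

A minimal sketch (not from the package docs):

from knockpy import dgp, mrc

Sigma = dgp.AR1(p=50, rho=0.6)
S = mrc.solve_maxent(Sigma, choldate_warning=False)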

knockpy.mrc.solve_maxent_factored(D, U, tol=1e-05, verbose=False, num_iter=50, converge_tol=0.0001)[source]

Computes S-matrix used to generate maximum entropy knockoffs using coordinate descent assuming that the covariance matrix follows a factor model. This means Sigma = D + UU^T for a p x p diagonal matrix D and a p x k matrix U.

Parameters
Dnp.ndarray

p-shaped array of diagonal elements.

Unp.ndarray

(p, k)-shaped matrix. Usually k << p.

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

verbosebool

If True, prints updates during optimization.

num_iterint

The number of coordinate descent iterations. Defaults to 50.

converge_tolfloat

A parameter specifying the criteria for convergence.

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mrc.solve_mmi(*args, **kwargs)[source]

Computes S-matrix used to generate minimum mutual information knockoffs. This is identical to solve_maxent and exists only for backwards compatibility.

knockpy.mrc.solve_mvr(Sigma, groups=None, tol=1e-05, verbose=False, num_iter=50, smoothing=0, rej_rate=0, converge_tol=0.001, choldate_warning=True)[source]

Computes S-matrix used to generate minimum variance-based reconstructability knockoffs using coordinate descent.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

verbosebool

If True, prints updates during optimization.

num_iterint

The number of coordinate descent iterations. Defaults to 50.

smoothingfloat

Add smoothing to all eigenvalues of the feature-knockoff precision matrix before inverting to avoid numerical instability. Defaults to 0.

converge_tolfloat

A parameter specifying the criteria for convergence.

choldate_warningbool

If True and groups is None, will warn the user if choldate is not installed. Defaults to True.

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mrc.solve_mvr_factored(D, U, tol=1e-05, verbose=False, num_iter=15, converge_tol=0.0001)[source]

Computes S-matrix used to generate mvr knockoffs using coordinate descent assuming that the covariance matrix follows a factor model. This means Sigma = D + UU^T for a p x p diagonal matrix D and a p x k matrix U.

Parameters
Dnp.ndarray

p-shaped array of diagonal elements.

Unp.ndarray

(p, k)-shaped matrix. Usually k << p.

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

verbosebool

If True, prints updates during optimization.

num_iterint

The number of coordinate descent iterations. Defaults to 15.

converge_tolfloat

A parameter specifying the criteria for convergence.

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs
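
Examples

A minimal sketch (not from the package docs): estimate a factor decomposition of Sigma with knockpy.utilities.estimate_factor, then solve for S:

from knockpy import dgp, mrc, utilities

Sigma = dgp.FactorModel(p=300, rank=3)
D, U = utilities.estimate_factor(Sigma, num_factors=3)
S = mrc.solve_mvr_factored(D, U)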

knockpy.mac.calc_min_group_eigenvalue(Sigma, groups, tol=1e-05, verbose=False)[source]

Calculates the minimum “group” eigenvalue of a covariance matrix Sigma: see Dai and Barber 2016. This is useful for constructing equicorrelated (group) knockoffs.

knockpy.mac.solve_SDP(Sigma, verbose=False, num_iter=10, tol=1e-05)[source]

Solves ungrouped SDP to create S-matrix for MAC-minimizing knockoffs.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

verbosebool

If True, prints updates during optimization.

num_iterint

Number of iterations in a final binary search to account for numerical errors and ensure 2Sigma - S is PSD.

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

Returns
Snp.ndarray

(p, p)-shaped diagonal S-matrix used to generate knockoffs

knockpy.mac.solve_equicorrelated(Sigma, groups, tol=1e-05, verbose=False, num_iter=10)[source]

Calculates the block diagonal matrix S using the equicorrelated method described by Dai and Barber 2016.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mac.solve_group_SDP(Sigma, groups=None, verbose=False, objective='abs', norm_type=2, num_iter=10, tol=1e-05, dsdp_warning=True, **kwargs)[source]

Solves the MAC-minimizing SDP formulation for group knockoffs; extends Barber and Candes 2015 / Candes et al 2018.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

verbosebool

If True, prints updates during optimization.

objectivestr

How to optimize the S matrix for group knockoffs. There are several options:

  • ‘abs’: minimize sum(abs(Sigma - S)).

  • ‘pnorm’: minimize the Lp-th matrix norm.

  • ‘norm’: minimize a different type of matrix norm (see norm_type below).

norm_typestr or int

  • When objective == ‘pnorm’, a float specifying which Lp-th matrix norm to use. Can be any float >= 1.

  • When objective == ‘norm’, can be ‘fro’, ‘nuc’, np.inf, or 1.

num_iterint

Number of iterations in a final binary search to account for numerical errors and ensure 2Sigma - S is PSD.

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

kwargsdict

Keyword arguments to pass to the cvxpy.Problem.solve() method.

Returns
Snp.ndarray

(p, p)-shaped (block) diagonal matrix used to generate knockoffs
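
Examples

A minimal sketch (not from the package docs; assumes a working cvxpy installation) using a hypothetical grouping of 20 variables into four contiguous groups of five:

import numpy as np
from knockpy import dgp, mac

Sigma = dgp.AR1(p=20, rho=0.5)
groups = np.repeat(np.arange(1, 5), 5)   # groups 1..4, five variables each
S = mac.solve_group_SDP(Sigma, groups=groups)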

Kpytorch

The kpytorch submodule includes a miscellaneous set of classes and methods which rely on pytorch for automatic differentiation.

class knockpy.kpytorch.deeppink.DeepPinkModel(p, hidden_sizes=[64], y_dist='gaussian', normalize_Z=True)[source]

Methods

add_module(name, module)

Adds a child module to the current module.

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self.

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

buffers([recurse])

Returns an iterator over module buffers.

children()

Returns an iterator over immediate children modules.

cpu()

Moves all model parameters and buffers to the CPU.

cuda([device])

Moves all model parameters and buffers to the GPU.

double()

Casts all floating point parameters and buffers to double datatype.

eval()

Sets the module in evaluation mode.

extra_repr()

Set the extra representation of the module

float()

Casts all floating point parameters and buffers to float datatype.

forward(features)

Note: features are now shuffled

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

get_extra_state()

Returns any extra state to include in the module's state_dict.

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

half()

Casts all floating point parameters and buffers to half datatype.

load_state_dict(state_dict[, strict])

Copies parameters and buffers from state_dict into this module and its descendants.

modules()

Returns an iterator over all modules in the network.

named_buffers([prefix, recurse])

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

named_modules([memo, prefix, remove_duplicate])

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

named_parameters([prefix, recurse])

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

parameters([recurse])

Returns an iterator over module parameters.

predict(features)

Wraps forward method, for compatibility with sklearn classes.

register_backward_hook(hook)

Registers a backward hook on the module.

register_buffer(name, tensor[, persistent])

Adds a buffer to the module.

register_forward_hook(hook)

Registers a forward hook on the module.

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

register_full_backward_hook(hook)

Registers a backward hook on the module.

register_module(name, module)

Alias for add_module().

register_parameter(name, param)

Adds a parameter to the module.

requires_grad_([requires_grad])

Change if autograd should record operations on parameters in this module.

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict.

share_memory()

See torch.Tensor.share_memory_()

state_dict([destination, prefix, keep_vars])

Returns a dictionary containing a whole state of the module.

to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

train([mode])

Sets the module in training mode.

type(dst_type)

Casts all parameters and buffers to dst_type.

xpu([device])

Moves all model parameters and buffers to the XPU.

zero_grad([set_to_none])

Sets gradients of all model parameters to zero.

__call__

feature_importances

l1norm

l2norm

forward(features)[source]

Note: features are now shuffled

predict(features)[source]

Wraps forward method, for compatibility with sklearn classes.

Gradient-based methods for solving MRC problems. Currently only used for group-knockoffs.

class knockpy.kpytorch.mrcgrad.MVRLoss(Sigma, groups, init_S=None, invSigma=None, rec_prop=0, smoothing=0.01, min_smoothing=0.0001, method='mvr')[source]

A pytorch class to compute S-matrices for (gaussian) MX knockoffs which minimizes the trace of the feature-knockoff precision matrix (the inverse of the feature-knockoff covariance/Gram matrix, G).

Parameters

Sigmanp.ndarray

p x p numpy matrix. Must already be sorted by groups.

groupsnp.ndarray

p-length numpy array of groups. These must already be sorted and correspond to Sigma.

init_S

The initialization values for the S block-diagonal matrix. Either a p x p matrix, whose block-diagonal (as specified by groups) gives the initial values for S, or a list of square numpy matrices, with the ith element corresponding to the block of the ith group in S. Defaults to half of the identity.

rec_propfloat

The proportion of data you are planning to recycle, i.e., the proportion of knockoffs that will be recycled. (The optimal S matrix depends on the recycling proportion.)

smoothingfloat

Calculate the loss as sum 1/(eigs + smoothing) as opposed to sum 1/eigs. This is helpful if fitting lasso statistics on extremely degenerate covariance matrices. Over the course of optimization, this smoothing parameter will go to 0.

methodstr

One of ‘mvr’ or ‘maxent’ (‘mmi’ for backwards compatibility).

Methods

add_module(name, module)

Adds a child module to the current module.

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self.

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

buffers([recurse])

Returns an iterator over module buffers.

children()

Returns an iterator over immediate children modules.

cpu()

Moves all model parameters and buffers to the CPU.

cuda([device])

Moves all model parameters and buffers to the GPU.

double()

Casts all floating point parameters and buffers to double datatype.

eval()

Sets the module in evaluation mode.

extra_repr()

Set the extra representation of the module

float()

Casts all floating point parameters and buffers to float datatype.

forward([smoothing])

Calculates the trace of the inverse Gram feature-knockoff matrix

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

get_extra_state()

Returns any extra state to include in the module's state_dict.

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

half()

Casts all floating point parameters and buffers to half datatype.

load_state_dict(state_dict[, strict])

Copies parameters and buffers from state_dict into this module and its descendants.

modules()

Returns an iterator over all modules in the network.

named_buffers([prefix, recurse])

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

named_modules([memo, prefix, remove_duplicate])

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

named_parameters([prefix, recurse])

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

parameters([recurse])

Returns an iterator over module parameters.

project(**kwargs)

Project by scaling sqrt_S

pull_S()

Returns the S matrix

register_backward_hook(hook)

Registers a backward hook on the module.

register_buffer(name, tensor[, persistent])

Adds a buffer to the module.

register_forward_hook(hook)

Registers a forward hook on the module.

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

register_full_backward_hook(hook)

Registers a backward hook on the module.

register_module(name, module)

Alias for add_module().

register_parameter(name, param)

Adds a parameter to the module.

requires_grad_([requires_grad])

Change if autograd should record operations on parameters in this module.

scale_sqrt_S(tol, num_iter)

Scales sqrt_S such that 2 Sigma - S is PSD.

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict.

share_memory()

See torch.Tensor.share_memory_()

state_dict([destination, prefix, keep_vars])

Returns a dictionary containing a whole state of the module.

to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

train([mode])

Sets the module in training mode.

type(dst_type)

Casts all parameters and buffers to dst_type.

update_sqrt_S()

Updates sqrt_S using the block parameters

xpu([device])

Moves all model parameters and buffers to the XPU.

zero_grad([set_to_none])

Sets gradients of all model parameters to zero.

__call__

forward(smoothing=None)[source]

Calculates the trace of the inverse Gram feature-knockoff matrix

project(**kwargs)[source]

Project by scaling sqrt_S

pull_S()[source]

Returns the S matrix

scale_sqrt_S(tol, num_iter)[source]

Scales sqrt_S such that 2 Sigma - S is PSD.

update_sqrt_S()[source]

Updates sqrt_S using the block parameters

class knockpy.kpytorch.mrcgrad.PSGDSolver(Sigma, groups, losscalc=None, lr=0.01, verbose=False, max_epochs=100, tol=1e-05, line_search_iter=10, convergence_tol=0.1, **kwargs)[source]

Projected gradient descent to solve for MRC knockoffs. This also works for non-convex loss objectives, although it is a heuristic optimization method.

Parameters
Sigmanp.ndarray

p x p numpy array, the correlation matrix.

groupsnp.ndarray

p-length numpy array specifying the groups.

losscalc

A pytorch class wrapping nn.Module which contains the following methods: .forward(), which calculates the loss based on the internally stored S matrix; .project(), which ensures that both the internally stored S matrix and 2*Sigma - S are PSD; and .pull_S(), which returns the internally stored S matrix. If None, creates an MVRLoss class.

lrfloat

Initial learning rate (defaults to 1e-2).

verbosebool

If True, reports progress.

max_epochsint

Maximum number of epochs in SGD.

tolfloat

Minimum eigenvalue allowed for PSD matrices.

line_search_iterint

Number of line searches to do when scaling sqrt_S.

convergence_tolfloat

After each projection, we calculate improvement = (2/3) * ||prev_opt_S - opt_S||_1 + (1/3) * improvement; when improvement < convergence_tol, we return.

kwargsdict

Passed to MVRLoss.

Methods

optimize()

See __init__ for arguments.

cache_S

optimize()[source]

See __init__ for arguments.

knockpy.kpytorch.mrcgrad.block_diag_sparse(*arrs)[source]

Given a list of 2D torch tensors, creates a sparse block-diagonal matrix. See https://github.com/pytorch/pytorch/issues/31942.

knockpy.kpytorch.mrcgrad.solve_mrc_psgd(Sigma, groups=None, method='mvr', **kwargs)[source]

Wraps the PSGDSolver class.

Parameters
Sigmanp.ndarray

Covariance matrix.

groupsnp.ndarray

Groups for group knockoffs.

methodstr

MRC loss: one of ‘mvr’ or ‘maxent’.

init_kwargsdict

kwargs to pass to PSGDSolver.

optimize_kwargsdict

kwargs to pass to the optimizer method.

Returns
opt_Snp.ndarray

The optimized S matrix.
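
Examples

A minimal sketch (not from the package docs; requires pytorch) using a hypothetical grouping of 20 variables into four groups:

import numpy as np
from knockpy import dgp
from knockpy.kpytorch import mrcgrad

Sigma = dgp.AR1(p=20, rho=0.5)
groups = np.repeat(np.arange(1, 5), 5)
S = mrcgrad.solve_mrc_psgd(Sigma, groups=groups, method='mvr')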

Gaussian graphical model discovery

Knockoffs for detecting edges in gaussian graphical models. See https://arxiv.org/pdf/1908.11611.pdf.

class knockpy.ggm.KnockoffGGM(fstat='lcd', fstat_kwargs=None, knockoff_kwargs=None)[source]

Tests for edges in a Gaussian Graphical Model. See Li and Maathuis (2019) for details.

Parameters
fstatstr

The feature statistic to use in the knockoff filter: this must be a string and must be a valid fixed-X knockoff feature statistic. Identifiers include:

  • ‘lasso’ or ‘lcd’: lasso coefficient differences

  • ‘lsm’: signed maximum of the lasso path statistic as in Barber and Candes 2015

  • ‘ols’: ordinary least squares coefficients

  • ‘margcorr’: marginal correlations between features and response

fstat_kwargsdict

Kwargs to pass to the feature statistic fit function, excluding the required arguments, defaults to {}

knockoff_kwargsdict

Kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. Defaults to {}

Notes

There is no known way to use model-X knockoffs for this application.

Examples

Here we fit KnockoffGGM under the global null when the true Gaussian graphical model has no edges:

# Fake data-generating process for Gaussian graphical model
import numpy as np
X = np.random.randn(300, 30)

# LCD statistic with FX knockoffs
from knockpy.ggm import KnockoffGGM
gkf = KnockoffGGM(
    fstat='lcd',
    knockoff_kwargs={"method":"mvr"},
)
edges = gkf.forward(X=X, verbose=True)

Attributes
Wnp.array

(p,p)-shaped array of knockoff statistics with zeros along the diagonal. The rows of W obey the flip-sign property, i.e., W[0] obeys the flip-sign property.

kfslist

A list of KnockoffFilter classes corresponding to the regression run on each covariate.

Methods

forward(X[, logic, fdr, ggm_kwargs, verbose])

Runs the GGM filter by applying fixed-X knockoffs

forward(X, logic='and', fdr=0.1, ggm_kwargs=None, verbose=True)[source]

Runs the GGM filter by applying fixed-X knockoffs to each column of X using the other columns as covariates.

Parameters
Xnp.ndarray

(n, p)-shaped design matrix.

fdrfloat

Nominal level at which to control the FDR.

logicstring

One of ‘and’ or ‘or’. This is a hyperparameter used to determine the rejection set. Defaults to “and”.

ggm_kwargsdict

Dictionary of hyperparameters to pass to the ggm.compute_ggm_threshold function. Defaults to {}.

verbosebool

If true, log progress over time.

Returns
edgesnp.array

(p,p)-shaped symmetric boolean array where edges[i,j] is true if and only if edge (i,j) has been discovered.

Notes

This requires fitting knockoffs p times, so it is quite expensive.

knockpy.ggm.compute_ggm_threshold(W, fdr=0.1, logic='and', a=1, offset=1)[source]
Parameters
Wnp.array

(p,p)-shaped array of knockoff statistics with zeros along the diagonal. The rows of W must obey the flip-sign property.

fdrfloat

Desired level of FDR control.

logicstring

One of ‘and’ or ‘or’. This is a hyperparameter used to determine the rejection set. Defaults to “and”.

afloat

One of 0.01 or 1. Hyperparameter used to determine the rejection threshold. See Li and Maathuis (2019) for discussion.

offsetint

If offset = 0, control the modified FDR. If offset = 1 (default), controls the FDR exactly.

Returns
Tnp.array

(p,)-shaped array of knockoff thresholds

Notes

See Algorithm 2, https://arxiv.org/pdf/1908.11611.pdf

knockpy.ggm.discovered_edges(W, T, logic='and')[source]
Parameters
Wnp.array

(p,p)-shaped array of knockoff statistics with zeros along the diagonal. The rows of W must obey the flip-sign property.

Tnp.array

(p,)-shaped array of knockoff thresholds

logicstring

One of ‘and’ or ‘or’. This is a hyperparameter used to determine the rejection set. Defaults to “and”.

Returns
edgesnp.array

(p,p)-shaped symmetric boolean array where edges[i,j] is true if and only if edge (i,j) has been discovered.
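
Examples

A minimal sketch (not from the package docs) of the thresholding pipeline. The random W below is purely illustrative; in practice W comes from KnockoffGGM and its rows obey the flip-sign property:

import numpy as np
from knockpy import ggm

p = 10
W = np.random.randn(p, p)    # illustrative only
np.fill_diagonal(W, 0)
T = ggm.compute_ggm_threshold(W, fdr=0.1, logic='and')
edges = ggm.discovered_edges(W, T, logic='and')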

Quickly creating data-generating processes

A collection of functions for generating synthetic datasets.

knockpy.dgp.AR1(p=30, a=1, b=1, tol=None, max_corr=1, rho=None)[source]

Generates a p-dimensional correlation matrix for an AR(1) Gaussian process, where successive correlations are drawn independently from a Beta(a, b) distribution. If rho is specified, the process is instead stationary with correlation rho.

class knockpy.dgp.DGP(mu=None, Sigma=None, invSigma=None, beta=None, gibbs_graph=None)[source]

A utility class which creates a (random) data-generating process for a design matrix X and a response y. If the parameters are not specified, they will be randomly generated during the sample_data method.

Parameters
munp.ndarray

(p,)-shaped mean vector for X data

Sigmanp.ndarray

(p, p)-shaped covariance matrix for X data

invSigmanp.ndarray

(p, p)-shaped precision matrix for X data

betanp.ndarray

coefficients used to generate y from X in a single index or sparse additive model.

gibbs_graphnp.ndarray

(p, p)-shaped matrix of coefficients for gibbs grid method.

Attributes
munp.ndarray

See above

Sigmanp.ndarray

See above

invSigmanp.ndarray

See above

betanp.ndarray

See above

gibbs_graphnp.ndarray

See above

Xnp.ndarray

(n, p)-shaped design matrix

ynp.ndarray

(n, )-shaped response vector

Methods

sample_data([p, n, x_dist, method, y_dist, ...])

(Possibly) generates random data-generating parameters and then samples data using those parameters.

sample_data(p=100, n=50, x_dist='gaussian', method='AR1', y_dist='gaussian', cond_mean='linear', coeff_size=1, coeff_dist=None, sparsity=0.5, groups=None, sign_prob=0.5, iid_signs=True, corr_signals=False, df_t=3, **kwargs)[source]

(Possibly) generates random data-generating parameters and then samples data using those parameters. By default, (X, y) are jointly Gaussian and y is a linear response to X.

Parameters
nint

The number of data points

pint

The dimensionality of the data

x_diststr

Specifies the distribution of X. One of “Gaussian”, “blockt”, “ar1t”, or “gibbs”.

methodstr

How to generate the covariance matrix of X. One of “AR1”, “NestedAR1”, “partialcorr”, “factor”, “blockequi”, “ver”, “qer”, “dirichlet”, or “uniformdot”. See the docs for each of these methods.

y_diststr

Specifies the distribution of y. One of “Gaussian” or “binomial”.

cond_meanstr

How to calculate the conditional mean of y given X. Defaults to “linear”. See knockpy.dgp.sample_response for more options.

coeff_size: float

Size of non-zero coefficients

coeff_diststr

Specifies the distribution of nonzero coefficients. Three options:

  • None: all coefficients have absolute value coeff_size.

  • ‘normal’: all nonzero coefficients are drawn from Normal(coeff_size, 1).

  • ‘uniform’: nonzero coefficients are drawn from Unif(coeff_size/2, coeff_size).

sparsityfloat

Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.

groupsnp.ndarray

A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. Else, floor(sparsity * num_groups) groups will be chosen to be non-null, where all elements of each group are non-null.

sign_probfloat

The probability that each nonzero coefficient will be positive.

iid_signsbool

If True, the signs of the coeffs are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.

corr_signalsbool

If true, all of the nonzero coefficients will lie in a consecutive block.

df_tfloat

If the X variables are marginally t-distributed, the degrees of freedom.

kwargs: dict

keyword arguments to pass to method for generating the covariance matrix.
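
Examples

A minimal sketch (not from the package docs) sampling a t-distributed AR(1) design with a binomial response:

from knockpy import dgp

dgprocess = dgp.DGP()
dgprocess.sample_data(
    n=200, p=100, x_dist='ar1t', df_t=5, y_dist='binomial', sparsity=0.2
)
X, y = dgprocess.X, dgprocess.y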

knockpy.dgp.DirichletCorr(p=100, temp=1, tol=1e-06)[source]

Generates a correlation matrix by sampling p eigenvalues from a Dirichlet distribution whose p parameters are i.i.d. uniform on [tol, temp], then generating a random covariance matrix with those eigenvalues.

knockpy.dgp.ErdosRenyi(p=300, delta=0.2, lower=0.1, upper=1, tol=0.1)[source]

Randomly samples bernoulli flags as well as values for partial correlations to generate sparse square matrices. Follows https://arxiv.org/pdf/1908.11611.pdf.

knockpy.dgp.FactorModel(p=500, rank=2)[source]

Generates random correlation matrix from a factor model with dimension p and rank rank.

knockpy.dgp.NestedAR1(p=500, a=7, b=1, tol=0.001, num_nests=5, nest_size=2)[source]

Generates correlation matrix for AR(1) Gaussian process with hierarchical correlation structure.

knockpy.dgp.PartialCorr(p=300, rho=0.3)[source]

Creates a correlation matrix of dimension p with partial correlation rho.

knockpy.dgp.UniformDot(d=100, p=100, tol=0.01)[source]

Let U be a random d x p matrix with i.i.d. uniform entries. Then Sigma = ``cov2corr``(U^T U)

knockpy.dgp.Wishart(d=100, p=100, tol=0.01)[source]

Let W be a random d x p matrix with i.i.d. Gaussian entries. Then Sigma = ``cov2corr``(W^T W).

knockpy.dgp.block_equi_graph(n=3000, p=1000, block_size=5, sparsity=0.1, rho=0.5, gamma=0, coeff_size=3.5, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, beta=None, mu=None, **kwargs)[source]

Samples data according to a block-equicorrelated Gaussian design, where rho is the within-block correlation and gamma * rho is the between-block correlation.

Parameters
nint

The number of data points

pint

The dimensionality of the data

block_sizeint

The size of blocks. Defaults to 5.

sparsityfloat

The proportion of groups which are null. Defaults to 0.1

rhofloat

The within-group correlation

gammafloat

The between-group correlation is gamma * rho

betanp.ndarray

If supplied, the (p,)-shaped set of coefficients. Else, will be generated by calling knockpy.dgp.create_sparse_coefficients.

munp.ndarray

The p-dimensional mean of the covariates. Defaults to 0.

kwargsdict

Args passed to knockpy.dgp.sample_response.

Notes

This defaults to the same data-generating process as Dai and Barber 2016 (see https://arxiv.org/abs/1602.03589).

knockpy.dgp.construct_gibbs_grid(n, p, temp=1)[source]

Creates gridlike gibbs_graph parameter for sample_gibbs. See sample_gibbs.

knockpy.dgp.coords2num(l, w, gridwidth=10)[source]

Takes coordinates of variable in a Gibbs grid, returns position

knockpy.dgp.cov2blocks(V, tol=1e-05)[source]

Decomposes a PREORDERED block-diagonal matrix V into its blocks.

knockpy.dgp.create_correlation_tree(corr_matrix, method='single')[source]

Creates hierarchical clustering (correlation tree) from a correlation matrix

Parameters
corr_matrixnp.ndarray

(p, p)-shaped correlation matrix

methodstr

the method of hierarchical clustering: ‘single’, ‘average’, ‘fro’, or ‘complete’. Defaults to ‘single’.

Returns
linknp.ndarray

The link of the correlation tree, as in scipy

knockpy.dgp.create_grouping(Sigma, cutoff=0.95, **kwargs)[source]

Creates a knockoff grouping from a covariance matrix using hierarchical clustering.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix

cutofffloat

The correlation at which to “cut” the correlation tree. For method=’single’, cutoff=0.95 ensures that no two variables in different groups have absolute correlation greater than 0.95.

Returns
groupsnp.ndarray

(p,)-shaped array of groups for each variable

knockpy.dgp.create_sparse_coefficients(p, sparsity=0.5, groups=None, coeff_size=1, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, n=None)[source]

Generate a set of sparse coefficients for single index or sparse additive models.

Parameters
pint

Dimensionality of coefficients

sparsityfloat

Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.

coeff_size: float

Size of non-zero coefficients

groupsnp.ndarray

A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. Else, floor(sparsity * num_groups) groups will be chosen to be non-null, where all elements of each group are non-null.

sign_probfloat

The probability that each nonzero coefficient will be positive.

iid_signsbool

If True, the signs of the coeffs are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.

coeff_diststr

Specifies the distribution of nonzero coefficients. Three options:

  • None: all coefficients have absolute value coeff_size.

  • ‘normal’: all nonzero coefficients are drawn from Normal(coeff_size, 1).

  • ‘uniform’: nonzero coefficients are drawn from Unif(coeff_size/2, coeff_size).

corr_signalsbool

If true, all of the nonzero coefficients will lie in a consecutive block.

Returns
betanp.ndarray

(p,)-shaped array of sparse coefficients

knockpy.dgp.graph2cliques(Q)[source]

Turns graph Q of connections into binary cliques for Gibbs grid

knockpy.dgp.num2coords(i, gridwidth=10)[source]

Coordinates of variable i in a Gibbs grid

Parameters
iint

Position of variable in ordering

gridwidthint

Width of the grid

Returns
length_coord, width_coord (coordinates)
knockpy.dgp.sample_ar1t(rhos, n=50, df_t=3)[source]

Samples t-distributed variables according to a Markov chain.

Parameters
rhosnp.ndarray

(p-1, )-length array of correlations between consecutive variables.

nint

The number of data points to sample

df_tfloat

The degrees of freedom of the t variables

Returns
Xnp.ndarray

(n, p)-shaped design matrix

knockpy.dgp.sample_block_tmvn(blocks, n=50, df_t=3)[source]

Samples blocks of multivariate t-distributed variables according to a list of covariance matrices called blocks.

Parameters
blockslist

A list of square, symmetric numpy arrays. These are the covariance matrices for each block of variables.

nint

The number of data points to sample

df_tfloat

The degrees of freedom of the t-distribution

Notes

The variables are scaled such that the marginal variance of each variable equals a diagonal element in one of the blocks.

knockpy.dgp.sample_gibbs(n, p, gibbs_graph=None, temp=1, num_iter=15, K=20, max_val=2.5)[source]

Samples from a discrete Gibbs measure using a gibbs sampler.

The joint likelihood for the p-dimensional data X1 through Xp is the product of all terms of the following form:

`np.exp(-1*gibbs_graph[i,j]*np.abs(Xi - Xj))`

where gibbs_graph is assumed to be symmetric (and is usually sparse). Each feature takes values on an evenly spaced grid from -1*max_val to max_val with K values.

Parameters
nint

How many data points to sample

pint

Dimensionality of the data

gibbs_graphnp.ndarray

A symmetric (p, p)-shaped array. See the likelihood equation. By default, this corresponds to an undirected graphical model with a square grid (like an Ising model) with nonzero entries set to 1 or -1 with equal probability.

tempfloat

Governs the strength of interactions between features—see the likelihood equation.

num_iterint

Number of iterations in the Gibbs sampler; defaults to 15.

Kint

Number of discrete values each sampled feature can take.

max_valfloat

The maximum absolute value each feature can take.

Returns
Xnp.ndarray

(n, p)-shaped array of data

gibbs_graphnp.ndarray

The generated gibbs_graph.

knockpy.dgp.sample_response(X, beta, cond_mean='linear', y_dist='gaussian')[source]

Given a design matrix X and coefficients beta, samples a response y.

Parameters
Xnp.ndarray

(n, p)-shaped design matrix

betanp.ndarray

(p, )-shaped coefficient vector

cond_meanstr

How to calculate the conditional mean of y given X, denoted mu(X). Six options:

  1. “linear”: mu(X) = np.dot(X, beta)

  2. “cubic”: mu(X) = np.dot(X**3, beta) - np.dot(X, beta)

  3. “trunclinear”: mu(X) = ((X * beta >= 1).sum(axis=1)). Stands for truncated linear.

  4. “pairint”: pairs up non-null coefficients according to the order of beta, multiplies them and their beta values, then sums. “pairint” stands for pairwise-interactions.

  5. “cos”: mu(X) = sign(beta) * (beta != 0) * np.cos(X)

  6. “quadratic”: mu(X) = np.dot(np.power(X, 2), beta)

y_diststr

If “gaussian”, y is the conditional mean plus gaussian noise. If “binomial”, Pr(y=1) = softmax(cond_mean).

Returns
ynp.ndarray

(n,)-shaped response vector
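
Examples

A minimal sketch (not from the package docs) of a nonlinear conditional mean with a binomial response:

import numpy as np
from knockpy import dgp

X = np.random.randn(100, 20)
beta = dgp.create_sparse_coefficients(p=20, sparsity=0.3)
y = dgp.sample_response(X, beta, cond_mean='cubic', y_dist='binomial')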

Utility functions

knockpy.utilities.apply_pool(func, constant_inputs={}, num_processes=1, **kwargs)[source]

Spawns num_processes processes to apply func to many different arguments. This wraps the multiprocessing.pool object plus the functools partial function.

Parameters
funcfunction

An arbitrary function

constant_inputsdictionary

A dictionary of arguments to func which do not change in each of the processes spawned, defaults to {}.

num_processesint

The maximum number of processes spawned, defaults to 1.

kwargsdict

Each key should correspond to an argument to func and should map to a list of different arguments.

Returns
outputslist

List of outputs for each input, in the order of the inputs.

Examples

If we are varying inputs ‘a’ and ‘b’, we might call

apply_pool(func=my_func, a=[1,3,5], b=[2,4,6])

which would return [my_func(a=1, b=2), my_func(a=3, b=4), my_func(a=5, b=6)].
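
A runnable variant of the sketch above, where my_func is a hypothetical module-level function and constant_inputs holds an argument shared across all calls:

from knockpy.utilities import apply_pool

def my_func(a, b, scale=1):
    return scale * (a + b)

outputs = apply_pool(
    func=my_func, constant_inputs={'scale': 10}, num_processes=1,
    a=[1, 3, 5], b=[2, 4, 6],
)
print(outputs)  # [30, 70, 110]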

knockpy.utilities.blockdiag_to_blocks(M, groups)[source]

Given a square array M, returns a list of diagonal blocks of M as specified by groups.

Parameters
Mnp.ndarray

(p, p)-shaped array

groupsnp.ndarray

(p, )-shaped array with m unique values. If groups[i] == j, this indicates that coordinate i belongs to group j.

Returns
blockslist

A list of square np.ndarrays. blocks[i] corresponds to group identified by the ith smallest unique value of groups.

knockpy.utilities.calc_group_sizes(groups)[source]

Given a list of groups, finds the sizes of the groups.

Parameters
groupsnp.ndarray

(p, )-shaped array which takes m integer values from 1 to m. If groups[i] == j, this indicates that coordinate i belongs to group j.

Returns
sizesnp.ndarray

(m, )-length array of group sizes.

knockpy.utilities.calc_mineig(M)[source]

Calculates the minimum eigenvalue of a square symmetric matrix M

knockpy.utilities.chol2inv(X)[source]

Uses cholesky decomp to get inverse of matrix

knockpy.utilities.cov2corr(M, invM=None)[source]

Rescales a p x p cov. matrix M to be a correlation matrix. If invM is not None, also rescales invM to be the inverse of M.

knockpy.utilities.estimate_covariance(X, tol=0.0001, shrinkage='ledoitwolf', **kwargs)[source]

Estimates covariance matrix of X.

Parameters
Xnp.ndarray

(n, p)-shaped design matrix

shrinkagestr

The type of shrinkage to apply during estimation. One of “ledoitwolf”, “graphicallasso”, or None (no shrinkage).

tolfloat

If shrinkage is None but the minimum eigenvalue of the MLE is below tol, apply LedoitWolf shrinkage anyway.

kwargsdict

kwargs to pass to the shrinkage estimator.

Returns
Sigmanp.ndarray

(p, p)-shaped estimated covariance matrix of X

invSigmanp.ndarray

(p, p)-shaped estimated precision matrix of X

knockpy.utilities.estimate_factor(Sigma, num_factors=20, num_iter=10)[source]

Approximates Sigma = np.diag(D) + np.dot(U, U.T). See https://arxiv.org/pdf/2006.08790.pdf.

Parameters
Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

num_factorsint

Dimensionality of U.

Returns
Dnp.ndarray

(p,)-shaped array of diagonal elements.

Unp.ndarray

(p, num_factors)-shaped array.

Notes

TODO: allow X as an input when Sigma does not fit in memory.

knockpy.utilities.fetch_group_nonnulls(non_nulls, groups)[source]

Combines feature-level null hypotheses into group-level hypothesis.

knockpy.utilities.permute_matrix_by_groups(groups)[source]

Create indices which permute a (covariance) matrix according to a list of groups.

knockpy.utilities.preprocess_groups(groups)[source]

Maps the m unique elements of a 1D “groups” array to the integers from 1 to m.

knockpy.utilities.random_permutation_inds(length)[source]

Returns indexes which will randomly permute/unpermute a numpy array of length length. Also returns indices which will undo this permutation.

Returns
indsnp.ndarray

(length,)-shaped ndarray corresponding to a random permutation from 0 to length-1.

rev_indsnp.ndarray

(length,)-shaped ndarray such that for any (length,)-shaped array called x, x[inds][rev_inds] equals x.
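
Examples

A minimal sketch (not from the package docs):

import numpy as np
from knockpy.utilities import random_permutation_inds

x = np.arange(5)
inds, rev_inds = random_permutation_inds(5)
assert np.all(x[inds][rev_inds] == x)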

knockpy.utilities.scale_until_PSD(Sigma, S, tol, num_iter)[source]

Perform a binary search to find the largest gamma such that the minimum eigenvalue of 2*Sigma - gamma*S is at least tol.

Returns
gamma * Snp.ndarray

See description.

gammafloat

See description

knockpy.utilities.shift_until_PSD(M, tol)[source]

Add the identity until a p x p matrix M has eigenvalues of at least tol