API Reference

The KnockoffFilter

class knockpy.knockoff_filter.KnockoffFilter(fstat='lasso', ksampler='gaussian', fstat_kwargs={}, knockoff_kwargs={})[source]

Performs knockoff-based inference, from start to finish.

This wraps both the knockoffs.KnockoffSampler and knockoff_stats.FeatureStatistic classes.

Parameters:

fstatstr or knockpy.knockoff_stats.FeatureStatistic

The feature statistic to use in the knockoff filter. Defaults to a lasso statistic. This may also be a string identifier, including:

‘lasso’ or ‘lcd’: cross-validated lasso coefficients differences
‘mlr’: the masked likelihood ratio (MLR) statistic, which has provable optimality properties for knockoffs
‘lsm’: signed maximum of the lasso path statistic as in Barber and Candes 2015
‘dlasso’: Cross-validated debiased lasso coefficients
‘ridge’: Cross validated ridge coefficients
‘ols’: Ordinary least squares coefficients
‘margcorr’: marginal correlations between features and response
‘deeppink’: The deepPINK statistic as in Lu et al. 2018
‘mlrspline’ or ‘spline’: The MLR statistic applied to regression spline basis functions.
‘randomforest’: A random forest with swap importances

Note that when using FX knockoffs, the lcd statistic does not use cross-valdilidation and sets alphas = sigma2 * sqrt(log(p) / n).

ksamplerstr or knockpy.knockoffs.KnockoffSampler

The knockoff sampler to use in the knockoff filter. This may also be a string identifier, including:

‘gaussian’: Gaussian Model-X knockoffs
‘fx’: Fixed-X knockoffs.
‘metro’: Generic metropolized knockoff sampler
‘artk’: t-tailed Markov chain.
‘blockt’: Blocks of t-distributed
‘gibbs_grid’: Discrete gibbs grid

An alternative to specifying the ksampler is to simply pass in a knockoff matrix during the forward call.

fstat_kwargsdict

Kwargs to pass to the feature statistic fit function, excluding the required arguments, defaults to {}

knockoff_kwargsdict

Kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. This can be the empt dict for some identifiers such as “gaussian” or “fx”, but additional keyword arguments are required for complex samplers such as the “metro” identifier. Defaults to {}

Examples

Here we fit the KnockoffFilter on fake data from a Gaussian linear model:

# Fake data-generating process for Gaussian linear model
import knockpy as kpy
dgprocess = kpy.dgp.DGP()
dgprocess.sample_data(n=500, p=500, sparsity=0.1)

# LCD statistic with Gaussian MX knockoffs
# This uses LedoitWolf covariance estimation by default.
from knockpy.knockoff_filter import KnockoffFilter
kfilter = KnockoffFilter(
    fstat='lcd',
    ksampler='gaussian',
    knockoff_kwargs={"method":"mvr"},
)
rejections = kfilter.forward(X=dgprocess.X, y=dgprocess.y)

Attributes:

fstatknockpy.knockoff_stats.FeatureStatistic: The feature statistics to use for inference. This inherits from knockoff_stats.FeatureStatistic.
ksamplerknockpy.knockoffs.KnockoffSampler: The knockoff sampler to use during inference. This eventually inherits from knockoffs.KnockoffSampler.
fstat_kwargsdict: Dictionary of kwargs to pass to the fit call of self.fstat.
knockoff_kwargsdict: If ksampler is not yet initialized, kwargs to pass to ksampler.
Znp.ndarray: a 2p-dimsional array of feature and knockoff importances. The first p coordinates correspond to features, the last p correspond to knockoffs.
Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
Snp.ndarray: the (p, p)-shaped knockoff S-matrix used to generate knockoffs.
Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
rejectionsnp.ndarray: a (p,)-shaped boolean array where rejections[j] == 1 iff the the knockoff filter rejects the jth feature.
Gnp.ndarray: the (2p, 2p)-shaped feature-knockoff covariance matrix
thresholdfloat: the knockoff data-dependent threshold used to select variables

Methods

`forward`(X, y[, Xk, mu, Sigma, groups, fdr, ...])	Runs the knockoff filter; returns whether each feature was rejected.
`make_selections`(W, fdr)	" Calculate data dependent threshhold and selections
`sample_knockoffs`()	Samples knockoffs during `forward`.
`seqstep_plot`([max_rank])	Visualizes the knockoff filter.

forward(X, y, Xk=None, mu=None, Sigma=None, groups=None, fdr=0.1, fstat_kwargs={}, knockoff_kwargs={}, shrinkage='ledoitwolf', num_factors=None, recycle_up_to=None)[source]

Runs the knockoff filter; returns whether each feature was rejected.

Parameters:

Xnp.ndarray

(n, p)-shaped design matrix

ynp.ndarray

(n,)-shaped response vector

Xknp.ndarray

(n, p)-shaped knockoff matrix. If None, this will construct knockoffs using self.ksampler.

munp.ndarray

(p, )-shaped mean of the features. If None, this defaults to the empirical mean of the features.

Sigmanp.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the shrinkage option. This is ignored for fixed-X knockoffs.

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

fdrfloat

The desired level of false discovery rate control.

fstat_kwargsdict

Extra kwargs to pass to the feature statistic fit function, excluding the required arguments.

knockoff_kwargsdict

Extra kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. This can be the empty dict for some identifiers such as “gaussian” or “fx”, but additional keyword arguments are required for complex samplers such as the “metro” identifier. Defaults to {}

shrinkagestr

Shrinkage method if estimating the covariance matrix. Defaults to “LedoitWolf.” Other options are “MLE” and “glasso” (graphical lasso).

num_factorsint

If num_factors is not None and Sigma is estimated, assumes that the ground-truth Sigma is a factor model with rank num_factors to speed up computation.

recycle_up_toint or float

Three options:

if None, does nothing.

if an integer > 1, uses the first “recycle_up_to” rows of X as the the first recycle_up_to rows of knockoffs

if a float between 0 and 1 (inclusive), interpreted as the proportion of rows to recycle.

For more on recycling, see https://arxiv.org/abs/1602.03574

make_selections(W, fdr)[source]: “ Calculate data dependent threshhold and selections

sample_knockoffs()[source]: Samples knockoffs during forward.

seqstep_plot(max_rank=None)[source]

Visualizes the knockoff filter.

Parameters:

max_rankint: Only plot the top max_rank feature statistics. Defaults to plotting all feature statistics.

Feature Statistics

class knockpy.knockoff_stats.DeepPinkStatistic(model=None)[source]

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups, feature_importance, ...])	Wraps the FeatureStatistic class using DeepPINK to generate variable importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, feature_importance='deeppink', antisym='cd', group_agg='sum', cv_score=False, train_kwargs={'verbose': False}, **kwargs)[source]

Wraps the FeatureStatistic class using DeepPINK to generate variable importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Four options:

“deeppink”: Use the deeppink feature importance

defined in https://arxiv.org/abs/1809.01185 - “unweighted”: Use the Z weights from the deeppink paper without weighting them using the layers from the MLP. Deeppink usually outperforms this feature importance (but not always). - “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf - “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to deeppink, which is often both the most powerful and the most computationally efficient.

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options: - “CD” (Difference of absolute vals of coefficients), - “SM” (signed maximum). - “SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_scorebool

If true, score the feature statistic’s predictive accuracy using cross validation. This is extremely expensive for deeppink.

kwargsdict

Extra kwargs to pass to underlying deeppink class (in kpytorch)

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.FeatureStatistic(model=None)[source]

The base knockoff feature statistic class, which can wrap any generic prediction algorithm.

Parameters:

model: An instance of a class with a “train” or “fit” method and a “predict” method.

Attributes:

model: A (predictive) model class underlying the variable importance measures.
indsnp.ndarray: (2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.
rev_indsnp.ndarray:: Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`
scorefloat: A metric of the model’s performance, as defined by score_type.
score_typestr: One of MSE, CVMSE, accuracy, or cvaccuracy. (cv stands for cross-validated)
Znp.ndarray: a 2p-dimsional array of feature and knockoff importances. The first p coordinates correspond to features, the last p correspond to knockoffs.
Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups, feature_importance, ...])	Trains the model and creates feature importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

cv_score_model(features, y, cv_score, logistic_flag=False)[source]: Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='avg', **kwargs)[source]

Trains the model and creates feature importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Two options:

“swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

“swapint”: The swap-integral from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

Defaults to ‘swap’. This is ignored for most specialized classes inheriting from FeatureStatistic (e.g. MLR/Lasso Statistics).

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

“CD” (Difference of absolute vals of coefficients),
“SM” (signed maximum).
“SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs**dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

score_model(features, y, y_dist=None)[source]

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

Returns:

lossfloat: Either the MSE or one minus the accuracy of the model, depending on whether y is continuous or binary.

swap_feature_importances(features, y)[source]

Given a model of the features and y, calculates feature importances as follows.

For feature i, replace the feature with its knockoff and calculate the relative increase in the loss. Similarly, for knockoff i, replace the knockoffs with its feature and calculate the relative increase in the loss.

Parameters:

featuresnp.ndarray: (n, 2p)-shaped array of concatenated features and knockoffs, which must be permuted by self.inds.
ynp.ndarray: (n,)-shaped response vector

Returns:

Z_swapnp.ndarray: (2p,)-shaped array of variable importances such that Z_swap is in the same permuted as features initially were.

swap_path_feature_importances(features, y, step_size=0.5, max_lambda=5)[source]: Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

class knockpy.knockoff_stats.LassoStatistic[source]

Lasso Statistic wrapper class

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups, zstat, antisym, ...])	Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, zstat='coef', antisym='cd', group_agg='avg', cv_score=False, debias=False, Ginv=None, **kwargs)[source]

Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
zstatstr:: Two options for the variable importance measure: - If ‘coef’, uses to cross-validated (group) lasso coefficients. - If ‘lars_path’, uses the lambda value where each feature/knockoff enters the lasso path (meaning becomes nonzero). This defaults to coef.
antisymstr: The antisymmetric function used to create (ungrouped) feature statistics. Three options: - “CD” (Difference of absolute vals of coefficients), - “SM” (signed maximum). - “SCD” (Simple difference of coefficients - NOT recommended)
group_aggstr: For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
cv_scorebool: If true, score the feature statistic’s predictive accuracy using cross validation.
debiasbool:: If true, debias the lasso. See https://arxiv.org/abs/1508.02757
Ginvnp.ndarray: (2p, 2p)-shaped precision matrix for the feature-knockoff covariate distribution. This must be specified if debias=True.
kwargsdict: Extra kwargs to pass to underlying Lasso classes.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

Notes

When not using the group lasso, one can specify choice(s) of regularization parameter using the alphas keyword argument. Otherwise, use the reg_vals keyword argument.

class knockpy.knockoff_stats.MargCorrStatistic[source]

Lasso Statistic wrapper class

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups])	Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, **kwargs)[source]

Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
kwargsdict: Extra kwargs to pass to underlying combine_Z_stats

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.OLSStatistic[source]

Lasso Statistic wrapper class

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups, cv_score])	Wraps the FeatureStatistic class with OLS coefs as variable importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class with OLS coefs as variable importances.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
cv_scorebool: If true, score the feature statistic’s predictive accuracy using cross validation.
kwargsdict: Extra kwargs to pass to combine_Z_stats.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.RandomForestStatistic(model=None)[source]

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups, feature_importance, ...])	Wraps the FeatureStatistic class using a Random Forest to generate variable importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='sum', cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class using a Random Forest to generate variable importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Three options:

“sklearn”: Use sklearn feature importances. These

are very poor measures of feature importance, but very fast. - “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower. - “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to ‘swap’

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options: - “CD” (Difference of absolute vals of coefficients), - “SM” (signed maximum). - “SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_scorebool

If true, score the feature statistic’s predictive accuracy using cross validation. This is extremely expensive for random forests.

kwargsdict

Extra kwargs to pass to underlying RandomForest class

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.RidgeStatistic[source]

Ridge statistic wrapper class

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y[, groups, antisym, group_agg, ...])	Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, antisym='cd', group_agg='avg', cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
antisymstr: The antisymmetric function used to create (ungrouped) feature statistics. Three options: - “CD” (Difference of absolute vals of coefficients), - “SM” (signed maximum). - “SCD” (Simple difference of coefficients - NOT recommended)
group_aggstr: For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
cv_scorebool: If true, score the feature statistic’s predictive accuracy using cross validation.
kwargsdict: Extra kwargs to pass to underlying Lasso classes

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

knockpy.knockoff_stats.calc_lars_path(X, Xk, y, groups=None, **kwargs)[source]

Calculates locations at which X/knockoffs enter lasso model when regressed on y.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
**kwargs: kwargs for sklearn.linear_model.lars_path

Returns:

Znp.ndarray: (2p,)-shaped array indicating the lasso path statistic for each variable. (This means the maximum lambda such that the lasso coefficient on variable j is nonzero.)

knockpy.knockoff_stats.calc_mse(model, X, y)[source]: Gets MSE of a model

knockpy.knockoff_stats.combine_Z_stats(Z, groups=None, antisym='cd', group_agg='sum')[source]

Given Z scores (variable importances), returns (grouped) feature statistics

Parameters:

Znp.ndarray: (2p,)-shaped numpy array of Z-statistics. The first p values correspond to true features, and the last p correspond to knockoffs (in the same order as the true features).
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
antisymstr: The antisymmetric function used to create (ungrouped) feature statistics. Three options: - “CD” (Difference of absolute vals of coefficients), - “SM” (signed maximum). - “SCD” (Simple difference of coefficients - NOT recommended)
group_aggstr: For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

knockpy.knockoff_stats.compute_residual_variance(X, Xk, y)[source]

Estimates sigma2 using residual variance of y given [X, Xk]. Uses a penalized linear model if n < 2p + 10 or p >= 1500 (for efficiency).

Returns:

sigma2float: Estimated residual variance

knockpy.knockoff_stats.data_dependent_threshhold(W, fdr=0.1, offset=1)[source]

Calculate data-dependent threshhold given W statistics.

Parameters:

Wnp.ndarray: p-length numpy array of feature statistics OR (p, batch_length) shaped array.
fdrfloat: desired level of false discovery rate control
offsetint: If offset = 0, control the modified FDR. If offset = 1 (default), controls the FDR exactly.

Returns:

Tfloat or np.ndarray: The data-dependent threshhold. Either a float or a (batch_length,) dimensional array.

knockpy.knockoff_stats.fit_lasso(X, Xk, y, y_dist=None, alphas=None, use_lars=False, **kwargs)[source]

Fits cross-validated lasso on [X, Xk] and y.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix.
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
y_diststr: One of “binomial” or “gaussian”
alphasfloat: The regularization parameter(s) to try. Selects one alpha using cross-validation by default.
use_larsbool: If True, uses a LARS-based solver for Gaussian data. If False, uses a gradient based solver (default).
**kwargs: kwargs for sklearn model.

Returns:

glsklearn Lasso/LassoCV/LassoLarsCV/LogisticRegressionCV object: The sklearn model fit through cross-validation.
indsnp.ndarray: (2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.
rev_indsnp.ndarray:: Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

Notes

To avoid FDR control violations, this randomly permutes the features before fitting the lasso.

knockpy.knockoff_stats.fit_ridge(X, Xk, y, y_dist=None, **kwargs)[source]

Fits cross-validated ridge on [X, Xk] and y.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
y_diststr: One of “binomial” or “gaussian”
**kwargs: kwargs for sklearn model.

Returns:

glsklearn.linear_model.RidgeCV/LogisticRegressionCV: The sklearn model fit through cross-validation.
indsnp.ndarray: (2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.
rev_indsnp.ndarray:: Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

Notes

To avoid FDR control violations, this randomly permutes the features before fitting the ridge.

knockpy.knockoff_stats.parse_logistic_flag(kwargs)[source]: Checks whether y_dist is binomial

knockpy.knockoff_stats.parse_y_dist(y)[source]: Checks if y is binary; else assumes it is continuous

class knockpy.mlr.mlr.MLR_FX_Spikeslab(**kwargs)[source]

Masked likelihood ratio statistics using a spike-and-slab linear model. This is a specialized class designed to lead to slightly faster computation for fixed-X knockoffs.

The arguments are the same as those for MLR_Spikeslab, with the exception of the arguments listed below.

Parameters:

num_mixtureint: Number of mixtures for the “slab” component of the spike and slab. Defaults to 1.
tau2float or list of floats: Prior variance on nonzero coefficients. Default: 1.0.
tau2_a0float or list of floats: tau2 has an InvGamma(tau2_a0, tau2_b0) hyperprior. When n_mixture > 1, this can be a list of length n_mixture. Default: 2.0.
tau2_b0float or list of floats: tau2 has an InvGamma(tau2_a0, tau2_b0) hyperprior. When n_mixture > 1, this can be a list of length n_mixture. Default: 0.01.

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, groups, y, **kwargs)	Trains the model and creates feature importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

calc_whiteout_statistics
compute_W

fit(X, Xk, groups, y, **kwargs)[source]

Trains the model and creates feature importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Two options:

“swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

“swapint”: The swap-integral from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

Defaults to ‘swap’. This is ignored for most specialized classes inheriting from FeatureStatistic (e.g. MLR/Lasso Statistics).

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

“CD” (Difference of absolute vals of coefficients),
“SM” (signed maximum).
“SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs**dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.mlr.mlr.MLR_Spikeslab(**kwargs)[source]

Masked likelihood ratio statistics using a spike-and-slab prior for a linear model or probit model, automatically inferred based on y.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
ynp.ndarray: (n,)-shaped response vector
n_iterint: Number of samples per MCMC chain used to compute MLR statistics. Default: 2000.
chainsint: Number of MCMC chains to run. Default: 5.
burn_propfloat: The burn-in for each chain will be equal to n_iter * burn_prop.
adjusted_mlrbool: If True, computes adjusted MLR (AMLR) statistics. Default: False. See the MLR knockoffs paper for details.
p0float: Prior probability that any coefficient equals zero.
update_p0bool: If True, updates p0 using a Beta hyperprior on p0. Else, the value of p0 is fixed. Default: True.
p0_a0float: If update_p0 is True, p0 has a TruncBeta(p0_a0, p0_b0, min_p0) hyperprior. Default: 1.0.
p0_b0float: If update_p0 is True, p0 has a TruncBeta(p0_a0, p0_b0, min_p0) hyperprior. Default: 1.0.
min_p0float: Minimum value for p0 as specified by the prior. Default: 0.8.
sigma2float: Variance of y given X. Default: 1.0.
update_sigma2bool: If True, imposes an InverseGamma hyperprior on sigma2. Else, the value of sigma2 is fixed. Default: True.
sigma2_a0float: If update_sigma2 is True, sigma2 has an InvGamma(sigma2_a0, sigma2_b0) hyperprior. Default: 2.0.
sigma2_b0float: If update_sigma2 is True, sigma2 has an InvGamma(sigma2_a0, sigma2_b0) hyperprior. Default: 1.0.
tau2float: Prior variance on nonzero coefficients. Default: 1.0.
update_tau2bool: If True, imposes an InverseGamma hyperprior on tau2. Else, the value of tau2 is fixed. Default: True.
tau2_a0float: If update_tau2 is True, tau2 has an InvGamma(tau2_a0, tau2_b0) hyperprior. Default: 2.0.
tau2_b0float: If update_tau2 is True, tau2 has an InvGamma(tau2_a0, tau2_b0) hyperprior. Default: 1.0.

Returns:

Wnp.ndarray: a p-dimensional array of feature statistics.

Notes

This is a valid FX feature statistic (obeys the sufficiency property).

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, y, groups, **kwargs)	Trains the model and creates feature importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

compute_W

fit(X, Xk, y, groups, **kwargs)[source]

Trains the model and creates feature importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Two options:

“swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

“swapint”: The swap-integral from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

Defaults to ‘swap’. This is ignored for most specialized classes inheriting from FeatureStatistic (e.g. MLR/Lasso Statistics).

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

“CD” (Difference of absolute vals of coefficients),
“SM” (signed maximum).
“SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs**dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.mlr.mlr.MLR_Spikeslab_Splines(**kwargs)[source]

Masked likelihood ratio statistics using a spike-and-slab prior for a linear or probit model based on regression splines.

All parameters are the same as MLR_Spikeslab except the following additional parameters.

Parameters:

n_knotsint: The number of knots to used for the regression splines. Defaults to 1.
degreeint: The number of degrees to use for the regresison splines. Defaults to 3.

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, groups, y, **kwargs)	Trains the model and creates feature importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

compute_W

fit(X, Xk, groups, y, **kwargs)[source]

Trains the model and creates feature importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Two options:

“swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

“swapint”: The swap-integral from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

Defaults to ‘swap’. This is ignored for most specialized classes inheriting from FeatureStatistic (e.g. MLR/Lasso Statistics).

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

“CD” (Difference of absolute vals of coefficients),
“SM” (signed maximum).
“SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs**dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.mlr.mlr.OracleMLR(beta, **kwargs)[source]

Oracle masked likelihood ratio statistics for generalized additive models.

Parameters:

betanp.ndarray: p-dimensional array of linear coefficients.

Methods

`cv_score_model`(features, y, cv_score[, ...])	Similar to score_model, but uses cross-validated scoring if cv_score=True.
`fit`(X, Xk, groups, y, **kwargs)	Trains the model and creates feature importances.
`score_model`(features, y[, y_dist])	Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
`swap_feature_importances`(features, y)	Given a model of the features and y, calculates feature importances as follows.
`swap_path_feature_importances`(features, y[, ...])	Similar to `swap_feature_importances`; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

compute_W

fit(X, Xk, groups, y, **kwargs)[source]

Trains the model and creates feature importances.

Parameters:

Xnp.ndarray

the (n, p)-shaped design matrix

Xknp.ndarray

the (n, p)-shaped matrix of knockoffs

ynp.ndarray

(n,)-shaped response vector

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importancestr

Specifies how to create feature importances from model. Two options:

“swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

“swapint”: The swap-integral from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf.

Defaults to ‘swap’. This is ignored for most specialized classes inheriting from FeatureStatistic (e.g. MLR/Lasso Statistics).

antisymstr

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

“CD” (Difference of absolute vals of coefficients),
“SM” (signed maximum).
“SCD” (Simple difference of coefficients - NOT recommended)

group_aggstr

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs**dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns:

Wnp.ndarray: an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

Gaussian and Fixed-X Knockoff Samplers

class knockpy.knockoffs.FXSampler(X, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)[source]

Samples FX knockoffs. See the GaussianSampler documentation for description of the arguments.

Methods

`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`fetch_S`()	Rescales S to the same scale as the initial X input
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.
`sample_knockoffs`([check_psd])	Samples knockoffs.

fetch_S()[source]: Rescales S to the same scale as the initial X input

sample_knockoffs(check_psd=False)[source]

Samples knockoffs. returns n x p knockoff matrix.

Parameters:

check_psdbool: If True, will check and enforce that S is a valid S-matrix. Defalts to False.

class knockpy.knockoffs.GaussianSampler(X, mu=None, Sigma=None, invSigma=None, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)[source]

Samples MX Gaussian (group) knockoffs.

Parameters:

Xnp.ndarray

the (n, p)-shaped design

munp.ndarray

(p, )-shaped mean of the features. If None, this defaults to the empirical mean of the features.

Sigmanp.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the utilities.estimate_covariance function.

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

Snp.ndarray

the (p, p)-shaped knockoff S-matrix used to generate knockoffs. This is defined such that Cov(X, tilde(X)) = Sigma - S. When None, will be constructed by knockoff generator. Defaults to None.

methodstr

Specifies how to construct S matrix. This will be ignored if S is not None. There are several options:

‘mvr’: Minimum Variance-Based Reconstructability knockoffs.

‘mmi’: Minimizes the mutual information between X and the knockoffs.

‘ci’: Conditional independence knockoffs.

‘sdp’: minimize the mean absolute covariance (MAC) between the features

and the knockoffs. - ‘equicorrelated’: Minimizes the MAC under the constraint that the the correlation between each feature and its knockoff is the same.

The default is to use mvr for non-group knockoffs, and to use the group-SDP for grouped knockoffs (the implementation for group mvr knockoffs is currently fairly slow). In both cases we use a block-diagonal approximation if the number if features is greater than 1000.

objectivestr

How to optimize the S matrix if using the SDP for group knockoffs. There are several options:

‘abs’: minimize sum(abs(Sigma - S))

between groups and the group knockoffs. - ‘pnorm’: minimize Lp-th matrix norm. Equivalent to abs when p = 1. - ‘norm’: minimize different type of matrix norm (see norm_type below).

sample_tolfloat

Minimum eigenvalue allowed for feature-knockoff covariance matrix. Keep this small but nonzero (1e-5) to prevent numerical errors.

verbosebool

If True, prints progress over time

rec_propfloat

The proportion of knockoffs to recycle (see Barber and Candes 2018, https://arxiv.org/abs/1602.03574). If method = ‘mvr’, then S_generation takes this into account and should increase the power of recycled knockoffs. sparsely-correlated, high-dimensional settings.

kwargsdict

Other kwargs for S-matrix solvers.

Methods

`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`fetch_S`()	Fetches knockoff S-matrix.
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.
`sample_knockoffs`([check_psd])	Samples knockoffs.

fetch_S()[source]: Fetches knockoff S-matrix.

sample_knockoffs(check_psd=False)[source]

Samples knockoffs. returns n x p knockoff matrix.

Parameters:

check_psdbool: If True, will check and enforce that S is a valid S-matrix. Defalts to False.

class knockpy.knockoffs.KnockoffSampler[source]

Base class for sampling knockoffs.

Methods

`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`fetch_S`()	Fetches knockoff S-matrix.
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.

sample_knockoffs

check_PSD_condition(Sigma, S)[source]

Checks that the feature-knockoff cov matrix is PSD.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of the features. If None, this is estimated using the shrinkage option. This is ignored for fixed-X knockoffs.
Snp.ndarray: the (p, p)-shaped knockoff S-matrix used to generate knockoffs.

Raises:

Raises an error if S is not PSD or 2 Sigma - S is not PSD.

check_xk_validity(X, Xk, testname='', alpha=0.001)[source]

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X. Uses the BHQ adjustment for multiple testing.

Parameters:

Xnp.ndarray: the (n, p)-shaped design
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
testnamestr: a testname that shows up in the error
alphafloat: The significance level. Defaults to 0.001

fetch_S()[source]: Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)[source]: Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.

knockpy.knockoffs.produce_FX_knockoffs(X, invSigma, S, copies=1)[source]: See equation (1.4) of https://arxiv.org/pdf/1404.5609.pdf

Metropolized Samplers

The metropolized knockoff sampler for an arbitrary probability density and graphical structure using covariance-guided proposals.

See https://arxiv.org/abs/1903.00434 for a description of the algorithm and proof of validity and runtime.

This code was based on initial code written by Stephen Bates in October 2019, which was released in combination with https://arxiv.org/abs/1903.00434.

class knockpy.metro.ARTKSampler(X, Sigma, df_t, **kwargs)[source]

Samples knockoffs for autoregressive T-distributed designs. (Hence, ARTK). See https://arxiv.org/pdf/1903.00434.pdf for details.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
Sigmanp.ndarray: (p, p)-shaped covariance matrix of the features. The first diagonal should be the pairwise correlations which define the Markov chain.
df_tfloat: The degrees of freedom for the t-distributions.
kwargsdict: kwargs to pass to the constructor method of the generic MetropolizedKnockoffSampler class.

Methods

`cache_conditional_proposal_params`([verbose, ...])	Caches some of the conditional means for Xjstar \| Xtemp.
`center`(M[, active_inds])	Centers an n x j matrix M.
`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`compute_F`(x_flags, j)	Computes the F function from Page 33 pf the paper: Pr(tildeXj=tildexj, Xjstar=xjstar \| Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}) Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.
`compute_acc_prob`(x_flags, j[, log_q1, ...])	Computes acceptance probability for variable `j` given a particular rejection pattern `x_flags`.
`create_proposal_params`(**kwargs)	Constructs the covariance-guided proposal.
`fetch_S`()	Fetches knockoff S-matrix.
`fetch_cached_proposal_params`(Xtemp, x_flags, j)	Same as above, but uses caching to speed up computation.
`fetch_proposal_params`(X, prev_proposals)	Returns mean and variance of proposal j given X and previous proposals.
`lf`(X)	Reordered likelihood function
`lf_ratio`(X, Xjstar, j)	Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X.
`log_q12`(x_flags, j)	Computes q1 and q2 as specified by page 33 of the paper.
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.
`q_ll`(Xjstar, X, prev_proposals[, cond_mean, ...])	Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals. Xjstar : np.ndarray `(n,)`-shaped numpy array of values to evaluate the proposal likelihood at. X : np.ndarray `(n, p)`-shaped array of observed data, in the order used to sample knockoff variables. prev_proposals : np.ndarray `(n, j-1)`-shaped array of previous proposals, in the order used to sample knockoff variables. If None, assumes j = 0.
`sample_knockoffs`([clip, cache])	Samples knockoffs using the metropolized knockoff sampler.
`sample_proposals`(X, prev_proposals[, ...])	Samples a continuous or discrete proposal given the design matrix and the previous proposals.

class knockpy.metro.BlockTSampler(X, Sigma, df_t, **kwargs)[source]

Methods

`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`fetch_S`()	Fetches knockoff S-matrix.
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.
`sample_knockoffs`(**kwargs)	Actually samples knockoffs sequentially for each block.

fetch_S()[source]: Fetches knockoff S-matrix.

sample_knockoffs(**kwargs)[source]

Actually samples knockoffs sequentially for each block.

Parameters:

kwargsdict: kwargs for the MetropolizedKnockoffSampler.sample_knockoffs call for each block.

Returns:

Xknp.ndarray: A (n, p)-shaped knockoff matrix in the original order the variables were passed in.

class knockpy.metro.GibbsGridSampler(X, gibbs_graph, Sigma, Q=None, mu=None, max_width=6, **kwargs)[source]

Samples knockoffs for a discrete gibbs grid using the divide-and-conquer algorithm plus metropolized knockoff sampling.

Parameters:

Xnp.ndarray: the (n, p)-shaped design matrix
gibbs_graphnp.ndarray: (p, p)-shaped matrix specifying the distribution of the gibbs grid: see knockpy.dgp.sample_gibbs. This must correspond to a grid-like undirected graphical model.
Sigmanp.ndarray: (p, p)-shaped estimated covariance matrix of the data.
max_widthint: The maximum treewidth to allow in the divide-and-conquer algorithm.

Notes

Unlike the attributes of a MetropolizedKnockoffSampler class, the attributes of a BlockTSampler class are stored in the same order that the design matrix is initially passed in. E.g., self.Xk corresponds with self.X.

Methods

`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`fetch_S`()	Returns `None` because the divide-and-conquer approach means there is no one S-matrix.
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.
`sample_knockoffs`(**kwargs)	Samples knockoffs using divide-and-conquer approach.

coords2num
num2coords

fetch_S()[source]: Returns None because the divide-and-conquer approach means there is no one S-matrix.

sample_knockoffs(**kwargs)[source]

Samples knockoffs using divide-and-conquer approach.

Parameters:

kwargsdict: Keyword args for MetropolizedKnockoffSampler.sample_knockoffs.

Returns:

Xknp.ndarray: A (n, p)-shaped knockoff matrix in the original order the variables were passed in.

class knockpy.metro.MetropolizedKnockoffSampler(lf, X, mu=None, Sigma=None, undir_graph=None, order=None, active_frontier=None, gamma=0.999, metro_verbose=False, cliques=None, log_potentials=None, buckets=None, **kwargs)[source]

A metropolized knockoff sampler for arbitrary random variables using covariance-guided proposals.

Group knockoffs are not yet supported.

Parameters:

lffunction: log-probability density. This function should take a (n, p)-shaped numpy array (n independent samples of a p-dimensional vector) and return a (n,) shaped array of log-probabilities. This can also be supplied as None if cliques and log-potentials are supplied.
Xnp.ndarray: the (n, p)-shaped design matrix
Xknp.ndarray: the (n, p)-shaped matrix of knockoffs
munp.ndarray: The (estimated) mean of X. Exact FDR control is maintained even when this vector is incorrect. Defaults to the mean of X, e.g., X.mean(axis=0).
Sigmanp.ndarray: (p, p)-shaped covariance matrix of the features. If None, this is estimated using the data using a naive method to ensure compatability with the proposals. Exact FDR control is maintained even when Sigma is incorrect.
undir_graphnp.ndarray or nx.Graph: An undirected graph specifying the conditional independence structure of the data-generating process. This must be specified if either of the order or active_frontier params are not specified. One of two options: - A networkx undirected graph object - A (p, p)-shaped numpy array, where nonzero elements represent edges.
ordernp.ndarray: A p-length numpy array specifying the ordering to sample the variables. Should be a vector with unique entries 0,…,p-1.
active_fontierA list of lists of length p where entry j is the set of: entries > j that are in V_j. This specifies the conditional independence structure of the distribution given by lf. See page 34 of the paper.
gammafloat: A tuning parameter to increase / decrease the acceptance ratio. See appendix F.2. Defaults to 0.999.
bucketsnp.ndarray or list: A list or array of discrete values that X can take. Covariance-guided proposals will be rounded to these values. If None, Metro assumes the domain of each feature is all real numbers.
kwargsdict: kwargs to pass to the smatrix.compute_smatrix method for sampling proposals.

Notes

All attributes of the MetropolizedKnockoffSampler are stored in the order that knockoffs are sampled, NOT the order that variables are initially passed in. For example, the X attribute will not necessarily equal the X argument: instead, self.X = X[:, self.order]. To reorder attributes to the initial order of the X argument, use the syntax self.attribute[:, self.inv_order].

Attributes:

ordernp.ndarray: (p,)-shaped array of indices which reorders X into the order for sampling knockoffs.
inv_ordernp.ndarray: (p,)-shaped array of indices which takes a set of variables which have been reordered for metropolized sampling and returns them to their initial order. For example, X == X[:, self.order][:, self.inv_order].
Xnp.ndarray: (n, p) design matrix reordered according to the order for sampling knockoffs
X_propnp.ndarray: (n, p)-shaped array of knockoff proposals
Xknp.ndarray: the (n, p)-shaped array of knockoffs
acceptancesnp.ndarray: a (n, p)-shaped boolean array where acceptances[i, j] == 1 indicates that X_prop[i, j] was accepted.
final_acc_probsnp.ndarray: a (n, p)-shaped array where final_acc_probs[i, j] is the acceptance probability for X_prop[i, j].
Sigmanp.ndarray: the (p, p)-shaped estimated covariance matrix of X. The class constructor guarantees this is compatible with the conditional independence structure of the data.
Snp.ndarray: the (p, p)-shaped knockoff S-matrix used to generate the covariance-guided proposals.

Methods

`cache_conditional_proposal_params`([verbose, ...])	Caches some of the conditional means for Xjstar \| Xtemp.
`center`(M[, active_inds])	Centers an n x j matrix M.
`check_PSD_condition`(Sigma, S)	Checks that the feature-knockoff cov matrix is PSD.
`check_xk_validity`(X, Xk[, testname, alpha])	Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
`compute_F`(x_flags, j)	Computes the F function from Page 33 pf the paper: Pr(tildeXj=tildexj, Xjstar=xjstar \| Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}) Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.
`compute_acc_prob`(x_flags, j[, log_q1, ...])	Computes acceptance probability for variable `j` given a particular rejection pattern `x_flags`.
`create_proposal_params`(**kwargs)	Constructs the covariance-guided proposal.
`fetch_S`()	Fetches knockoff S-matrix.
`fetch_cached_proposal_params`(Xtemp, x_flags, j)	Same as above, but uses caching to speed up computation.
`fetch_proposal_params`(X, prev_proposals)	Returns mean and variance of proposal j given X and previous proposals.
`lf`(X)	Reordered likelihood function
`lf_ratio`(X, Xjstar, j)	Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X.
`log_q12`(x_flags, j)	Computes q1 and q2 as specified by page 33 of the paper.
`many_ks_tests`(sample1s, sample2s)	Samples1s, Sample2s = list of arrays Gets p values by running ks tests and then does a multiple testing correction.
`q_ll`(Xjstar, X, prev_proposals[, cond_mean, ...])	Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals. Xjstar : np.ndarray `(n,)`-shaped numpy array of values to evaluate the proposal likelihood at. X : np.ndarray `(n, p)`-shaped array of observed data, in the order used to sample knockoff variables. prev_proposals : np.ndarray `(n, j-1)`-shaped array of previous proposals, in the order used to sample knockoff variables. If None, assumes j = 0.
`sample_knockoffs`([clip, cache])	Samples knockoffs using the metropolized knockoff sampler.
`sample_proposals`(X, prev_proposals[, ...])	Samples a continuous or discrete proposal given the design matrix and the previous proposals.

cache_conditional_proposal_params(verbose=False, expensive_cache=True)[source]: Caches some of the conditional means for Xjstar | Xtemp. If expensive_cache = True, this will be quite memory intensive in order to achieve a 2-3x speedup. Otherwise, achieves a a 20-30% speedup at a more modest memory cost.

center(M, active_inds=None)[source]: Centers an n x j matrix M. For mu = 0, does not perform this computation, which actually is a bottleneck for large n and p.

compute_F(x_flags, j)[source]: Computes the F function from Page 33 pf the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}) Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j, log_q1=None, log_q2=None, Xtemp=None)[source]

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

Mathematically, this is: Pr(tildeXj = Xjstar | Xtemp, Xtilde_{1:j-1}, Xstar_{1:j})

create_proposal_params(**kwargs)[source]

Constructs the covariance-guided proposal.

Parameters:

kwargsdict: kwargs for the smatrix.compute_smatrix function

fetch_S()[source]: Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)[source]: Same as above, but uses caching to speed up computation. This caching can be cheap (if self.cache is False) or extremely expensive (if self.cache is True) in terms of memory.

fetch_proposal_params(X, prev_proposals)[source]: Returns mean and variance of proposal j given X and previous proposals. Both X and prev_proposals must be in the order used to sample knockoff variables.

lf(X)[source]: Reordered likelihood function

lf_ratio(X, Xjstar, j)[source]

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X. This is equivalent to (but often faster) than:

>>> ld_obs = self.lf(X)
>>> Xnew = X.copy()
>>> Xnew[:, j] = Xjstar
>>> ld_prop = self.lf(Xnew)
>>> ld_ratio = ld_prop - ld_obs

When node potentials have been passed, this is much faster than calculating the log-likelihood function and subtracting.

Parameters:

X – a n x p matrix of observations
Xjstar – New observations for column j of X
j – an int between 0 and p - 1, telling us which column to replace

log_q12(x_flags, j)[source]: Computes q1 and q2 as specified by page 33 of the paper.

q_ll(Xjstar, X, prev_proposals, cond_mean=None, cond_var=None)[source]

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals. Xjstar : np.ndarray

(n,)-shaped numpy array of values to evaluate the proposal likelihood at.

Xnp.ndarray: (n, p)-shaped array of observed data, in the order used to sample knockoff variables.
prev_proposalsnp.ndarray: (n, j-1)-shaped array of previous proposals, in the order used to sample knockoff variables. If None, assumes j = 0.

sample_knockoffs(clip=1e-05, cache=None)[source]

Samples knockoffs using the metropolized knockoff sampler.

Parameters:

clipfloat: To provide numerical stability, we make the minimum acceptance probability clip. If clip=0, some acceptance probabilities may become negative due to floating point errors.
cachebool: If True, uses a very memory intensive caching system to get a 2-3x speedup when calculating conditional means for the proposals. Defaults to true if n * (p**2) < 1e9.

Returns:

Xknp.ndarray: A (n, p)-shaped knockoff matrix in the original order the variables were passed in.

sample_proposals(X, prev_proposals, cond_mean=None, cond_var=None)[source]: Samples a continuous or discrete proposal given the design matrix and the previous proposals. Can pass in the conditional mean and variance of the new proposals, if cached, to save computation.

knockpy.metro.gaussian_log_likelihood(X, mu, var)[source]: Somehow this is faster than scipy

knockpy.metro.get_ordering(T)[source]

Takes a junction tree and returns a variable ordering for the metro knockoff sampler. The code from this function is adapted from the code distributed with https://arxiv.org/abs/1903.00434.

Parameters:

TA networkx graph that is a junction tree.
Nodes must be sets with elements 0,…,p-1.

Returns:

ordera numpy array with unique elements 0,…,p-1
active_frontierlist of lists: a list of length p gwhere entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of a joint covariate distribution. See page 34 of https://arxiv.org/abs/1903.00434.

knockpy.metro.t_log_likelihood(X, df_t)[source]: UNNORMALIZED t loglikelihood. This is also faster than scipy

knockpy.metro.t_markov_loglike(X, rhos, df_t=3)[source]: Calculates log-likelihood for markov chain specified in https://arxiv.org/pdf/1903.00434.pdf

knockpy.metro.t_mvn_loglike(X, invScale, mu=None, df_t=3)[source]: Calculates multivariate t log-likelihood up to normalizing constant. :param X: n x p array of data :param invScale: p x p array, inverse multivariate t scale matrix :param mu: p-length array, location parameter :param df_t: degrees of freedom

S-matrix computation

knockpy.smatrix.compute_smatrix(Sigma, groups=None, method=None, solver=None, how_approx='blockdiag', max_block=1000, num_factors=20, num_processes=1, line_search=True, D=None, U=None, **kwargs)[source]

Wraps a variety of S-matrix generation functions. For mvr, maxent, mmi, and sdp methods, this can use a block-diagonal approximation of Sigma if the dimension of Sigma exceeds max_block or a factor approximation.

Parameters:

Sigmanp.ndarray
``(p, p)``-shaped covariance matrix of X
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
methodstr: Method for constructing S-matrix. One of mvr, maxent, mmi, sdp, equicorrelated, ci.
solverstr: Method for solving for knockoffs. One of ‘cd’ (coordinate descent), ‘sdp’ (default sdp solver), or ‘psgd’ (projected gradient descent). Coordinate descent is recommended for MRC knockoffs.
how_approxstr: How to approximate the covariance matrix to speed up computation. - If ‘blockdiag’, approximates Sigma as a block-diagonal matrix. - If ‘factor’, approximates Sigma = np.diag(D) + np.dot(U, U.T), a factor model.
max_blockint: The maximum size of a block if how_approx=’blockdiag’. Defaults to 1000.
num_factorsint: The number of factors if how_approx=’factor’. Defaults to 20.
num_processesint: Number of parallel process to use if Sigma is approximated as a block-diagonal matrix.
Dnp.ndarray: p-shaped array of diagonal elements for factor model. Only used if how_approx=’factor’. This is optional if Sigma is not None.
Unp.ndarray: (p, k)-shaped matrix for factor model. Usually k << p. Only used if how_approx=’factor’. This is optional if Sigma is not None.
line_searchbool: If false, mrc methods do not do a line search after blockdiag approximation.
kwargsdict: kwargs to pass to one of the wrapped S-matrix solvers.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

Notes

mmi stands for minimum mutual information, which is the same as maximizing entropy.

knockpy.smatrix.compute_smatrix_factored(Sigma, D=None, U=None, method='mvr', num_factors=20, **kwargs)[source]

Wraps S-matrix generation functions which approximate Sigma = np.diag(D) + np.dot(U, U.T).

Parameters:

Sigmanp.ndarray
``(p, p)``-shaped covariance matrix of X
Dnp.ndarray: p-shaped array of diagonal elements for factor model. Only used if how_approx=’factor’. This is optional if Sigma is not None.
Unp.ndarray: (p, k)-shaped matrix for factor model. Usually k << p. Only used if how_approx=’factor’. This is optional if Sigma is not None.
methodstr: Method for constructing S-matrix. One of mvr, maxent, mmi.
num_factorsint: The number of factors if how_approx=’factor’. Defaults to 20.
kwargsdict: kwargs to pass to one of the wrapped S-matrix solvers.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

Notes

mmi stands for minimum mutual information, which is the same as maximizing entropy.

knockpy.smatrix.divide_computation(Sigma, max_block)[source]

Approximates a correlation matrix Sigma as a block-diagonal matrix using hierarchical clustering. Roughly follows the R knockoff package.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
max_sizeint: Maximum size of a block in the block-diagonal approximation.

Returns:

blocksnp.ndarray: (p, )-shaped numpy array where blocks[i] == j indicates that variable i belongs to block j.

knockpy.smatrix.merge_groups(groups, max_block)[source]: Merges groups of variables together while ensuring all new groups have size less than max_block.

knockpy.smatrix.parse_method(method, groups, p=None)[source]: Decides which method to use to create the knockoff S matrix

Methods for minimum-reconstructability knockoffs.

knockpy.mrc.cholupdate(R, x, add=True)[source]

Performs rank one updates to a cholesky factor R in place.

Parameters:

Rnp.ndarray: A (p,p)-shaped upper-triangular matrix.
xnp.ndarray: A (p,)-shaped vector.
addbool: If True, performs a rank one update; else performs a rank one downdate.

Returns:

Rnp.ndarray: Suppose the parameter R was a cholesky factor of a matrix V. Upon return, R is the cholesky factor of V +/- np.outer(x, x).

Notes

This function modifies both R and x in place.
The choldate package is a much faster and more

numerically stable alternative.

knockpy.mrc.maxent_loss(Sigma, S, smoothing=0)[source]

Computes the log determinant of the feature-knockoff precision matrix, which is proportional to the negative entropy of [X, tilde{X}].

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
Snp.ndarray: (p, p)-shaped S-matrix used to generate knockoffs
smoothingfloat: Add smoothing to all eigenvalues of the feature-knockoff precision matrix before taking the log determinant to avoid numerical instability. Defaults to 0.

Returns:

lossfloat: The maxent loss for Sigma and S. This is infinite if S is not feasible.

knockpy.mrc.mmi_loss(*args, **kwargs)[source]

Computes the log determinant of the feature-knockoff precision matrix, which is proportional mutual information between X and knockoffs.

This is identical to maxent_loss and exists only for backwards compatability.

knockpy.mrc.mvr_loss(Sigma, S, smoothing=0)[source]

Computes minimum variance-based reconstructability loss for knockoffs, e.g., the trace of the feature-knockoff precision matrix.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
Snp.ndarray: (p, p)-shaped S-matrix used to generate knockoffs
smoothingfloat: Add smoothing to all eigenvalues of the feature-knockoff precision matrix before inverting to avoid numerical instability. Defaults to 0.

Returns:

lossfloat: The MVR loss for Sigma and S. This is infinite if S is not feasible.

knockpy.mrc.solve_ciknock(Sigma, tol=1e-05, num_iter=10)[source]

Computes S-matrix used to generate conditional independence knockoffs.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.
num_iterint: The number of iterations in the binary search to ensure S is feasible.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

Notes

When the S-matrix corresponding to conditional independence knockoffs is not feasible, this computes that S matrix and then does a binary search to find the maximum gamma such that gamma * S is feasible.

knockpy.mrc.solve_maxent(Sigma, tol=1e-05, verbose=False, num_iter=50, converge_tol=0.0001, choldate_warning=True)[source]

Computes S-matrix used to generate maximum entropy knockoffs using coordinate descent.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.
verbosebool: If True, prints updates during optimization.
num_iterint: The number of coordinate descent iterations. Defaults to 50.
converge_tolfloat: A parameter specifying the criteria for convergence.
choldate_warningbool: If True, will warn the user if choldate is not installed. Defaults to True

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mrc.solve_maxent_factored(D, U, tol=1e-05, verbose=False, num_iter=50, converge_tol=0.0001)[source]

Computes S-matrix used to generate maximum entropy knockoffs using coordinate descent assuming that the covariance matrix follows a factor model. This means Sigma = D + UU^T for a p x p diagonal matrix D and a p x k matrix U.

Parameters:

Dnp.ndarray: p-shaped array of diagonal elements.
Unp.ndarray: (p, k)-shaped matrix. Usually k << p.
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.
verbosebool: If True, prints updates during optimization.
num_iterint: The number of coordinate descent iterations. Defaults to 50.
converge_tolfloat: A parameter specifying the criteria for convergence.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mrc.solve_mmi(*args, **kwargs)[source]: Computes S-matrix used to generate minimum mutual information knockoffs. This is identical to solve_maxent and exists only for backwards compatability.

knockpy.mrc.solve_mvr(Sigma, groups=None, tol=1e-05, verbose=False, num_iter=50, smoothing=0, rej_rate=0, converge_tol=0.001, choldate_warning=True)[source]

Computes S-matrix used to generate minimum variance-based reconstructability knockoffs using coordinate descent.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.
verbosebool: If True, prints updates during optimization.
num_iterint: The number of coordinate descent iterations. Defaults to 50.
smoothingfloat: Add smoothing to all eigenvalues of the feature-knockoff precision matrix before inverting to avoid numerical instability. Defaults to 0.
converge_tolfloat: A parameter specifying the criteria for convergence.
choldate_warningbool: If True and groups is None, will warn the user if choldate is not installed. Defaults to True.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mrc.solve_mvr_factored(D, U, tol=1e-05, verbose=False, num_iter=15, converge_tol=0.0001)[source]

Computes S-matrix used to generate mvr knockoffs using coordinate descent assuming that the covariance matrix follows a factor model. This means Sigma = D + UU^T for a p x p diagonal matrix D and a p x k matrix U.

Parameters:

Dnp.ndarray: p-shaped array of diagonal elements.
Unp.ndarray: (p, k)-shaped matrix. Usually k << p.
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.
verbosebool: If True, prints updates during optimization.
num_iterint: The number of coordinate descent iterations. Defaults to 50.
converge_tolfloat: A parameter specifying the criteria for convergence.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mac.calc_min_group_eigenvalue(Sigma, groups, tol=1e-05, verbose=False)[source]: Calculates the minimum “group” eigenvalue of a covariance matrix Sigma: see Dai and Barber 2016. This is useful for constructing equicorrelated (group) knockoffs.

knockpy.mac.solve_SDP(Sigma, verbose=False, num_iter=10, tol=1e-05)[source]

Solves ungrouped SDP to create S-matrix for MAC-minimizing knockoffs.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
verbosebool: If True, prints updates during optimization.
num_iterint: Number of iterations in a final binary search to account for numerical errors and ensure 2Sigma - S is PSD.
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.

Returns:

Snp.ndarray: (p, p)-shaped diagonal S-matrix used to generate knockoffs

knockpy.mac.solve_equicorrelated(Sigma, groups, tol=1e-05, verbose=False, num_iter=10)[source]

Calculates the block diagonal matrix S using the equicorrelated method described by Dai and Barber 2016.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
groupsnp.ndarray: For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
tolfloat: Minimum permissible eigenvalue of 2Sigma - S and S.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

knockpy.mac.solve_group_SDP(Sigma, groups=None, verbose=False, objective='abs', norm_type=2, num_iter=10, tol=1e-05, dsdp_warning=True, **kwargs)[source]

Solves the MAC-minimizng SDP formulation for group knockoffs: extends Barer and Candes 2015/ Candes et al 2018.

Sigmanp.ndarray

(p, p)-shaped covariance matrix of X

groupsnp.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

verbosebool

If True, prints updates during optimization.

objectivestr

How to optimize the S matrix for group knockoffs. There are several options: - ‘abs’: minimize sum(abs(Sigma - S)) - ‘pnorm’: minimize Lp-th matrix norm. - ‘norm’: minimize different type of matrix norm (see norm_type below).

norm_typestr or int

When objective == ‘pnorm’, a float specifying which Lp-th matrix norm

to use. Can be any float >= 1. - When objective == ‘norm’, can be ‘fro’, ‘nuc’, np.inf, or 1.

num_iterint

Number of iterations in a final binary search to account for numerical errors and ensure 2Sigma - S is PSD.

tolfloat

Minimum permissible eigenvalue of 2Sigma - S and S.

kwargsdict

Keyword arguments to pass to the cvxpy.Problem.solve() method.

Returns:

Snp.ndarray: (p, p)-shaped (block) diagonal matrix used to generate knockoffs

Kpytorch

The kpytorch submodule includes a miscallaneous set of classes and methods which rely on pytorch for automatic differentation.

class knockpy.kpytorch.deeppink.DeepPinkModel(p, hidden_sizes=[64], y_dist='gaussian', normalize_Z=True)[source]

Methods

`add_module`(name, module)	Add a child module to the current module.
`apply`(fn)	Apply `fn` recursively to every submodule (as returned by `.children()`) as well as self.
`bfloat16`()	Casts all floating point parameters and buffers to `bfloat16` datatype.
`buffers`([recurse])	Return an iterator over module buffers.
`children`()	Return an iterator over immediate children modules.
`compile`(args, *kwargs)	Compile this Module's forward using `torch.compile()`.
`cpu`()	Move all model parameters and buffers to the CPU.
`cuda`([device])	Move all model parameters and buffers to the GPU.
`double`()	Casts all floating point parameters and buffers to `double` datatype.
`eval`()	Set the module in evaluation mode.
`extra_repr`()	Set the extra representation of the module.
`float`()	Casts all floating point parameters and buffers to `float` datatype.
`forward`(features)	Note: features are now shuffled
`get_buffer`(target)	Return the buffer given by `target` if it exists, otherwise throw an error.
`get_extra_state`()	Return any extra state to include in the module's state_dict.
`get_parameter`(target)	Return the parameter given by `target` if it exists, otherwise throw an error.
`get_submodule`(target)	Return the submodule given by `target` if it exists, otherwise throw an error.
`half`()	Casts all floating point parameters and buffers to `half` datatype.
`ipu`([device])	Move all model parameters and buffers to the IPU.
`load_state_dict`(state_dict[, strict, assign])	Copy parameters and buffers from `state_dict` into this module and its descendants.
`modules`()	Return an iterator over all modules in the network.
`mtia`([device])	Move all model parameters and buffers to the MTIA.
`named_buffers`([prefix, recurse, ...])	Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
`named_children`()	Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
`named_modules`([memo, prefix, remove_duplicate])	Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
`named_parameters`([prefix, recurse, ...])	Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
`parameters`([recurse])	Return an iterator over module parameters.
`predict`(features)	Wraps forward method, for compatibility with sklearn classes.
`register_backward_hook`(hook)	Register a backward hook on the module.
`register_buffer`(name, tensor[, persistent])	Add a buffer to the module.
`register_forward_hook`(hook, *[, prepend, ...])	Register a forward hook on the module.
`register_forward_pre_hook`(hook, *[, ...])	Register a forward pre-hook on the module.
`register_full_backward_hook`(hook[, prepend])	Register a backward hook on the module.
`register_full_backward_pre_hook`(hook[, prepend])	Register a backward pre-hook on the module.
`register_load_state_dict_post_hook`(hook)	Register a post-hook to be run after module's `load_state_dict()` is called.
`register_load_state_dict_pre_hook`(hook)	Register a pre-hook to be run before module's `load_state_dict()` is called.
`register_module`(name, module)	Alias for `add_module()`.
`register_parameter`(name, param)	Add a parameter to the module.
`register_state_dict_post_hook`(hook)	Register a post-hook for the `state_dict()` method.
`register_state_dict_pre_hook`(hook)	Register a pre-hook for the `state_dict()` method.
`requires_grad_`([requires_grad])	Change if autograd should record operations on parameters in this module.
`set_extra_state`(state)	Set extra state contained in the loaded state_dict.
`set_submodule`(target, module)	Set the submodule given by `target` if it exists, otherwise throw an error.
`share_memory`()	See `torch.Tensor.share_memory_()`.
`state_dict`(*args[, destination, prefix, ...])	Return a dictionary containing references to the whole state of the module.
`to`(args, *kwargs)	Move and/or cast the parameters and buffers.
`to_empty`(*, device[, recurse])	Move the parameters and buffers to the specified device without copying storage.
`train`([mode])	Set the module in training mode.
`type`(dst_type)	Casts all parameters and buffers to `dst_type`.
`xpu`([device])	Move all model parameters and buffers to the XPU.
`zero_grad`([set_to_none])	Reset gradients of all model parameters.

__call__
feature_importances
l1norm
l2norm

forward(features)[source]: Note: features are now shuffled

predict(features)[source]: Wraps forward method, for compatibility with sklearn classes.

Gradient-based methods for solving MRC problems. Currently only used for group-knockoffs.

class knockpy.kpytorch.mrcgrad.MVRLoss(Sigma, groups, init_S=None, invSigma=None, rec_prop=0, smoothing=0.01, min_smoothing=0.0001, method='mvr')[source]

A pytorch class to compute S-matrices for (gaussian) MX knockoffs which minimizes the trace of the feature-knockoff precision matrix (the inverse of the feature-knockoff covariance/Grahm matrix, G).

Parameters:: Sigma – p x p numpy matrix. Must already

be sorted by groups. :param groups: p length numpy array of groups. These must already be sorted and correspond to the Sigma. :param init_S: The initialization values for the S block-diagonal matrix. - A p x p matrix. The block-diagonal of this matrix, as specified by groups, will be the initial values for the S matrix. - A list of square numpy matrices, with the ith element corresponding to the block of the ith group in S. Default: Half of the identity. :param rec_prop: The proportion of data you are planning to recycle. (The optimal S matrix depends on the recycling proportion.) :param rec_prop: The proportion of knockoffs that will be recycled. :param smoothing: Calculate the loss as sum 1/(eigs + smoothing) as opposed to sum 1/eigs. This is helpful if fitting lasso statistics on extremely degenerate covariance matrices. Over the course of optimization, this smoothing parameter will go to 0. :param method: One of mvr or maxent (mmi for backwards compatability).

Methods

`add_module`(name, module)	Add a child module to the current module.
`apply`(fn)	Apply `fn` recursively to every submodule (as returned by `.children()`) as well as self.
`bfloat16`()	Casts all floating point parameters and buffers to `bfloat16` datatype.
`buffers`([recurse])	Return an iterator over module buffers.
`children`()	Return an iterator over immediate children modules.
`compile`(args, *kwargs)	Compile this Module's forward using `torch.compile()`.
`cpu`()	Move all model parameters and buffers to the CPU.
`cuda`([device])	Move all model parameters and buffers to the GPU.
`double`()	Casts all floating point parameters and buffers to `double` datatype.
`eval`()	Set the module in evaluation mode.
`extra_repr`()	Set the extra representation of the module.
`float`()	Casts all floating point parameters and buffers to `float` datatype.
`forward`([smoothing])	Calculates trace of inverse grahm feature-knockoff matrix
`get_buffer`(target)	Return the buffer given by `target` if it exists, otherwise throw an error.
`get_extra_state`()	Return any extra state to include in the module's state_dict.
`get_parameter`(target)	Return the parameter given by `target` if it exists, otherwise throw an error.
`get_submodule`(target)	Return the submodule given by `target` if it exists, otherwise throw an error.
`half`()	Casts all floating point parameters and buffers to `half` datatype.
`ipu`([device])	Move all model parameters and buffers to the IPU.
`load_state_dict`(state_dict[, strict, assign])	Copy parameters and buffers from `state_dict` into this module and its descendants.
`modules`()	Return an iterator over all modules in the network.
`mtia`([device])	Move all model parameters and buffers to the MTIA.
`named_buffers`([prefix, recurse, ...])	Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
`named_children`()	Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
`named_modules`([memo, prefix, remove_duplicate])	Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
`named_parameters`([prefix, recurse, ...])	Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
`parameters`([recurse])	Return an iterator over module parameters.
`project`(**kwargs)	Project by scaling sqrt_S
`pull_S`()	Returns the S matrix
`register_backward_hook`(hook)	Register a backward hook on the module.
`register_buffer`(name, tensor[, persistent])	Add a buffer to the module.
`register_forward_hook`(hook, *[, prepend, ...])	Register a forward hook on the module.
`register_forward_pre_hook`(hook, *[, ...])	Register a forward pre-hook on the module.
`register_full_backward_hook`(hook[, prepend])	Register a backward hook on the module.
`register_full_backward_pre_hook`(hook[, prepend])	Register a backward pre-hook on the module.
`register_load_state_dict_post_hook`(hook)	Register a post-hook to be run after module's `load_state_dict()` is called.
`register_load_state_dict_pre_hook`(hook)	Register a pre-hook to be run before module's `load_state_dict()` is called.
`register_module`(name, module)	Alias for `add_module()`.
`register_parameter`(name, param)	Add a parameter to the module.
`register_state_dict_post_hook`(hook)	Register a post-hook for the `state_dict()` method.
`register_state_dict_pre_hook`(hook)	Register a pre-hook for the `state_dict()` method.
`requires_grad_`([requires_grad])	Change if autograd should record operations on parameters in this module.
`scale_sqrt_S`(tol, num_iter)	Scales sqrt_S such that 2 Sigma - S is PSD.
`set_extra_state`(state)	Set extra state contained in the loaded state_dict.
`set_submodule`(target, module)	Set the submodule given by `target` if it exists, otherwise throw an error.
`share_memory`()	See `torch.Tensor.share_memory_()`.
`state_dict`(*args[, destination, prefix, ...])	Return a dictionary containing references to the whole state of the module.
`to`(args, *kwargs)	Move and/or cast the parameters and buffers.
`to_empty`(*, device[, recurse])	Move the parameters and buffers to the specified device without copying storage.
`train`([mode])	Set the module in training mode.
`type`(dst_type)	Casts all parameters and buffers to `dst_type`.
`update_sqrt_S`()	Updates sqrt_S using the block parameters
`xpu`([device])	Move all model parameters and buffers to the XPU.
`zero_grad`([set_to_none])	Reset gradients of all model parameters.

__call__

forward(smoothing=None)[source]: Calculates trace of inverse grahm feature-knockoff matrix

project(**kwargs)[source]: Project by scaling sqrt_S

pull_S()[source]: Returns the S matrix

scale_sqrt_S(tol, num_iter)[source]: Scales sqrt_S such that 2 Sigma - S is PSD.

update_sqrt_S()[source]: Updates sqrt_S using the block parameters

class knockpy.kpytorch.mrcgrad.PSGDSolver(Sigma, groups, losscalc=None, lr=0.01, verbose=False, max_epochs=100, tol=1e-05, line_search_iter=10, convergence_tol=0.1, **kwargs)[source]

Projected gradient descent to solve for MRC knockoffs. This will work for non-convex loss objectives as well, although it’s a heuristic optimization method. :param Sigma: p x p numpy array, the correlation matrix :param groups: p-length numpy array specifying groups :param losscalc: A pytorch class wrapping nn.module which contains the following methods: - .forward() which calculates the loss based on the internally stored S matrix. - .project() which ensures that both the internally-stored S matrix as well as (2*Sigma - S) are PSD. - .pull_S(), which returns the internally-stored S matrix. If None, creates a MVRLoss class. :param lr: Initial learning rate (default 1e-2) :param verbose: if true, reports progress :param max_epochs: Maximum number of epochs in SGD :param tol: Mimimum eigenvalue allowed for PSD matrices :param line_search_iter: Number of line searches to do when scaling sqrt_S. :param convergence_tol: After each projection, we calculate improvement = 2/3 * ||prev_opt_S - opt_S||_1 + 1/3 * (improvement) When improvement < convergence_tol, we return. :param kwargs: Passed to MVRLoss

Methods

optimize()

See __init__ for arguments.

cache_S

optimize()[source]: See __init__ for arguments.

knockpy.kpytorch.mrcgrad.block_diag_sparse(*arrs)[source]: Given a list of 2D torch tensors, creates a sparse block-diagonal matrix See https://github.com/pytorch/pytorch/issues/31942

knockpy.kpytorch.mrcgrad.solve_mrc_psgd(Sigma, groups=None, method='mvr', **kwargs)[source]: Wraps the PSGDSolver class. :param Sigma: Covariance matrix :param groups: groups for group knockoffs :param method: MRC loss (mvr or maxent) :param init_kwargs: kwargs to pass to PSGDSolver. :param optimize_kwargs: kwargs to pass to optimizer method. :returns: opt_S

Gaussian graphical model discovery

Knockoffs for detecting edges in gaussian graphical models. See https://arxiv.org/pdf/1908.11611.pdf.

class knockpy.ggm.KnockoffGGM(fstat='lcd', fstat_kwargs=None, knockoff_kwargs=None)[source]

Tests for edges in a Gaussian Graphical Model. See Li and Maathuis (2019) for details.

Parameters:

fstatstr

The feature statistic to use in the knockoff filter: this must be a string and must be a valid fixed-X knockoff feature statistic. Identifiers include: - ‘lasso’ or ‘lcd’: lasso coefficients differences - ‘lsm’: signed maximum of the lasso path statistic as

in Barber and Candes 2015

‘ols’: Ordinary least squares coefficients
‘margcorr’: marginal correlations between features and response

fstat_kwargsdict

Kwargs to pass to the feature statistic fit function, excluding the required arguments, defaults to {}

knockoff_kwargsdict

Kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. Defaults to {}

Notes

There is no known way to use model-X knockoffs for this application.

Examples

Here we fit KnockoffGGM under the global null when the true Gaussian graphical model has no edges:

# Fake data-generating process for Gaussian graphical model import numpy as np X = np.random.randn(300, 30)

# LCD statistic with FX knockoffs from knockpy.ggm import KnockoffGGM gkf = KnockoffGGM(

fstat=’lcd’, knockoff_kwargs={“method”:”mvr”},

) edges = gkf.forward(X=X, verbose=True)

Attributes:

Wnp.array: (p,p)-shaped array of knockoff statistics with zeros along the diagonal. The rows of W obey the flip-sign property, i.e., W[0] obeys the flip-sign property.
kfslist: A list of KnockoffFilter classes corresponding to the regression run on each covariate.

Methods

forward(X[, logic, fdr, ggm_kwargs, verbose])

Runs the GGM filter by applying fixed-X knockoffs

forward(X, logic='and', fdr=0.1, ggm_kwargs=None, verbose=True)[source]

Runs the GGM filter by applying fixed-X knockoffs to each column of X using the other columns as covariates.

Parameters:

Xnp.ndarray: (n, p)-shaped design matrix.
fdrfloat: Nominal level at which to control the FDR. logic : string

One of ‘and’ or ‘or’. This is a hyperparameter used to determine the rejection set. Defaults to “and”.
ggm_kwargsdict: Dictionary of hyperparameters to pass to the ggm.compute_ggm_threshold function. Defaults to {}.
verbosebool: If true, log progress over time.

Returns:

edgesnp.array: (p,p)-shaped symmetric boolean array where edges[i,j] is true if and only if edge (i,j) has been discovered.

Notes

This requires fitting knockoffs p times, so it is quite expensive.

knockpy.ggm.compute_ggm_threshold(W, fdr=0.1, logic='and', a=1, offset=1)[source]

Parameters:

Wnp.array

(p,p)-shaped array of knockoff statistics with zeros along the diagonal. The rows of W must obey the flip-sign property.

fdrfloat: Desired level of FDR control.
logicstring: One of ‘and’ or ‘or’. This is a hyperparameter used to determine the rejection set. Defaults to “and”.
afloat: One of 0.01 or 1. Hyperparameter used to determine the rejection threshold. See Li and Maathuis (2019) for discussion.

offsetint

If offset = 0, control the modified FDR. If offset = 1 (default), controls the FDR exactly.

Returns:

Tnp.array: (p,)-shaped array of knockoff thresholds

Notes

See Algorithm 2, https://arxiv.org/pdf/1908.11611.pdf

knockpy.ggm.discovered_edges(W, T, logic='and')[source]

Parameters:

Wnp.array: (p,p)-shaped array of knockoff statistics with zeros along the diagonal. The rows of W must obey the flip-sign property.
Tnp.array: (p,)-shaped array of knockoff thresholds
logicstring: One of ‘and’ or ‘or’. This is a hyperparameter used to determine the rejection set. Defaults to “and”.

Returns:

edgesnp.array: (p,p)-shaped symmetric boolean array where edges[i,j] is true if and only if edge (i,j) has been discovered.

Quickly creating data-generating processes

A collection of functions for generating synthetic datasets.

knockpy.dgp.AR1(p=30, a=1, b=1, tol=None, max_corr=1, rho=None)[source]: Generates p-dimensional correlation matrix for AR(1) Gaussian process, where successive correlations are drawn from Beta(a,`b`) independelty. If rho is specified, then the process is stationary with correlation rho.

class knockpy.dgp.DGP(mu=None, Sigma=None, invSigma=None, beta=None, gibbs_graph=None)[source]

A utility class which creates a (random) data-generating process for a design matrix X and a response y. If the parameters are not specified, they will be randomly generated during the sample_data method.

Parameters:

munp.ndarray: (p,)-shaped mean vector for X data
Sigmanp.ndarray: (p, p)-shaped covariance matrix for X data
invSigmanp.ndarray: (p, p)-shaped precision matrix for X data
betanp.ndarray: coefficients used to generate y from X in a single index or sparse additive model.
gibbs_graphnp.ndarray: (p, p)-shaped matrix of coefficients for gibbs grid method.

Attributes:

munp.ndarray: See above
Sigmanp.ndarray: See above
invSigmanp.ndarray: See above
betanp.ndarray: See above
gibbs_Graphnp.ndarray: See above
Xnp.ndarray: (n, p)-shaped design matrix
ynp.ndarray: (n, )-shaped response vector

Methods

sample_data([p, n, x_dist, method, y_dist, ...])

(Possibly) generates random data-generating parameters and then samples data using those parameters.

sample_data(p=100, n=50, x_dist='gaussian', method='AR1', y_dist='gaussian', cond_mean='linear', coeff_size=1, coeff_dist=None, sparsity=0.5, groups=None, sign_prob=0.5, iid_signs=True, corr_signals=False, df_t=3, **kwargs)[source]

(Possibly) generates random data-generating parameters and then samples data using those parameters. By default, (X, y) are jointly Gaussian and y is a linear response to X.

Parameters:

nint: The number of data points
pint: The dimensionality of the data
x_diststr: Specifies the distribution of X. One of “Gaussian”, “blockt”, “ar1t”, or “gibbs”.
methodstr: How to generate the covariance matrix of X. One of “AR1”, “NestedAR1”, “partialcorr”, “factor”, “blockequi”, “ver”, “qer”, “dirichlet”, or “uniformdot”. See the docs for each of these methods.
y_diststr: Specifies the distribution of y. One of “Gaussian” or “binomial”.
cond_meanstr: How to calculate the conditional mean of y given X. Defaults to “linear”. See knockpy.dgp.sample_response for more options.
coeff_size: float: Size of non-zero coefficients
coeff_diststr: Specifies the distribution of nonzero coefficients. Three options: - None: all coefficients have absolute value coeff_size. - normal: all nonzero coefficients are drawn from Normal(coeff_size, 1). - uniform: nonzero coeffs are drawn from Unif(coeff_size/2, coeff_size).
sparsityfloat: Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.
groupsnp.ndarray: A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. Else, floor(sparsity * num_groups) groups will be chosen to be non-null, where all elements of each group are non-null.
sign_probfloat: The probability that each nonzero coefficient will be positive.
iid_signsbool: If True, the signs of the coeffs are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.
corr_signalsbool: If true, all of the nonzero coefficients will lie in a consecutive block.
df_tfloat: If the X variables are marginally t-distributed, the degrees of freedom.
kwargs: dict: keyword arguments to pass to method for generating the covariance matrix.

knockpy.dgp.DirichletCorr(p=100, temp=1, tol=1e-06)[source]: Generates a correlation matrix by sampling p eigenvalues from a dirichlet distribution whose p parameters are i.i.d. uniform [tol, temp] and generating a random covariance matrix with those eigenvalues.

knockpy.dgp.ErdosRenyi(p=300, delta=0.2, lower=0.1, upper=1, tol=0.1)[source]: Randomly samples bernoulli flags as well as values for partial correlations to generate sparse square matrices. Follows https://arxiv.org/pdf/1908.11611.pdf.

knockpy.dgp.FactorModel(p=500, rank=2)[source]: Generates random correlation matrix from a factor model with dimension p and rank rank.

knockpy.dgp.NestedAR1(p=500, a=7, b=1, tol=0.001, num_nests=5, nest_size=2)[source]: Generates correlation matrix for AR(1) Gaussian process with hierarchical correlation structure.

knockpy.dgp.PartialCorr(p=300, rho=0.3)[source]: Creates a correlation matrix of dimension p with partial correlation rho.

knockpy.dgp.UniformDot(d=100, p=100, tol=0.01)[source]: Let U be a random d x p matrix with i.i.d. uniform entries. Then Sigma = ``cov2corr``(U^T U)

knockpy.dgp.Wishart(d=100, p=100, tol=0.01)[source]: Let W be a random d x p matrix with i.i.d. Gaussian entries. Then Sigma = ``cov2corr``(W^T W).

knockpy.dgp.block_equi_graph(n=3000, p=1000, block_size=5, sparsity=0.1, rho=0.5, gamma=0, coeff_size=3.5, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, beta=None, mu=None, **kwargs)[source]

Samples data according to a block-equicorrelated Gaussian design, where rho is the within-block correlation and gamma * rho is the between-block correlation.

Parameters:

nint: The number of data points
pint: The dimensionality of the data
block_sizeint: The size of blocks. Defaults to 5.
sparsityfloat: The proportion of groups which are null. Defaults to 0.1
rhofloat: The within-group correlation
gammafloat: The between-group correlation is gamma * rho
betanp.ndarray: If supplied, the (p,)-shaped set of coefficients. Else, will be generated by calling knockpy.dgp.create_sparse_coefficients.
munp.ndarray: The p-dimensional mean of the covariates. Defaults to 0.
kwargsdict: Args passed to knockpy.dgp.sample_response.

Notes

This defaults to the same data-generating process as Dai and Barber 2016 (see https://arxiv.org/abs/1602.03589).

knockpy.dgp.construct_gibbs_grid(n, p, temp=1)[source]: Creates gridlike gibbs_graph parameter for sample_gibbs. See sample_gibbs.

knockpy.dgp.coords2num(lcoord, wcoord, gridwidth=10)[source]: Takes coordinates of variable in a Gibbs grid, returns position

knockpy.dgp.cov2blocks(V, tol=1e-05)[source]: Decomposes a PREORDERED block-diagonal matrix V into its blocks.

knockpy.dgp.create_correlation_tree(corr_matrix, method='single')[source]

Creates hierarchical clustering (correlation tree) from a correlation matrix

Parameters:

corr_matrixnp.ndarray: (p, p)-shaped correlation matrix
methodstr: the method of hierarchical clustering: ‘single’, ‘average’, ‘fro’, or ‘complete’. Defaults to ‘single’.

Returns:

linknp.ndarray: The link of the correlation tree, as in scipy

knockpy.dgp.create_grouping(Sigma, cutoff=0.95, **kwargs)[source]

Creates a knockoff grouping from a covariance matrix using hierarchical clustering.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix
cutoffnp.ndarray: The correlation at which to “cut” the correlation tree. For method=’single’, cutoff=0.95 ensures that no two variables in different groups have absolute correlation greater than 0.95.

Returns:

groupsnp.ndarray: (p,)-shaped array of groups for each variable

knockpy.dgp.create_sparse_coefficients(p, sparsity=0.5, groups=None, coeff_size=1, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, n=None)[source]

Generate a set of sparse coefficients for single index or sparse additive models. p : int

Dimensionality of coefficients

sparsityfloat: Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.
coeff_size: float: Size of non-zero coefficients
groupsnp.ndarray: A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. Else, floor(sparsity * num_groups) groups will be chosen to be non-null, where all elements of each group are non-null.
sign_probfloat: The probability that each nonzero coefficient will be positive.
iid_signsbool: If True, the signs of the coeffs are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.
coeff_diststr: Specifies the distribution of nonzero coefficients. Three options: - None: all coefficients have absolute value coeff_size. - normal: all nonzero coefficients are drawn from Normal(coeff_size, 1). - uniform: nonzero coeffs are drawn from Unif(coeff_size/2, coeff_size).
corr_signalsbool: If true, all of the nonzero coefficients will lie in a consecutive block.

Returns:

betanp.ndarray: (p,)-shaped array of sparse coefficients

knockpy.dgp.graph2cliques(Q)[source]: Turns graph Q of connections into binary cliques for Gibbs grid

knockpy.dgp.num2coords(i, gridwidth=10)[source]

Coordinates of variable i in a Gibbs grid

Parameters:

iint: Position of variable in ordering
gridwidthint: Width of the grid

Returns:

length_coord, width_coord (coordinates)

knockpy.dgp.sample_ar1t(rhos, n=50, df_t=3)[source]

Samples t-distributed variables according to a Markov chain.

Parameters:

rhosnp.ndarray: (p-1, )-length array of correlations between consecutive variables.
nint: The number of data points to sample
df_tfloat: The degrees of freedom of the t variables

Returns:

Xnp.ndarray: (n, p)-shaped design matrix

knockpy.dgp.sample_block_tmvn(blocks, n=50, df_t=3)[source]

Samples a blocks of multivariate-t distributed variables according to a list of covariance matrices called blocks.

Parameters:

blockslist: A list of square, symmetric numpy arrays. These are the covariance matrices for each block of variables.
nint: The number of data points to sample
df_tfloat: The degrees of freedom of the t-distribution

Notes

The variables are scaled such that the marginal variance of each variable equals a diagonal element in one of the blocks.

knockpy.dgp.sample_gibbs(n, p, gibbs_graph=None, temp=1, num_iter=15, K=20, max_val=2.5)[source]

Samples from a discrete Gibbs measure using a gibbs sampler.

The joint likelihood for the p-dimensional data X1 through Xp is the product of all terms of the following form:

`np.exp(-1*gibbs_graph[i,j]*np.abs(Xi - Xj))`

where gibbs_graph is assumed to be symmetric (and is usually sparse). Each feature takes values on an evenly spaced grid from -1*max_val to max_val with K values.

Parameters:

nint: How many data points to sample
pint: Dimensionality of the data
gibbs_graphnp.ndarray: A symmetric (p, p)-shaped array. See likelihood equation. By default, this is corresponds to an undirected graphical model with a square grid (like an Ising model) with nonzero entries set to 1 or -1 with equal probability.
tempfloat: Governs the strength of interactions between features—see the likelihood equation.
num_iterint: Number of iterations in the Gibbs sampler; defaults to 15.
Kint: Number of discrete values each sampled feature can take.
max_valfloat: The maximum absolute value each feature can take.

Returns:

Xnp.ndarray: (n, p)-shaped array of data
gibbs_graphnp.ndarray: The generated gibbs_graph.

knockpy.dgp.sample_response(X, beta, cond_mean='linear', y_dist='gaussian')[source]

Given a design matrix X and coefficients beta, samples a response y.

Xnp.ndarray

(n, p)-shaped design matrix

betanp.ndarray

(p, )-shaped coefficient vector

cond_meanstr

How to calculate the conditional mean of y given X, denoted mu(X). Six options:

“linear” denotes np.dot(X, beta)

“cubic” denotes np.dot(X**3, beta) - np.dot(X, beta)

3. “trunclinear” ((X * beta >= 1).sum(axis = 1)) Stands for truncated linear. 4. “pairint”: pairs up non-null coefficients according to the order of beta, multiplies them and their beta values, then sums. “pairint” stands for pairwise-interactions. 5. “cos”: mu(X) = sign(beta) * (beta != 0) * np.cos(X) 6. “quadratic”: mu(X) = np.dot(np.power(X, 2), beta)

y_diststr

If “gaussian”, y is the conditional mean plus gaussian noise. If “binomial”, Pr(y=1) = softmax(cond_mean).

Returns:

ynp.ndarray: (n,)-shaped response vector

Utility functions

knockpy.utilities.apply_pool(func, constant_inputs={}, num_processes=1, **kwargs)[source]

Spawns num_processes processes to apply func to many different arguments. This wraps the multiprocessing.pool object plus the functools partial function.

Parameters:

funcfunction: An arbitrary function
constant_inputsdictionary: A dictionary of arguments to func which do not change in each of the processes spawned, defaults to {}.
num_processesint: The maximum number of processes spawned, defaults to 1.
kwargsdict: Each key should correspond to an argument to func and should map to a list of different arguments.

Returns:

outputslist: List of outputs for each input, in the order of the inputs.

Examples

If we are varying inputs ‘a’ and ‘b’, we might have ``apply_pool(

func=my_func, a=[1,3,5], b=[2,4,6]

)`` which would return [my_func(a=1, b=2), my_func(a=3,b=4), my_func(a=5,b=6)].

knockpy.utilities.blockdiag_to_blocks(M, groups)[source]

Given a square array M, returns a list of diagonal blocks of M as specified by groups.

Parameters:

Mnp.ndarray: (p, p)-shaped array
groupsnp.ndarray: (p, )-shaped array with m unique values. If groups[i] == j, this indicates that coordinate i belongs to group j.

Returns:

blockslist: A list of square np.ndarrays. blocks[i] corresponds to group identified by the ith smallest unique value of groups.

knockpy.utilities.calc_group_sizes(groups)[source]

Given a list of groups, finds the sizes of the groups.

Parameters:

groupsnp.ndarray: (p, )-shaped array which takes m integer values from 1 to m. If groups[i] == j, this indicates that coordinate i belongs to group j.
:param groups: p-length array of integers between 1 and m,

Returns:

sizesnp.ndarray: (m, )-length array of group sizes.

knockpy.utilities.calc_mineig(M)[source]: Calculates the minimum eigenvalue of a square symmetric matrix M

knockpy.utilities.chol2inv(X)[source]: Uses cholesky decomp to get inverse of matrix

knockpy.utilities.cov2corr(M, invM=None)[source]: Rescales a p x p cov. matrix M to be a correlation matrix. If invM is not None, also rescales invM to be the inverse of M.

knockpy.utilities.estimate_covariance(X, tol=0.0001, shrinkage='ledoitwolf', **kwargs)[source]

Estimates covariance matrix of X.

Parameters:

Xnp.ndarray: (n, p)-shaped design matrix
shrinkagestr: The type of shrinkage to apply during estimation. One of “ledoitwolf”, “graphicallasso”, or None (no shrinkage).
tolfloat: If shrinkage is None but the minimum eigenvalue of the MLE is below tol, apply LedoitWolf shrinkage anyway.
kwargsdict: kwargs to pass to the shrinkage estimator.

Returns:

Sigmanp.ndarray: (p, p)-shaped estimated covariance matrix of X
invSigmanp.ndarray: (p, p)-shaped estimated precision matrix of X

knockpy.utilities.estimate_factor(Sigma, num_factors=20, num_iter=10)[source]

Approximates Sigma = np.diag(D) + np.dot(U, U.T). See https://arxiv.org/pdf/2006.08790.pdf.

Parameters:

Sigmanp.ndarray: (p, p)-shaped covariance matrix of X
num_factorsint: Dimensionality of U.

Returns:

Dnp.ndarray: (p,)-shaped array of diagonal elements.
Unp.ndarray: (p, num_factors)-shaped array.

Notes

TODO: allow X as an input when Sigma does not fit in memory.

knockpy.utilities.fetch_group_nonnulls(non_nulls, groups)[source]: Combines feature-level null hypotheses into group-level hypothesis.

knockpy.utilities.permute_matrix_by_groups(groups)[source]: Create indices which permute a (covariance) matrix according to a list of groups.

knockpy.utilities.preprocess_groups(groups)[source]: Maps the m unique elements of a 1D “groups” array to the integers from 1 to m.

knockpy.utilities.random_permutation_inds(length)[source]

Returns indexes which will randomly permute/unpermute a numpy array of length length. Also returns indices which will undo this permutation.

Returns:

indsnp.ndarray: (length,)-shaped ndarray corresponding to a random permutation from 0 to length-1.
rev_indsnp.ndarray: (length,)-shaped ndarray such that for any (length,)-shaped array called x, x[inds][rev_inds] equals x.

knockpy.utilities.scale_until_PSD(Sigma, S, tol, num_iter)[source]

Perform a binary search to find the largest gamma such that the minimum eigenvalue of 2*Sigma - gamma*S is at least tol.

Returns:

gamma * Snp.ndarray: See description.
gammafloat: See description

knockpy.utilities.shift_until_PSD(M, tol)[source]: Add the identity until a p x p matrix M has eigenvalues of at least tol

knockpy.utilities.srand(val: int) → None[source]: Some knockpy functions invoke the C rand() function to generate random numbers. Invoking srand() with some fixed value can help ensure repeatable results.