API Reference
BLiP
- pyblip.blip.BLiP(samples=None, cand_groups=None, weight_fn='inverse_size', error='fdr', q=0.1, max_pep=0.25, deterministic=True, verbose=True, solver_verbose=False, perturb=True, max_iters=100, search_method='binary', return_problem_status=False, **kwargs)[source]
Given samples from a posterior or a list of
CandidateGroup
objects, performs resolution-adaptive signal detection to maximize power while controlling (e.g.) the FDR.Note: when working with image data or a continuous set of locations, consider using
BLiP_cts
.- Parameters
- samplesnp.array
An
(N, p)
-shaped array of posterior samples, where a nonzero value indicates the presence of a signal. Defaults to None.- cand_groupslist
A list of
CandidateGroup
objects. Defaults to None. Note eithercand_groups
orsamples
must be provided.- weight_fnstring or function
A function mapping
CandidateGroup
objects to nonnegative weights. This defines the power function. Also can be ‘inverse_size’, ‘log_inverse_size’, or ‘prespecified’. Defaults to ‘inverse_size’.- errorstring
The type of error rate to control. Must be one of ‘fdr’, ‘local_fdr’, ‘fwer’, ‘pfer’.
- qfloat
The level at which to control the error rate.
- max_pepfloat
BLiP automatically filters out candidate groups with a posterior error probability (PEP) above
max_pep
. Default: 0.5.- deterministicbool
If True, guarantees a deterministic solution. Defaults to True.
- verbosebool
If True, gives occasional progress reports.
- solver_verbosebool
If True, the underlying LP solver is verbose.
- search_methodstring
For FWER control, how to find the optimal parameter for the LP. Either “none” or “binary.” Defaults to “binary”.
- max_itersint
Maximum number of binary-search iterations for FWER problem. Defaults to 100.
- perturbbool
If True, will perturb the weight function. Defaults to True.
- return_problem_statusFalse
If True, will return extra information about the problem.
- Returns
- detectionslist
A list of
CandidateGroup
objects, which asserts that each group contains at least one signal.- problem_statusdict
A dict containing information about the BLiP optimization problem. Only returned if
return_problem_status=True
.
Examples
Here we fit BLiP to perform resolution-adaptive variable selection for a Gaussian linear model:
import numpy as np import scipy.linalg ## Synthetic data generating process with AR1 design matrix n, p, nsignals, rho = 100, 500, 20, 0.95 c = np.cumsum(np.zeros(p) + np.log(rho)) - np.log(rho) cov = scipy.linalg.toeplitz(np.exp(c)) X = np.dot(np.random.randn(n, p), np.linalg.cholesky(cov).T) # Sparse coefficients for linear model beta = np.zeros(p) signal_locations = np.random.choice(np.arange(p), nsignals) beta[signal_locations] = np.random.randn(nsignals) y = np.dot(X, beta) + np.random.randn(n) ## Fit linear spike and slab model import pyblip lm = pyblip.linear.LinearSpikeSlab(X=X, y=y) lm.sample(N=1000, chains=10, bsize=2) ## Run BLiP detections = pyblip.blip.BLiP( samples=lm.betas, q=0.1, error='fdr' )
- pyblip.blip.BLiP_cts(locs, grid_sizes=array([4.000e+00, 6.000e+00, 8.000e+00, 1.100e+01, 1.500e+01, 2.000e+01, 2.800e+01, 3.900e+01, 5.400e+01, 7.500e+01, 1.040e+02, 1.440e+02, 2.000e+02, 2.770e+02, 3.840e+02, 5.320e+02, 7.370e+02, 1.021e+03, 1.414e+03, 1.959e+03, 2.714e+03, 3.761e+03, 5.210e+03, 7.218e+03, 1.000e+04]), weight_fn=<function inverse_radius_weight>, extra_centers=None, max_pep=0.25, shape='circle', min_blip_size=1500, verbose=True, **kwargs)[source]
BLiP when the set of locations is continuous, e.g., when working with image data.
- Parameters
- locsnp.ndarray
A (N, num_disc, d)-dimensional array, where
N
is the number of posterior samples,d
is the dimension. Each point represents to a signal in a posterior sample. NANs are ignored (useful when the number of signals changes between posterior iterations.)- grid_sizeslist or np.ndarray
List of grid sizes to split up the locations.
- weight_fnstring or function
A function mapping
CandidateGroup
objects to nonnegative weights. This defines the power function. Defaults toinverse_radius
.- extra_centersnp.ndarray
A (ncenters, d)-dimensional array. At each resolution, candidate groups will be computed with centers at these locations.
- max_pepfloat
The maximum allowable PEP for output candidate groups. Defaults to 0.25.
- shapestring
One of
square
orcircle
- min_blip_sizeint
Combines connected components so all subproblems are of at least this size.
- verbosebool
If True (default), gives occasional status updates.
- kwargsdict
Additional arguments to pass to the underlying BLiP call, such as the error rate or nominal level.
- Returns
- detectionslist
A list of
CandidateGroup
objects, which asserts that each group contains at least one signal.- problem_statusdict
A dict containing information about the BLiP optimization problem. Only returned if
return_problem_status=True
.
- pyblip.blip.binarize_selections(cand_groups, q, error, deterministic, problem_status=None, return_problem_status=False, tol=0.001, nsample=10, verbose=False)[source]
- Parameters
- cand_groupslist
list of candidate groups.
- qfloat
Level at which to control the error rate.
- errorstring
The error to control: one of ‘fdr’, ‘fwer’, ‘pfer’, ‘local_fdr’
- deterministicbool
If True, will not use a randomized solution.
- nsampleint
Number of samples for randomized version.
- Returns
- detectionslist
List of candidate groups which have been detected.
Creating Candidate Groups
Variable selection / discrete locations
Functions for creating candidate groups from posterior samples
- class pyblip.create_groups.CandidateGroup(group, pep, data=None)[source]
A single candidate group as an input to BLiP. Discovering a group asserts there is at least one signal in the group.
- Parameters
- grouplist or set
Set of locations in the group.
- pepfloat
Posterior error probability (1 - PIP) for the group.
- datadict
Miscallaneous other attributes to associate with the group.
Methods
to_dict
()Converts
CandidateGroup
into a dictionary.
- pyblip.create_groups.all_cand_groups(samples, X=None, q=0, max_pep=0.25, max_size=25, prenarrow=True, prefilter_thresholds=[0, 0.01, 0.02, 0.03, 0.05, 0.1, 0.2], min_purity=0.0, corr_matrix=None)[source]
Creates many candidate groups by prefiltering locations at various thresholds and then creating sequential and hierarchical candidate groups.
- Parameters
- samplesnp.ndarray
An
(N, p)
-shaped array of posterior samples, where a nonzero value indicates the presence of a signal.- Xnp.array
The n x p design matrix. Defaults to
None.
If provided, adds hierarchical groups based on a correlation cluster of X.- qfloat
The nominal level at which to control the error rate.
- max_pepfloat
The maximum posterior error probability allowed in a candidate group. Default is 0.25.
- max_sizefloat
Maximum size of a group. Default is 25.
- prenarrowbool
If true, “prenarrows” the candidate groups as described in the paper. Defaults to False.
- prefilter_thresholdslist
List of thresholds at which to prefilter the locations.
- min_purityfloat
Minimum acceptable absolute correlation within a candidate group. Defaults to zero.
- corr_matrixnp.array
Optional input; p x p correlation matrix. Only used if
X=None
andmin_purity > 0
.
- Returns
- cand_groupslist
A list of
CandidateGroup
objects.
- pyblip.create_groups.filter_by_purity(cand_groups, min_purity=0, corr_matrix=None, X=None)[source]
Filters candidate groups to only include groups of features whose minimum absolute correlation is larger than
min_purity
.- Parameters
- cand_groupslist
A list of CandidateGroup objects.
- min_purityfloat
Minimum acceptable purity.
- corr_matrixnp.array
p x p correlation matrix
- Xnp.array
n x p design matrix
- pyblip.create_groups.hierarchical_groups(samples, dist_matrix=None, max_pep=0.25, max_size=25, filter_sequential=False, min_purity=0, X=None, corr_matrix=None, **kwargs)[source]
Creates candidate groups by hierarchical clustering.
- Parameters
- samplesnp.ndarray
An
(N, p)
-shaped array of posterior samples, where a nonzero value indicates the presence of a signal.- dist_matrixnp.ndarray or list
A square numpy array corresponding to distances between locations. Can also be a list of different distance matrices. The default is to use a correlation matrix based on
samples.
- max_pepfloat
The maximum posterior error probability allowed in a candidate group. Default is 1.
- max_sizefloat
Maximum size of a group. Default is 25.
- filter_sequentialbool
If True, does not calculate PEPs for sequential (contiguous) groups of variables to avoid duplicates.
- min_purityfloat
Minimum acceptable absolute correlation within a candidate group. Defaults to zero.
- Xnp.array
Optional input; n x p design matrix. Only used if
corr_matrix=None
andmin_purity > 0
.- corr_matrixnp.array
Optional input; p x p correlation matrix. Only used if
X=None
andmin_purity > 0
.
- Returns
- cand_groupslist
A list of
CandidateGroup
objects.
- pyblip.create_groups.sequential_groups(samples=None, susie_alphas=None, q=0, max_pep=0.25, max_size=25, prenarrow=False, min_purity=0.0, X=None, corr_matrix=None)[source]
Calculates all sequential candidate groups below max_size.
- Parameters
- samplesnp.ndarray
An
(N, p)
-shaped array of posterior samples, where a nonzero value indicates the presence of a signal.- susie_alphasnp.ndarray
As an alternative to posterior samples, users may specify an L x p matrix of alphas from a SuSiE object. However, calling
susie_groups
is recommended instead.- qfloat
The nominal level at which to control the error rate.
- max_pepfloat
The maximum posterior error probability allowed in a candidate group. Default is 1.
- max_sizefloat
Maximum size of a group. Default is 25.
- prenarrowbool
If true, “prenarrows” the candidate groups as described in the paper. Defaults to False.
- min_purityfloat
Minimum acceptable absolute correlation within a candidate group. Defaults to zero.
- Xnp.array
Optional input; n x p design matrix. Only used if
corr_matrix=None
andmin_purity > 0
.- corr_matrixnp.array
Optional input; p x p correlation matrix. Only used if
X=None
andmin_purity > 0
.
- Returns
- cand_groupslist
A list of
CandidateGroup
objects.
- pyblip.create_groups.susie_groups(alphas, q, X=None, max_pep=0.25, max_size=25, prenarrow=False, purity_threshold=0.0, k_threshold=None, min_purity=0.0, corr_matrix=None)[source]
Creates candidate groups based on a SuSiE fit.
- Parameters
- alphasnp.array
An
(L, p)
-shaped matrix of alphas from a SuSiE object,L
is the number of SuSiE iterations andp
is the number of covariates.- qfloat
The level at which to control the error rate
- Xnp.array
The n x p design matrix. Defaults to
None.
If provided, adds hierarchical groups based on a correlation cluster of X.- max_pepfloat
The maximum posterior error probability allowed in a candidate group. Default is 1.
- max_sizefloat
Maximum size of a group. Default is 25.
- prenarrowbool
If true, “prenarrows” the candidate groups as described in the paper. Defaults to False.
- purity_thresholdfloat
When computing PIPs, ignores SuSiE iterations which do not pass this purity threshold, as in the original SusiE paper.
- k_thresholdfloat
When computing PIPs, ignores SuSiE iterations for which the vanilla SuSiE CS exceeds this size.
- min_purityfloat
Minimum acceptable absolute correlation within a candidate group. Defaults to zero.
- corr_matrixnp.array
Optional input; p x p correlation matrix. Only used if
X=None
andmin_purity > 0
.
- Returns
- cand_groupslist
A list of
CandidateGroup
objects.
Continuous sets of locations
Creates candidate groups when signals could appear anywhere in a continuous d-dimensional space.
- pyblip.create_groups_cts.grid_peps(locs, grid_sizes, count_signals=False, extra_centers=None, max_pep=0.25, log_interval=None, time0=None, shape='square')[source]
- Parameters
- locsnp.array
A (N, num_disc, d)-dimensional array. Here, N is the number of samples from the posterior, d is the number of dimensions of the space, and each point corresponds to a signal in a particular posterior sample.
- grid_sizeslist or np.ndarray
List of grid-sizes to split up the locations.
- count_signalsbool
If True, the number of signals in each group is counted.
- extra_centersnp.ndarray
A (ncenters, d)-dimensional array. At each resolution, candidate groups will be computed with centers at these locations.
- shapestring
One of
square
orcircle
- Returns
- cand_groupslist
A list of CandidateGroup objects.
- pyblip.create_groups_cts.grid_peps_to_cand_groups(filtered_peps, time0=None, min_blip_size=1000, verbose=False, shape='square', max_pep=0.25, min_pep=0.001)[source]
Turns the output of the
grid_peps
function into a list of list of CandidateGroups. Each sub-list corresponds to a list of completely disconnected CandidateGroups which can be fed to BLiP separately (this saves computation).- filtered_pepsdict
An output of the
grid_peps
function.- time0float
The initial time the analysis started, useful for logging.
- min_blip_sizeint
Combines connected components so all subproblems are at least this size.
- verbose: bool
If True, will report progress over time. Default: False.
- shapestring
One of
square
orcircle
- max_pepfloat
The maximum pep for candidate groups. Default: 1.
- min_pepfloat
Once we achieve a pep of this level for candidate groups, we do not search to find peps any lower. Default: 0.001.
- pyblip.create_groups_cts.normalize_locs(locs)[source]
- Returns
- locsnp.array
locs, but normalized such that all values lie in [0,1]. NAs are converted to -1.
- shiftsnp.array
d
-dimensional array corresponding to the shifts applied in the normalizatin.- scalesnp.array
d
-dimension array corresponding to the scales in the normalization.
Edge clique cover
Polynomial-time function to find a (heuristically) minimal edge clique cover
Weight functions
A few default choices of weight function to define power.
- pyblip.weight_fns.inverse_area_weight(cand_group)[source]
Weights candidate groups by the inverse of their area.
- Parameters
- cand_groupCandidateGroup
The candidate group
- Returns
- weightfloat
The associated weight when calculating power.
- pyblip.weight_fns.inverse_radius_weight(cand_group)[source]
Weights candidate groups by the inverse of the radius.
- Parameters
- cand_groupCandidateGroup
The candidate group
- Returns
- weightfloat
The associated weight when calculating power.
Bayesian Samplers
BLiP can wrap on top of nearly any Bayesian model or algorithm. However,
for convenience, pyblip
includes a few custom MCMC samplers.
Linear Spike and Slab
- class pyblip.linear.linear.LinearSpikeSlab(X, y, p0=0.9, p0_a0=1, p0_b0=1, update_p0=True, min_p0=0, sigma2=1, update_sigma2=True, sigma2_a0=2, sigma2_b0=1, tau2=1, tau2_a0=2, tau2_b0=1, update_tau2=True)[source]
Spike-and-slab model for linear regression.
- Parameters
- Xnp.array
(n,p)
-shaped design matrix.- ynp.array
n
-length array of responses.- p0float
Prior probability that any coefficient equals zero.
- update_p0bool
If True, updates
p0
using a Beta hyperprior onp0
. Else, the value ofp0
is fixed.- p0_a0float
If
update_p0
is True,p0
has a Beta(p0_a0
,p0_b0
,min_p0
) hyperprior.- p0_b0float
If
update_p0
is True,p0
has a TruncBeta(p0_a0
,p0_b0
,min_p0
) hyperprior.- min_p0float
Minimum value for
p0
as specified by the prior.- sigma2float
Variance of y given X.
- update_sigma2bool
If True, imposes an InverseGamma hyperprior on
sigma2
. Else, the value ofsigma2
is fixed.- sigma2_a0float
If
update_sigma2
is True,sigma2
has an InvGamma(sigma2_a0
,sigma2_b0
) hyperprior.- sigma2_b0float
If
update_sigma2
is True,sigma2
has an InvGamma(sigma2_a0
,sigma2_b0
) hyperprior.- tau2float
Prior variance on nonzero coefficients.
- update_tau2bool
If True, imposes an InverseGamma hyperprior on
tau2
. Else, the value oftau2
is fixed.- tau2_a0float
If
update_tau2
is True,tau2
has an InvGamma(tau2_a0
,tau2_b0
) hyperprior.- tau2_b0float
If
update_tau2
is True,tau2
has an InvGamma(tau2_a0
,tau2_b0
) hyperprior.
Methods
sample:
Samples from the posterior using Gibbs sampling.
- sample(N, burn=100, chains=1, num_processes=1, bsize=1, max_signals_per_block=None)[source]
- Nint
Number of samples per chain
- burnint
Number of samples to burn per chain
- chainsint
Number of chains to run
- num_processesint
How many processes to use
- bsizeint
Maximum block size within gibbs sampling. Default: 1.
- max_signals_per_blockint
Maximum number of signals allowed per block. Default: None (this places no restrictions on the number of signals per block). The default is highly recommended.
Probit Spike and Slab
- class pyblip.probit.probit.ProbitSpikeSlab(*args, **kwargs)[source]
Spike-and-slab model for probit regression. The arguments are identical to those for
LinearSpikeSlab
, excepty
should be binary.Methods
sample:
Samples from the posterior using Gibbs sampling.
- sample(N, burn=100, chains=1, num_processes=1, bsize=1, max_signals_per_block=None)[source]
- Parameters
- Nint
Number of samples per chain
- burnint
Number of samples to burn per chain
- chainsint
Number of chains to run
- num_processesint
How many processes to use
- bsizeint
Maximum block size within gibbs sampling. Default: 1.
- max_signals_per_blockint
Maximum number of signals allowed per block. Default: None (this places no restrictions on the number of signals per block). The default is highly recommended.
Neuronized Priors
- class pyblip.nprior.nprior.NPrior(X, y, tauw2, p0=None, update_p0=True, min_p0=1e-10, sigma_prior_type=0, sigma_a0=5, sigma_b0=1, alpha0_a0=1, alpha0_b0=1)[source]
Implements Neuronized Prior sampler for spike-and-slab regression.
- Parameters
- Xnp.array
(n,p)
-shaped design matrix- ynp.array
(n,)
-shaped vector of response data- p0float
The initial parameter for the proportion of nulls. Defaults to 1 - min(0.01, 1/p).
- update_p0bool
If true, will update p0 throughout MCMC sampling using a uniform hyperprior.
- min_p0float
If updating p0 throughout uniform sampling, will force p0 to be above
min_p0
. This can dramatically speed up computation in very high-dimensional settings.- sigma_prior_typeinteger
If 0, assumes sigma2 is conditionally independent of the coefficients given the residuals.
- tauw2float
prior variance of the weight parameter
- a0float
sigma2 has an inverse-gamma prior with parameters a0, b0
- b0float
sigma2 has an inverse-gamma prior with parameters a0, b0
Notes
See https://arxiv.org/pdf/1810.00141.pdf.
Methods
sample
([N, burn, chains, num_processes, ...])- Parameters
- sample(N=100, burn=10, chains=1, num_processes=1, joint_sample_W=True, group_alpha_update=True, log_interval=None)[source]
- Parameters
- Nint
The number of samples to draw from the chain
- burnint
The burn-in period for each chain.
- chainsint
The number of independent MCMC chains to run.
- num_processesint
The number of processes to run the chains.
- joint_sample_Wbool
If true, will jointly sample the “W” variables at each iteration before individually resampling alpha and W. This can improve sample efficiency but is a computational bottleneck in high dimensions.
- group_alpha_updatebool
If true, does a joint group-move update to estimate the sparsity. Else, uses the standard conjugacy rules for a Uniform prior on the sparsity.
- log_intervalint
Will log progress after
log_interval
iterations. Defaults to None (no logging).
Changepoint Detection
- pyblip.changepoint.detect_changepoints(Y, q=0.1, lm_kwargs={}, sample_kwargs={}, blip_kwargs={})[source]
Changepoint detection with BLiP using the LinearSpikeSlab sampler.
- Parameters
- Ynp.array
Array of observations in order they were observed.
- qfloat
Level at which to control the FDR.
- **kwargsdict
Optional inputs to linear spike slab model.
- **sample_kwargsdict
Optional inputs to
sample
method of linear spike slab model.- **blip_kwargsdict
Optional inputs to BLiP.