{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### MRC Knockoffs Primer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a set of $p$ features $X = (X_1, \\dots, X_p)$, how should one generate knockoffs $\\tilde{X}$? A \"conventional\" method is to minimize the mean absolute correlation, defined as \n", "$$\\mathrm{MAC} = \\sum_{j=1}^p |\\text{Cor}(X_j, \\tilde{X}_j)|. $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As discussed in [Spector and Janson (2020)](https://arxiv.org/abs/2011.14625), MAC-minimizing knockoffs (known as \"SDP knockoffs\") perform poorly when the features are correlated. In particular, minimizing the MAC often makes it possible to *reconstruct* each feature $X_j$ using the other features $X_{-j}$ and the knockoffs $\\tilde{X}$. In other words, SDP knockoffs often ensures that no features contain any unique information, which makes it very hard to determine which features are important and which are unimportant." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A better approach to constructing knockoffs is to *minimizing the reconstructability* (MRC) of each feature $X_j$. Intuitively, this framework maximizes the amount of unique information each feature contains, making it easier to determine which features are important. **MVR knockoffs** are one instantiation of this framework, and as illustrated\n", "below, they have much higher power than SDP knockoffs in a variety of settings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 1**: As an example, we consider a case where $X$ is Gaussian and the correlation between each pair of features is $50\\%$. In this situation, SDP knockoffs are provably nearly powerless to select any features. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SDP knockoffs made 0.0 discoveries!\n" ] } ], "source": [ "import numpy as np\n", "import knockpy as kpy\n", "from knockpy.knockoff_filter import KnockoffFilter\n", "\n", "p = 300\n", "n = 600\n", "np.random.seed(110)\n", "\n", "# Covariance matrix of X\n", "rho = 0.5\n", "Sigma = (1-rho) * np.eye(p) + rho * np.ones((p, p))\n", "X = np.random.multivariate_normal(np.zeros(p), cov=Sigma, size=(n,))\n", "\n", "# Sample y given X\n", "beta = kpy.dgp.create_sparse_coefficients(\n", " p=p, sparsity=0.2, coeff_size=1, coeff_dist=\"uniform\"\n", ")\n", "y = X @ beta + np.random.randn(n)\n", "\n", "# SDP knockoff filter\n", "kfilter_sdp = KnockoffFilter(\n", " fstat='lasso', ksampler='gaussian', knockoff_kwargs={\"method\":\"sdp\"}\n", ")\n", "selections_sdp = kfilter_sdp.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n", "print(f\"SDP knockoffs made {selections_sdp.sum()} discoveries!\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MVR knockoffs made 67.0 discoveries.\n", "MVR knockoffs had a power of 100.0% and false positive rate of 10.45%.\n" ] } ], "source": [ "# Run knockoff filter\n", "kfilter_mvr = KnockoffFilter(fstat='lasso', ksampler='gaussian')\n", "selections_mvr = kfilter_mvr.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n", "\n", "# calculate power and false positive rate\n", "power = np.around(100*np.dot(selections_mvr, beta != 0) / max(1, (beta != 0).sum()), 2)\n", "fdp = np.around(100*np.dot(selections_mvr, beta == 0) / max(1, selections_mvr.sum()), 2)\n", "print(f\"MVR knockoffs made {selections_mvr.sum()} discoveries.\")\n", "print(f\"MVR knockoffs had a power of {power}% and false positive rate of {fdp}%.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2**: As shown below, SDP knockoffs also have low power when X has an AR1 covariance matrix. See [Spector and Janson, 2020](https://arxiv.org/abs/2011.14625) for more discussion on this setting." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Generate data where X is Gaussian with an AR1 covariance matrix\n", "# where Cov(Xj, X_{j+1}) is sampled from Beta(a,b)\n", "# and y | X is a sparse Gaussian linear model\n", "np.random.seed(111)\n", "data_gen_process = kpy.dgp.DGP()\n", "data_gen_process.sample_data(\n", " method='ar1', \n", " a=3,\n", " b=1,\n", " n=650, # number of data-points\n", " p=500, # dimensionality\n", " sparsity=0.1, # number of non-null coefficients\n", " coeff_dist='uniform', # distribution of size of non-null coefficients\n", " coeff_size=1, # size of non-null coefficients\n", " corr_signals=True # non-nulls features are correlated \n", ")\n", "X = data_gen_process.X\n", "y = data_gen_process.y\n", "beta = data_gen_process.beta\n", "Sigma = data_gen_process.Sigma" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SDP knockoffs made 0.0 discoveries!\n", "SDP knockoffs had a power of 0.0% and false positive rate of 0.0%.\n" ] } ], "source": [ "# run SDP knockoff filter\n", "kfilter_sdp = KnockoffFilter(\n", " fstat='lasso', ksampler='gaussian', knockoff_kwargs={\"method\":\"sdp\"}\n", ")\n", "selections_sdp = kfilter_sdp.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n", "\n", "# Calculate power/FDR \n", "power = np.around(100*np.dot(selections_sdp, beta != 0) / max(1, (beta != 0).sum()), 2)\n", "fdp = np.around(100*np.dot(selections_sdp, beta == 0) / max(1, selections_sdp.sum()), 2)\n", "print(f\"SDP knockoffs made {selections_sdp.sum()} discoveries!\")\n", "print(f\"SDP knockoffs had a power of {power}% and false positive rate of {fdp}%.\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MVR knockoffs made 55.0 discoveries.\n", "MVR knockoffs had a power of 94.0% and false positive rate of 14.55%.\n" ] } ], "source": [ "# Run knockoff filter\n", "kfilter_mvr = KnockoffFilter(fstat='lasso', ksampler='gaussian')\n", "selections_mvr = kfilter_mvr.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n", "\n", "# calculate power and false positive rate\n", "power = np.around(100*np.dot(selections_mvr, beta != 0) / max(1, (beta != 0).sum()), 2)\n", "fdp = np.around(100*np.dot(selections_mvr, beta == 0) / max(1, selections_mvr.sum()), 2)\n", "print(f\"MVR knockoffs made {selections_mvr.sum()} discoveries.\")\n", "print(f\"MVR knockoffs had a power of {power}% and false positive rate of {fdp}%.\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }