{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MRC Knockoffs Primer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a set of $p$ features $X = (X_1, \\dots, X_p)$, how should one generate knockoffs $\\tilde{X}$? A \"conventional\" method is to minimize the mean absolute correlation, defined as \n",
    "$$\\mathrm{MAC} = \\sum_{j=1}^p |\\text{Cor}(X_j, \\tilde{X}_j)|. $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As discussed in [Spector and Janson (2020)](https://arxiv.org/abs/2011.14625), MAC-minimizing knockoffs (known as \"SDP knockoffs\") perform poorly when the features are correlated. In particular, minimizing the MAC often makes it possible to *reconstruct* each feature $X_j$ using the other features $X_{-j}$ and the knockoffs $\\tilde{X}$. In other words, SDP knockoffs often ensures that no features contain any unique information, which makes it very hard to determine which features are important and which are unimportant."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A better approach to constructing knockoffs is to *minimizing the reconstructability* (MRC) of each feature $X_j$. Intuitively, this framework maximizes the amount of unique information each feature contains, making it easier to determine which features are important. **MVR knockoffs** are one instantiation of this framework, and as illustrated\n",
    "below, they have much higher power than SDP knockoffs in a variety of settings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 1**: As an example, we consider a case where $X$ is Gaussian and the correlation between each pair of features is $50\\%$. In this situation, SDP knockoffs are provably nearly powerless to select any features. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SDP knockoffs made 0.0 discoveries!\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import knockpy as kpy\n",
    "from knockpy.knockoff_filter import KnockoffFilter\n",
    "\n",
    "p = 300\n",
    "n = 600\n",
    "np.random.seed(110)\n",
    "\n",
    "# Covariance matrix of X\n",
    "rho = 0.5\n",
    "Sigma = (1-rho) * np.eye(p) + rho * np.ones((p, p))\n",
    "X = np.random.multivariate_normal(np.zeros(p), cov=Sigma, size=(n,))\n",
    "\n",
    "# Sample y given X\n",
    "beta = kpy.dgp.create_sparse_coefficients(\n",
    "    p=p, sparsity=0.2, coeff_size=1, coeff_dist=\"uniform\"\n",
    ")\n",
    "y = X @ beta + np.random.randn(n)\n",
    "\n",
    "# SDP knockoff filter\n",
    "kfilter_sdp = KnockoffFilter(\n",
    "    fstat='lasso', ksampler='gaussian', knockoff_kwargs={\"method\":\"sdp\"}\n",
    ")\n",
    "selections_sdp = kfilter_sdp.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n",
    "print(f\"SDP knockoffs made {selections_sdp.sum()} discoveries!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MVR knockoffs made 67.0 discoveries.\n",
      "MVR knockoffs had a power of 100.0% and false positive rate of 10.45%.\n"
     ]
    }
   ],
   "source": [
    "# Run knockoff filter\n",
    "kfilter_mvr = KnockoffFilter(fstat='lasso', ksampler='gaussian')\n",
    "selections_mvr = kfilter_mvr.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n",
    "\n",
    "# calculate power and false positive rate\n",
    "power = np.around(100*np.dot(selections_mvr, beta != 0) / max(1, (beta != 0).sum()), 2)\n",
    "fdp = np.around(100*np.dot(selections_mvr, beta == 0) / max(1, selections_mvr.sum()), 2)\n",
    "print(f\"MVR knockoffs made {selections_mvr.sum()} discoveries.\")\n",
    "print(f\"MVR knockoffs had a power of {power}% and false positive rate of {fdp}%.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 2**: As shown below, SDP knockoffs also have low power when X has an AR1 covariance matrix. See [Spector and Janson, 2020](https://arxiv.org/abs/2011.14625) for more discussion on this setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate data where X is Gaussian with an AR1 covariance matrix\n",
    "# where Cov(Xj, X_{j+1}) is sampled from Beta(a,b)\n",
    "# and y | X is a sparse Gaussian linear model\n",
    "np.random.seed(111)\n",
    "data_gen_process = kpy.dgp.DGP()\n",
    "data_gen_process.sample_data(\n",
    "    method='ar1', \n",
    "    a=3,\n",
    "    b=1,\n",
    "    n=650, # number of data-points\n",
    "    p=500, # dimensionality\n",
    "    sparsity=0.1, # number of non-null coefficients\n",
    "    coeff_dist='uniform', # distribution of size of non-null coefficients\n",
    "    coeff_size=1, # size of non-null coefficients\n",
    "    corr_signals=True # non-nulls features are correlated \n",
    ")\n",
    "X = data_gen_process.X\n",
    "y = data_gen_process.y\n",
    "beta = data_gen_process.beta\n",
    "Sigma = data_gen_process.Sigma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SDP knockoffs made 0.0 discoveries!\n",
      "SDP knockoffs had a power of 0.0% and false positive rate of 0.0%.\n"
     ]
    }
   ],
   "source": [
    "# run SDP knockoff filter\n",
    "kfilter_sdp = KnockoffFilter(\n",
    "    fstat='lasso', ksampler='gaussian', knockoff_kwargs={\"method\":\"sdp\"}\n",
    ")\n",
    "selections_sdp = kfilter_sdp.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n",
    "\n",
    "# Calculate power/FDR \n",
    "power = np.around(100*np.dot(selections_sdp, beta != 0) / max(1, (beta != 0).sum()), 2)\n",
    "fdp = np.around(100*np.dot(selections_sdp, beta == 0) / max(1, selections_sdp.sum()), 2)\n",
    "print(f\"SDP knockoffs made {selections_sdp.sum()} discoveries!\")\n",
    "print(f\"SDP knockoffs had a power of {power}% and false positive rate of {fdp}%.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MVR knockoffs made 55.0 discoveries.\n",
      "MVR knockoffs had a power of 94.0% and false positive rate of 14.55%.\n"
     ]
    }
   ],
   "source": [
    "# Run knockoff filter\n",
    "kfilter_mvr = KnockoffFilter(fstat='lasso', ksampler='gaussian')\n",
    "selections_mvr = kfilter_mvr.forward(X=X, y=y, Sigma=Sigma, fdr=0.1)\n",
    "\n",
    "# calculate power and false positive rate\n",
    "power = np.around(100*np.dot(selections_mvr, beta != 0) / max(1, (beta != 0).sum()), 2)\n",
    "fdp = np.around(100*np.dot(selections_mvr, beta == 0) / max(1, selections_mvr.sum()), 2)\n",
    "print(f\"MVR knockoffs made {selections_mvr.sum()} discoveries.\")\n",
    "print(f\"MVR knockoffs had a power of {power}% and false positive rate of {fdp}%.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}