---
title: "Vignette of R package kko"
#date: "Oct 23, 2021"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette of R package kko}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
This package provides a kernel knockoffs selection procedure, dubbed KKO, for the nonparametric additive model.
The procedure integrates three key components: knockoffs, subsampling for stability, and random feature mapping for nonparametric function approximation.
KKO comes with a finite-sample false discovery rate (FDR) control guarantee; see [Dai et al. (2021)][1].




# Generate data
Let us begin by creating some synthetic data. The response is generated from an additive polynomial model.
```{r,eval=FALSE}
library(ggplot2)
library(kko)
library(knockoff)
set.seed(12345)

### generate regression coefficients
p=20 # number of predictors
sig_mag=10 # signal strength
s=5  # sparsity, number of nonzero component functions
reg_coef=c(rep(1,s),rep(0,p-s))  # regression coefficient
reg_coef=reg_coef*(2*(rnorm(p)>0)-1)*sig_mag

### generate response and design
model="poly"
n= 600 # sample size
X=matrix(rnorm(n*p),n,p)   # generate design
X_k = create.second_order(X) # generate knockoff
y=generate_data(X,reg_coef,model) # response
```
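The exact form of the model behind `generate_data` is not shown in this vignette. As a rough illustration of what an additive polynomial model looks like, the sketch below builds a response as a weighted sum of the same polynomial applied to each predictor, plus noise. The component function `f_poly` and all `_demo` names are assumptions made for this example and are not taken from the package.

```r
set.seed(1)
n_demo <- 100
p_demo <- 4
X_demo <- matrix(rnorm(n_demo * p_demo), n_demo, p_demo)
coef_demo <- c(2, -2, 0, 0)  # only the first two components are active

# Hypothetical component function: a cubic polynomial of a single predictor.
f_poly <- function(x) x^3 - x

# Additive model: y = sum_j coef_j * f(X_j) + noise.
signal_demo <- as.vector(apply(X_demo, 2, f_poly) %*% coef_demo)
y_demo <- signal_demo + rnorm(n_demo)
```

Predictors with a zero coefficient contribute nothing to the signal, which is what makes them null variables.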





# Kernel knockoffs selection
We then apply the KKO method to compute importance scores for the variables.
```{r,eval=FALSE}
rkernel="laplacian" # kernel choice
rk_scale=1  # scaling parameter of kernel
rfn_range=c(2,3,4)  # number of random features
cv_folds=15  # folds of cross-validation in group lasso
n_stb=200 # number of subsampling for importance scores 
n_stb_tune=100 # number of subsampling for tuning random feature number
frac_stb=1/2 # fraction of subsample
nCores_para=2 # number of cores for parallelization

### KKO selection 
kko_fit=kko(X,y,X_k,rfn_range,n_stb_tune,n_stb,cv_folds,frac_stb,nCores_para,rkernel,rk_scale)
```
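The `rkernel`, `rk_scale`, and `rfn_range` settings above concern the random feature mapping. How `kko` constructs its features internally is not shown in this vignette, so the following is only a generic sketch of random Fourier features for a one-dimensional Laplacian kernel $\exp(-|x - x'|/\text{scale})$. It relies on the fact that the spectral density of this kernel is a Cauchy distribution. The function `rff_laplacian` and the `_demo` names are hypothetical.

```r
# Generic random Fourier features for k(x, x') = exp(-|x - x'| / scale).
# Not taken from the kko source; all names here are hypothetical.
rff_laplacian <- function(x, n_feat, scale = 1) {
  w <- rcauchy(n_feat, location = 0, scale = 1 / scale)  # spectral draws
  b <- runif(n_feat, 0, 2 * pi)                          # random phases
  phase <- matrix(b, length(x), n_feat, byrow = TRUE)
  # One row per observation, one column per random feature.
  sqrt(2 / n_feat) * cos(outer(x, w) + phase)
}

set.seed(1)
x_demo <- c(0, 0.5)
Z_demo <- rff_laplacian(x_demo, n_feat = 20000)
approx_k <- sum(Z_demo[1, ] * Z_demo[2, ])  # inner product of feature vectors
exact_k <- exp(-abs(x_demo[1] - x_demo[2]))
c(approx = approx_k, exact = exact_k)
```

The inner product of the two feature vectors approximates the kernel value, and the approximation tightens as the number of features grows. In practice far fewer features are used, as with the values in `rfn_range`, trading approximation accuracy for a smaller model.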

The KKO importance scores are the differences in selection frequency between each variable and its knockoff, and range from $-1$ to $1$.
Active variables are expected to have high positive scores (close to one),
while the scores of null variables are expected to stay centered around zero.
```{r,echo=FALSE}
library(kko)
library(knockoff)
library(ggplot2)
load("demo.Rdata")
p=length(kko_fit$importance_score)
```

```{r,fig.width=6,fig.height=4}
reg_coef  # true regression coefficient 

W=kko_fit$importance_score # knockoff importance scores generated by KKO 
W 

mydata=data.frame(W=W,var_group=ifelse(reg_coef!=0,"Active","Null"))
myplot = ggplot(mydata, aes(W, fill = var_group)) +  
  geom_histogram(color = "gray2",binwidth=1/p) + theme_bw()+
  xlab("Importance scores")+ylab("Number of variables")+
  xlim(-1,1)

print(myplot)
```
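To make the definition of the scores concrete, the toy example below computes them from a small matrix of selection indicators. The matrix `sel_demo` is made up for illustration: each row stands for one subsample, the first `p_demo` columns mark whether each original variable was selected, and the last `p_demo` columns do the same for the knockoffs. This layout is an assumption for the sketch and is not the package's internal representation.

```r
# Hypothetical selection indicators (rows: subsamples; columns: the
# original variables followed by their knockoffs).
p_demo <- 3
sel_demo <- rbind(
  c(1, 1, 0,  0, 0, 0),
  c(1, 0, 0,  0, 0, 1),
  c(1, 1, 0,  0, 1, 0),
  c(1, 1, 1,  0, 0, 0)
)

# Selection frequency of each original variable and of each knockoff.
freq_orig <- colMeans(sel_demo[, 1:p_demo])
freq_knock <- colMeans(sel_demo[, (p_demo + 1):(2 * p_demo)])

# Importance score: difference of the two frequencies, always in [-1, 1].
score_demo <- freq_orig - freq_knock
score_demo  # the three scores are 1, 0.5 and 0
```

A variable that is always selected while its knockoff never is gets a score of one, and a variable selected as often as its knockoff gets a score of zero.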



# Knockoff filtering 
We apply the knockoff filter to the KKO importance scores.
The filter computes a threshold on the scores and selects the variables whose scores are at or above that threshold.

```{r}
fdr=0.2 #FDR control level 
thres = knockoff.threshold(W, fdr=fdr) # thresholding on scores by knockoff filter
selected = which(W >= thres) 
selected  # indices of selected variables 
```
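For intuition about what the threshold represents, here is a minimal base R sketch of the knockoff+ rule: among candidate thresholds $t$, take the smallest one for which the estimated false discovery proportion $(1 + \#\{j: W_j \le -t\}) / \max(1, \#\{j: W_j \ge t\})$ is at most the target level. The function name `sketch_threshold` and the toy scores are hypothetical, and the sketch assumes an offset of one; the behavior of `knockoff.threshold` itself should be checked against the `knockoff` package documentation.

```r
# Sketch of the knockoff+ threshold; `sketch_threshold` is a hypothetical name.
sketch_threshold <- function(W, fdr = 0.2, offset = 1) {
  candidates <- sort(c(0, abs(W)))
  # Estimated false discovery proportion at each candidate threshold t.
  fdp_hat <- sapply(candidates, function(t) {
    (offset + sum(W <= -t)) / max(1, sum(W >= t))
  })
  ok <- which(fdp_hat <= fdr)
  # Smallest admissible threshold, or Inf (select nothing) if none exists.
  if (length(ok) > 0) candidates[ok[1]] else Inf
}

# Toy scores: six clearly positive, four near zero.
W_demo <- c(0.9, 0.8, 0.85, 0.95, 0.7, 0.75, 0.1, -0.05, 0.02, -0.1)
thres_demo <- sketch_threshold(W_demo, fdr = 0.2)
selected_demo <- which(W_demo >= thres_demo)
```

With these toy scores the threshold is 0.7 and the first six variables are selected. Negative scores act as a proxy for false positives, so the filter only selects when large positive scores clearly outnumber large negative ones.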


# Reference
1. Xiaowu Dai, Xiang Lyu, and Lexin Li. *Kernel Knockoffs Selection for Nonparametric Additive Models.* arXiv preprint arXiv:2105.11659 (2021).

[1]: https://arxiv.org/abs/2105.11659

