---
title: "Introduction to mash: data-driven covariances"
author: "Matthew Stephens"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{mashr intro with data-driven covariances}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,comment = "#",fig.width = 5,
                      fig.height = 4,fig.align = "center",
                      eval = FALSE)
```

## Goal

This vignette introduces the important topic of using data-driven
covariance matries in `mashr`. You should read [the introductory
vignette](intro_mash.html) before this.

## Outline

Recall the four major steps in a mash analysis:

- Read in the data

- Set up the covariance matrices to be used

- Fit the model

- Extract posterior summaries and other quantities of interest

Here we pay more attention to Step 2, which can be split into
two steps:

- Set up the "canonical" covariance matrices.

- Set up the "data-driven" covariance matrices.

The first of these steps ("canonical" covariance matrices) is
straightforward using `cov_canonical`, as illustrated in
[the introductory vignette](intro_mash.html).

Setting up the data-driven matrices is slightly more complex, and
there are multiple possible approaches. This vignette illustrates one
approach.

First we simulate some data for illustration (see the
[the introductory vignette](intro_mash.html) for more details):

```{r}
library(ashr)
library(mashr)
set.seed(1)
simdata = simple_sims(500,5,1)
```

## Data-driven covariances

Although the user is free to set up data driven covariance matrices
using any method that they would like, typically we suggest the
following three-step strategy:

1. Locate some strong signals. For example, here we do this within
R by running a "condition-by-condition" using `mash_1by1`, but you could
do this as a preprocessing step in other software - for example
in eQTL studies you might instead put together a separate dataset
containing the test results for only the "top" eQTL or eQTLs in each gene.

2. Use methods such as PCA or factor analysis on these top signals to
compute some initial data-driven covariance matrices (e.g. using
`cov_pca`).

3. Use these data-driven covariance matrices as initializations for
the extreme deconvolution (ED) algorithm (using `cov_ed`), to get some
refined data-driven covariance matrix estimates.

The ED step is helpful primarily because the second step will often
estimate the covariances of the observed data (Bhat) whereas what is
required is the covariances of the actual underlying effects (B), and
this is what ED estimates. However, ED can be quite sensitive to
initialization, and so the goal of the second step is to provide a
good initialization.

### Step 1: select strong signals

If your entire data set (matrix of *all* tests in all conditions) is 
not too big (eg <100k tests say) then you can simply do this by
setting up the entire data set as a mash data object 
(using `mash_set_data`), and then 
running a condition-by-condition (1by1)
analysis on all the data. 
For example, here we select the strong signals as those with lfsr<0.05 in
any condition in the 1by1 analysis.

```{r}
data   = mash_set_data(simdata$Bhat, simdata$Shat)
m.1by1 = mash_1by1(data)
strong = get_significant_results(m.1by1,0.05)
```
This sets up a vector `strong` containing the indices corresponding
to the "significant" rows of the test results in `data`. This
vector is used in subsequent functions below to specify which
rows of `data` to use.

### Step 2: Obtain initial data-driven covariance matrices

There are various approaches to this; here we just illustrate the
simplest and quickest (but probably not the best), which is to use
PCA. Specifically here we use the function `cov_pca` to produce
covariance matrices based on the top 5 PCs of the strong signals. The
result is a list of 6 covariance matrices: one based on all 5 PCs, and
the others each based on one PC.

```{r}
U.pca = cov_pca(data,5,subset=strong)
print(names(U.pca))
```

### Step 3: Apply Extreme Deconvolution

The function `cov_ed` is used to apply the ED algorithm from a
specified initialization (here `U.pca`) and to a specified subset of
signals.

```{r}
U.ed = cov_ed(data, U.pca, subset=strong)
```

## Run mash

After all this we are ready to run `mash` using the data driven
matrices. Remember the Crucial Rule that we must fit mash to *all*
tests - do not use only the `strong` subset here!!

```{r}
m.ed = mash(data, U.ed)
print(get_loglik(m.ed),digits = 10)
```

From the fit we see that the fit using the data-driven covariances is
not as good as when we used the canonical covariances (which was
-16120.32; from [the introductory vignette](intro_mash.html)). This is
expected for this simulation because the simulation actually used the
canonical covariances!

In general we recommend running `mash` with both data-driven and
canonical covariances. You could do this by combining the data-driven
and canonical covariances as in this code:

```{r}
U.c = cov_canonical(data)  
m   = mash(data, c(U.c,U.ed))
print(get_loglik(m),digits = 10)
```

For an example with simulations that do not follow the standard
canonical matrices see [here](simulate_noncanon.html).

## Session information.

```{r info}
print(sessionInfo())
```