---
title: "Analysis of multivariate binomial data: family analysis"
author: Klaus Holst & Thomas Scheike
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
    fig_width: 7.15
    fig_height: 5.5    
vignette: >
  %\VignetteIndexEntry{Analysis of multivariate binomial data: family analysis} 
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(mets)
```

Overview 
=============

When analysing multivariate binomial data with the aim of learning about the
dependence present, possibly after adjusting for covariates, many
models and methods are available:

   *  Random-effects logistic regression models covered elsewhere (`glmer` in `lme4`).

In the mets package you can fit the

   *  Pairwise odds ratio model

   *  Bivariate Probit model 
      + With random effects
      + Special functionality for polygenic random effects modelling 
        such as ACE, ADE ,AE and so forth.

   *  Additive gamma random effects model 
      + Special functionality for polygenic random effects modelling 
        such as ACE, ADE ,AE and so forth.

These last three models are all fitted in the mets package using composite 
likelihoods based on pairs within clusters. The models can be fitted specifically based 
on specifying which pairs one wants to use for the composite score. 

The models are described in further details in the binomial-twin vignette. 


Simulated family data
====================

We start by simulating family data with an additive gamma structure
on ACE form.  Here 10000 families consisting of two parents and
two children. The response is ybin and there is one covariate x. 

```{r}
 library(mets)
 library(timereg)
 set.seed(100)
 data <- sim_binClaytonOakes_family_ace(500,2,1,beta=NULL,alpha=NULL)
 data$number <- c(1,2,3,4)
 data$child <- 1*(data$number==3)
 head(data)
```

We fit the marginal models, and here find a covariate effect at 0.3 for x. The marginals can be 
specified as desired.

```{r}
 aa <- margbin <- glm(ybin~x,data=data,family=binomial())
 summary(aa)
```

Additive Gamma model 
====================

For the Additive Gamma model we set-up the random effects
included in such a family to make the ACE valid using some
special functions for this. 

The model is constructed with one environmental effect shared by
all in the family and 8 genetic random effects with size 1/4 genetic
variance. Looking at the first family we see that the mother and
father both share half the genes with the children and that the
two children also share half their genes with this specification.
Below we also show an alternative specification of this model
using all pairs. 

```{r}
# make ace random effects design
out <- ace_family_design(data,member="type",id="cluster")
out$pardes
head(out$des.rv,4)
```


We can now fit the model calling the two-stage function

```{r}
# fitting ace model for family structure
ts <- binomial_twostage(margbin,data=data,clusters=data$cluster,
theta=c(2,1),random.design=out$des.rv,theta.des=out$pardes)
summary(ts)
# true variance parameters
c(2,1)
# total variance 
3
```


Pairwise fitting 
=================

We now specify the same model by extracting all pairs.
The random effects structure is typically simpler when looking at pairs only, but we start by
specifying the random effects jointly for the whole family.
A special function writes up all combinations of pairs.
There are 6 pairs within each family, and we keep track of
which family each belongs to. We first simply supply the
pairs and should get the same result as before.


```{r}
mm <- familycluster_index(data$cluster)
head(mm$familypairindex,n=20)
pairs <- mm$pairs
dim(pairs)
head(pairs,12)
```

Now with the pairs we fit the model 

```{r}
tsp <- binomial_twostage(margbin,data=data,clusters=data$cluster,theta=c(2,1),detail=0,
        random.design=out$des.rv,theta.des=out$pardes,pairs=pairs)
summary(tsp)
```

Here a random sample of pairs are given instead and we get other estimates. 

```{r}
set.seed(100)
ssid <- sort(sample(1:nrow(pairs),nrow(pairs)/2))
tsd <- binomial_twostage(aa,data=data,clusters=data$cluster,
               theta=c(2,1),step=1.0,
               random.design=out$des.rv,iid=1,Nit=10,
  	           theta.des=out$pardes,pairs=pairs[ssid,])
summary(tsd)
```

To specify such a model when only the pairs are available we show
how to specify the model. We here use the same marginal "aa"
to make the results comparable. 
The marginal can also be fitted based on available data. 

We start by selecting the data related to the pairs, setting up new ids, and
specifying the model using the full design with 9 random effects.
Below we also show how to use only the random effects needed for each
pair, which is typically simpler.


```{r}
head(pairs[ssid,])
ids <- sort(unique(c(pairs[ssid,])))

pairsids <- c(pairs[ssid,])
pair.new <- matrix(fast.approx(ids,c(pairs[ssid,])),ncol=2)
head(pair.new)

dataid <- dsort(data[ids,],"cluster")
outid <- ace_family_design(dataid,member="type",id="cluster")
outid$pardes
head(outid$des.rv)
```

Now fitting the model with the data set up 

```{r}
aa <- glm(ybin~x,data=dataid,family=binomial())
tsdid <- binomial_twostage(aa,data=dataid,clusters=dataid$cluster,
         theta=c(2,1),random.design=outid$des.rv,theta.des=outid$pardes,pairs=pair.new)
summary(tsdid)
```

We now specify the design specifically using the pairs. 
The random.design and design on the parameters
are now given for each pair, as a 3 dimensional matrix. 
with a direct specification of random.design and the 
design on the parameters theta.design.
In addition we need also to give the number of random effects for
each pair. These basic things are constructed by certain functions
for the ACE design. 


```{r}
pair.types <-  matrix(dataid[c(t(pair.new)),"type"],byrow=T,ncol=2)
head(pair.new,7)
head(pair.types,7)
###
theta.des  <- rbind( c(rbind(c(1,0),  c(1,0),  c(0,1),  c(0,0))),
		     c(rbind(c(0.5,0),c(0.5,0),c(0.5,0),c(0,1))))
random.des <- rbind( 
        c(1,0,1,0),c(0,1,1,0),
        c(1,1,0,1),c(1,0,1,1))
mf <- 1*(pair.types[,1]=="mother" & pair.types[,2]=="father")
##          pair, rv related to pairs,  theta.des related to pair 
pairs.new <- cbind(pair.new,(mf==1)*1+(mf==0)*3,(mf==1)*2+(mf==0)*4,(mf==1)*1+(mf==0)*2,(mf==1)*3+(mf==0)*4)
```

`pairs.new` is a matrix with

 * columns 1:2 giving the indices of the data points

 * columns 3:4 giving the indices of the `random.design` for the different pairs

 * column 5 giving the indices of `theta.des` written as rows

 * column 6 giving the number of random variables for this pair


The length of all rows of theta.des
are the maximum number of random effects $\times$ the number of parameters. 
These two numbers are given in the call. In this case 4 $\times$ 2. 
So theta.des has rows of length $8$, possibly including some 0's for rows
not relevant due to fewer random effects, as is the case here for pairs 
that do not share genetic effects.


For pair 1 that is a mother/father pair, we see that they share 
1 environmental random effect of size 1. There are also two genetic
effects that are unshared between the two. 
So a total of 3 random effects are needed here. The theta.des relates the 3 
random effects to possible relationships in the parameters. Here the genetic
effects are full and so is the environmental effect. 
In contrast we also consider a mother/child pair that share half the genes, now
with random effects with (1/2) gene variance.  We there need 4 random effects,
2 non-shared half-gene, 1 shared half-gene, and one shared full environmental 
effect. 


```{r}
# 3 rvs here 
random.des
theta.des

head(pairs.new)
```

Now fitting the model, and we see that it is a lot quicker due to the
fewer random effects needed for pairs.  We need to also specify the number of
parameters in this case. 

```{r}
tsdid2 <- binomial_twostage(aa,data=dataid,clusters=dataid$cluster, theta=c(2,1),
           random.design=random.des,theta.des=theta.des,pairs=pairs.new,dim.theta=2)
summary(tsdid2)
```

The same model can be specifed even simpler via the kinship
coefficient.  For this specification there are 4 random effects for
each pair, but some have variance 0. The mother-father pair, here
shares a random effect with variance 0, and have two non-shared genetic 
effects with full variance, in addition to a fully shared environmental 
effect.

```{r}
kinship  <- rep(0.5,nrow(pair.types))
kinship[pair.types[,1]=="mother" & pair.types[,2]=="father"] <- 0
head(kinship,n=10)

out <- make_pairwise_design(pair.new,kinship,type="ace") 
```

```{r}
tsdid3 <- binomial_twostage(aa,data=dataid,clusters=dataid$cluster,
             theta=c(2,1)/9,random.design=out$random.design,
             theta.des=out$theta.des,pairs=out$new.pairs,dim.theta=2)
summary(tsdid3)
```


Pairwise dependence modelling 
------------------------------


```{r}
library(mets)
set.seed(1000)
data <- sim_binClaytonOakes_family_ace(500,2,1,beta=NULL,alpha=NULL)
head(data)
data$number <- c(1,2,3,4)
data$child <- 1*(data$number==3)

mm <- familycluster_index(data$cluster)
head(mm$familypairindex,n=20)
pairs <- mm$pairs
dim(pairs)
head(pairs,12)
```


```{r}
 dtypes <- interaction( data[pairs[,1],"type"], data[pairs[,2],"type"])
 dtypes <- droplevels(dtypes)
 table(dtypes)
 dm <- model.matrix(~-1+factor(dtypes))
```

Now with the pairs we fit the model 

```{r}
aa <- glm(ybin~x,data=data,family=binomial())

tsp <- binomial_twostage(aa,data=data, clusters=data$cluster,
		 theta.des=dm,pairs=cbind(pairs,1:nrow(dm)))
summary(tsp)
```


Pairwise odds ratio model 
=========================

To fit the pairwise odds-ratio model in the case of a pair-specification there 
are two options for fitting the model. 

1. One option is to set up some artificial data similar to twin data with
  + a pair-cluster-id  (clusters) 
  + with a cluster-id to get GEE type standard errors (se.cluster) 
2. We can also use the specify the design via the theta.des that is also a 
   matrix of dimension pairs x design with the design for POR model.  


Starting by the second option. We need to start by specify the design of
the odds-ratio of each pair. We set up the data and find all combinations 
within the pairs. Subsequently, we remove all the empty groups, by grouping
together the factor levels 4:9, and then we construct the design. 

```{r}
tdp <-cbind( dataid[pair.new[,1],],dataid[pair.new[,2],])
names(tdp) <- c(paste(names(dataid),"1",sep=""),
		paste(names(dataid),"2",sep=""))
tdp <-transform(tdp,tt=interaction(type1,type2))
dlevel(tdp)
drelevel(tdp,newlevels=list(mother.father=4:9)) <-  obs.types~tt
dtable(tdp,~tt+obs.types)
tdp <- model.matrix(~-1+factor(obs.types),tdp)
```

We can then fit the pairwise model using the pairs and the pair-design for
describing the OR. The results are consistent with the ACE model:
mother-father pairs have a lower dependence since it is driven only by environmental
effects. All other combinations appear to have the same dependence, as expected.

To fit the OR model it is generally recommended to use the var.link to
use the parametrization with log-odd-ratio regression.


```{r}
###porpair <- binomial.twostage(aa,data=dataid,clusters=dataid$cluster,
###           theta.des=tdp,pairs=pair.new,model="or",var.link=1)
###summary(porpair)
```


SessionInfo
============

```{r}
sessionInfo()
```