---
title: "Introduction to the `panelr` package"
author: "Jacob A. Long"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Introduction to the `panelr` package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
required <- c("clubSandwich", "geepack")
do_eval <- all(sapply(required, requireNamespace, quietly = TRUE))
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "",
  message = FALSE,
  eval = do_eval
)
```


The `panelr` package contributes two categories of things:

1. A `panel_data` object and some tools to create/manipulate them.
2. A series of regression modeling functions for panel data.

# `panel_data` frames

Check out the other vignette for a lot of detail on how to take your raw data
and reshape it into a `panel_data` format. Here's a short version, using some
example data provided by this package.

```{r include = FALSE}
library(panelr)
data("teen_poverty")
teen_poverty
```

```{r echo = FALSE}
library(panelr)
data("teen_poverty")
teen_poverty
```

These data come from a subset of young women surveyed as part of the 
National Longitudinal Survey of Youth starting in 1979. The `teen_poverty`
data come in "wide" format, meaning there is one row per respondent and 
each of the repeated measures is in a separate column for each wave.

We need to convert this to "long" format, in which you have one row for each
respondent in each wave of the 5-wave survey. We'll use `long_panel()` for 
that.

```{r}
teen <- long_panel(teen_poverty, begin = 1, end = 5, label_location = "end")
teen
```

Now we have a `panel_data` object! It is a special version of a `tibble`, 
which is itself a special kind of `data.frame`. `panel_data` objects 
work very hard to make sure you never accidentally drop the variables that 
are the identifiers for each respondent and the indicators for which wave the
row corresponds to. `panel_data` objects also try to stay in order by ID and
wave. 

Note that if your raw data are already in long format, you can use the 
`panel_data()` function to convert them to `panel_data` format.

```{r}
data("WageData")
wages <- panel_data(WageData, id = id, wave = t)
```


`panel_data` frames are designed to work with `tidyverse` packages, 
particularly `dplyr`. When used inside `mutate()`, functions like `lag()` work 
properly by taking the previous value for the specific respondent. If you ever
need to do something that is easier to do with a "regular" data frame, you can
just use the `unpanel()` function to convert the `panel_data` frame back to 
normal.

# Regression models

## Within-between models

The original motivation to create this package was to automate the process of
fitting "within-between" models, sometimes called "between-within" or 
"hybrid" models (see Allison, 2009; Bell & Jones, 2015). These combine the
benefits of what econometricians call "fixed effects" models — robustness to
time-invariant confounding chief among them — as well as what they call
"random effects" models, which allow the inclusion of time-invariant 
coefficients. Within-between models include coefficients that are identical
to the fixed effects equivalent, but the flexibility to also include the random
effects and other time-invariant predictors (this was noticed by Mundlak, 1978). 
They are fit via multilevel models which allow for some other nice possibilities
like inclusion of random slopes and generalized linear model specifications.

From here, I'll give a somewhat technical description of these models. If you
just want to look at how to estimate them in R, skip ahead to the next 
mini-section.

Note that fixed effects models can be fit using individual demeaning. That is,
you can subtract the entity's own mean for each predictor and the dependent 
variable and fit a model via OLS that is equivalent to the so-called least 
squares dummy variable approach (in which dummy variables for every entity ID 
are included as predictors).

Let's get a bit more technical. We have entities $i = 1, ..., n$ who are 
measured at times $t = 1, ..., T$. We have as our dependent variable $y_{it}$,
the variable $y$ for individual $i$ at time $t$. We have predictors that 
vary over time $x_{it}$, variables that do not vary over time $z_i$, and
variables we did not measure that do not vary over time $\alpha_i$ as well
as random error $\epsilon_{it}$

The fixed effects model, then, looks like this:

$$
y_{it} = \mu_t + \beta_1x_{it} + \gamma z_i + \alpha_i + \epsilon_{it}
$$

Although $\alpha_i$ is not observed, it can be estimated by including a dummy
variable for each $i$. The $\gamma$ is undefined because the $z_i$ are perfectly
collinear with the $\alpha_i$ dummy variables.


The individual-mean-centered version of the fixed effects models is based on
calculating a mean of $y$ and $x$ for each $i$ — so $\bar{y_i}$ and $\bar{x_i}$ 
and subtracting it from each $y_{it}$ and $x_{it}$. The model can be 
expressed like this, including $\bar{z_i}$ and $\bar{\alpha_i}$ for 
demonstration:

$$
y_{it} - \bar{y_i} = \mu_t + \beta_1(x_{it} - \bar{x_i}) + 
(z_i - \bar{z_i} = 0) + 
(\alpha_i - \bar{\alpha_i} = 0) + (\epsilon_{it} - \bar{\epsilon_i})
$$

By de-meaning everything, all the time-invariant variables drop out:

$$
y_{it} - \bar{y_i} = \mu_t + \beta_1(x_{it} - \bar{x_i}) + (\epsilon_{it} - \bar{\epsilon_i})
$$

This is often called the "within" estimator. You can take these de-meaned
variables and fit an OLS regression and get valid estimates (with some 
adjustments to the standard errors).

You can also do something slightly different and get the same results with
multilevel models. Take this, for example:

$$
y_{it} = \beta_{0i} + \beta_1(x_{it} - \bar{x_i}) + (\epsilon_{it} -
\bar{\epsilon_i})
$$

Where $\beta_{0i}$ is a random intercept estimated for each $i$. This is 
equivalent to subtracting $\bar{y_i}$ in terms of the estimation of $\beta_1$.
But in the multilevel modeling framework, we can include those time-invariant
$z_i$ as well. Conceptually, they are basically being included in a model
predicting $\beta_{0i}$:

$$
\beta_{0i} = \beta_0 + \gamma z_i + u_{0i}
$$

Where $u_{0i}$ is the random error of the model predicting $\beta_{0i}$.

In fact, we can include the $\bar{x_i}$ in our multilevel model as well and 
they are used just like the $z_i$:

$$
\beta_{0i} = \beta_0 + \beta_2 \bar{x_i} + \gamma z_i +  u_{0i}
$$

Now we can substitute into the previous multilevel equation and we have our
within-between model:

$$
y_{it} = \beta_{0} + \beta_1(x_{it} - \bar{x_i})  
+ \beta_2 \bar{x_i} + \gamma z_i +  u_{0i} + \epsilon_{it}
$$

The $\beta_1$ has the same interpretation as in the fixed effects model, these
are the effects of within-entity deviations of $x$ on within-entity deviations
of $y$. The $\beta_2$ is basically predicting the $\bar{y_i}$, however, so 
these coefficients are helpful for predicting differences in mean levels across
entities. The same is true for the $z_i$.

A similar model that I call the "contextual" model because this is how it is
often interpreted (see, e.g., Raudenbush & Bryk, 2002). Here we do not demean
the $x_i$:

$$
y_{it} = \beta_{0} + \beta_1 x_{it} 
+ \beta_2 \bar{x_i} + \gamma z_i +  u_{0i} + \epsilon_{it}
$$

Believe it or not, the $\beta_1$ is unchanged in this model; it is the $\beta_2$
that changes. The interpretation of $\beta_2$ becomes a the *difference* 
between the within- and between-entities effects. A significant coefficient for
$\beta_2$ means significant differences between the within- and between-entity
effects. For those who are familiar, this is like a variable-by-variable Hausman
test. Substantively, $\beta_2$ is often interpreted as a *contextual* effect.

From this framework, we can do cross-level interactions, random slopes, 
generalized linear models, and all kinds of interesting stuff.

### A note on interactions

In the fixed effects framework, it is generally considered wrong to 
operationalize an interaction between two time-varying variables (let's call
them $w$ and $x$) by taking 
the product of their individual-demeaned forms. That is, you are **not**
supposed to generate the interaction term $xw_{it}$ by doing this:

### A note on splines and other basis expansions

`wbm()` supports common basis expansion functions in the time-varying part of
the formula that return multiple columns, including `splines::ns()`,
`splines::bs()`, and `stats::poly()`.

For a nonlinear term like `ns(exp, df = 3)`, `panelr` expands the term into
multiple within and between columns:

* **Within component**: compute spline bases on the within-person deviations
  `exp_it - expbar_i`, and then de-mean each resulting basis column within
  person (a double-demean analogue for nonlinear terms).
* **Between component**: compute spline bases on the person means `expbar_i`.

This approach avoids the error you would get if spline knots were chosen within
each person due to grouped `mutate()`.

```{r, eval = do_eval}
library(panelr)
library(splines)
data("WageData")
wages <- panel_data(WageData, id = id, wave = t)

mod_spline <- wbm(lwage ~ ns(exp, df = 3) | blk, data = wages)
mod_spline
```

Note: for `bs()`, after demeaning some values may fall outside the original
boundary knots, which can produce a warning about ill-conditioned bases.

$$
xw_{it} = (x_{it} - \bar{x_{i}}) \times (w_{it} - \bar{w_i})
$$

Instead, the conventional wisdom goes, you should first take the product of the
observed variables and subtract the individual-level mean of that product,
like so:

$$
xw_{it} = x_{it}w_{it} - \overline{xw}_i
$$

Where $\overline{xw}_i$ can also be expressed as $\frac{\sum_{t=1}^{T_i}{x_{it}w_{it}}}{T_i}$, the sum of all products for each
$i$ divided by the number of time points for each $i$, $T_i$.

[Giesselmann and Schmidt-Catran (2020)](https://doi.org/10.1177/0049124120914934)
show that this conventional method for generating $xw_{it}$ does not have the
unbiasedness that the individual terms do. I'll leave it to them to explain
why exactly this is, but the solution is to start with the first, wrong 
version of $xw_{it}$, which I'll call $xw_{it}^*$, and subtract *its* mean too:

$$
xw_{it}^* = (x_{it} - \bar{x_{i}}) \times (w_{it} - \bar{w_i}) \\
xw_{it} = xw_{it}^* - \overline{xw_i^*}
$$

I call this the "double-demeaning" approach to interactions, in contrast to 
the one-time demeaning in the conventional approach. By default, `wbm()` 
calculates interactions via the double-demeaning method. You can change this
via the `interaction.style` argument if you need your results to match other
software.

## Fitting within-between models

The workhorse function for within-between models is `wbm()`, which is built on 
top of `lme4`'s `lmerMod()` and `glmerMod()`. It is not so hard to understand
how to treat your data to estimate within-between models, but the programming
can be a challenge to those who aren't skilled with R (or whatever else they
might use) and is error-prone in any case.

The main thing to know in order to use `wbm()` is how the model formula
works, because it's a little different from your typical regression model.
It is split into up to 3 parts, each for a different kind of variable. Each
part is separated by a `|`. The pattern is like this:

```
dependent ~ time_varying | time_invariant | cross_lev_interactions + (random_slopes | id)
```

So you start with your dependent variable on the left-hand side like normal and
then what comes next are variables that vary over time. You will only get 
within-entity estimates for these variables. Next are time-invariant variables;
the between-entity terms for the time-varying variables are added automatically
so no need to try to include them here. Finally, in the third part you can
specify cross-level interactions (i.e., within-entity by
between-entity/time-invariant) as well as additional random effects terms 
using the `lme4`-style syntax. By default, `(1 | id)` (or whatever the ID 
variable is) is added internally for a random intercept so you do not need to
include it yourself. 

Let's walk through an example with the `wages` data we looked at briefly 
earlier. We'll predict the logarithm of wages (`lwage`) using weeks worked
(`wks`), union membership (`union`), marital status (`ms`), 
blue (vs. white) collar job status (`occ`), black race (`blk`), and 
female sex (`fem`).

```{r}
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem, data = wages)
summary(model)
```

As you can see, the output distinguishes within- and between-entity effects.
When you see `imean()` around a variable, that is the between-entity effect
represented as the individual mean.

Here, we see there seems to be a wage penalty for switching from white collar
to blue collar work (`occ`) and although married people earn more (`imean(ms)`),
just becoming married (`ms`) coincides with a drop in earnings. We also see a
boost in earnings from joining a union (`union`).

Maybe we think the timing of the marriage effect is off and the true effect 
occurs the time period after a person becomes married. We can ask for the 
lagged effect using `lag()`.

```{r}
model <- wbm(lwage ~ wks + union + lag(ms) + occ | blk + fem, data = wages)
summary(model)
```

Well that doesn't change the direction of the estimate, but it also moved 
it sufficiently close to 0 that we can't say much about it one way or another.

Keep in mind that you do not have to stick to linear models. Using the `family`
argument (just like `glm()`), you can estimate logit (`family = binomial`),
probit (`family = binomal(link = "probit")`), poisson (`family = poisson`), or
other model families and links as needed.

### Growth curves

Now maybe we want to include an effect of time since wages tend to go up for 
everyone, on average, over time. We can just include the time variable in the
formula or set `use.wave` to `TRUE`.

```{r}
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem, data = wages, use.wave = TRUE)
summary(model)
```

Including `t` wipes out some of those previously observed effects. Believe it
or not, we just fit a growth curve model!

Now, we might think people have different trajectories. We can include that as
a random slope, which will go in the third part of the formula.

```{r}
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem | (t | id), use.wave = TRUE, data = wages)
summary(model)
```

And now we have a latent growth curve model. The general effect on the other
coefficients is more uncertainty and attenuated estimates. It's worth 
keeping in mind that it is sometimes wrong to use a growth curve model like this
if you think the variables in your model *cause* the time trend; if you think
wages are going up because more people are moving into white collar work, then
including the growth curve will make it harder for you to see the true effect
of `occ`.

### Contextual, within, and random effects specifications

By default, `wbm()` does as the name suggests. But if you'd rather have the 
contextual model described earlier, in which the means are not subtracted from
the time varying variables, that's an option too.

```{r}
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem, data = wages, model = "contextual")
summary(model)
```

Now the individual means have a new interpretation as the difference in effect
compared to the within-entity estimates.

If you don't want to use any of the time-invariant variables, you can also
just ask for the "within" estimator:

```{r}
model <- wbm(lwage ~ wks + union + ms + occ, data = wages, model = "within")
summary(model)
```

This can help declutter your output when you really just don't care about the
between-subjects effects.

## Using GEE to fit within-between models

You don't have to estimate these models using multilevel models and in 
fact you may get better inferences by avoiding some of the assumptions inherent
to multilevel modeling (see McNeish, 2019). You can use the semiparametric
generalized estimating equations (GEE) approach to estimation, with the main
tradeoff being that you can no longer use random slopes or anything like that.
But if you only care about the average effects across all entities, GEE can
be a better approach that doesn't require you to be right about the distribution
of effects and several other assumptions.

`wbgee()` builds on `geeglm()` from the `geepack` package and works just like
`wbm()`.

```{r}
model <- wbgee(lwage ~ wks + union + ms + occ | blk + fem, data = wages)
summary(model)
```

This gives us more conservative estimates, in general. Note that by default,
`wbgee()` uses an AR-1 working error correlation structure in estimation. 
This makes sense in general but at times it may make sense to use 
"exchangeable" as the argument to `cor.str` which assumes all within-entity
correlations are equal regardless of time lag. Other options include 
"unstructured", which can be very computationally intensive, and "independence,"
assuming no correlation within entities.

Like `wbm()`, you can do generalized linear models via the `family` argument.
It is for these generalized linear models that GEEs are likely to stand out the
most in terms of added benefit above and beyond the multilevel models, although
this is not a well-tested question to my knowledge.

## Asymmetric effects

Sometimes, theory may suggest that increases in a variable have a different
effect than decreases in a variable. For instance, getting married and
getting divorced are probably not equivalent (in the sense that one is the
exact opposite of the other) in their effects on other outcomes. Allison (2019)
described a method for estimating models with asymmetric effects based on
first differences.

First, you take first differences:

$$
y_{it} - y_{it-1} = (\mu_t - \mu_{t-1}) + \beta(x_{it} - x_{it -1}) +
(\epsilon_{it} - \epsilon_{it-1})
$$

We need a slightly different model for asymmetric effects in which we 
decompose the differences into positive and negative variables.

Our asymmetric effects model will be:

$$
y_{it} - y_{it-1} = (\mu_t - \mu_{t-1}) + \beta^+x_{it}^+ + \beta^-x_{it}^- +
(\epsilon_{it} - \epsilon_{it-1})
$$

Where 

$$
x_{it}^+ = x_{it} - x_{it -1} \text{ if } (x_{it} - x_{it -1}) > 0,
\text{otherwise } 0 \\
x_{it}^- = -(x_{it} - x_{it -1}) \text{ if } (x_{it} - x_{it -1}) < 0,
\text{otherwise } 0
$$

In other words, if the difference is positive, it becomes part of the 
$x_{it}^+$ and if it is negative, it is multiplied by -1 to be made positive 
and is made part of the $x_{it}^-$ variable. 
If the effects are symmetric, $\beta^+ = -\beta^-$. 

After fitting the model via GLS, we can then do a test of the contrasts of 
the $\beta^+$ and $\beta^-$ coefficients as a formal way to assess the presence
of asymmetric effects.

Here's how it works with the `panelr` function, `asym()`.

```{r}
model <- asym(lwage ~ ms + occ + union + wks, data = wages)
summary(model)
```

As you can see, in a model comparable to our within-between model from earlier,
the effects seem quite symmetric.

Let's look at the `teen` data from earlier, where `spouse` indicates whether
the respondent is living with a spouse, `inschool` indicates whether the 
respondent is enrolled in school, and `hours` is the hours worked in the week
of the survey.

```{r}
summary(asym(hours ~ spouse + inschool, data = teen))
```

Here we see an asymmetric effect of marriage: gaining a spouse corresponds with
fewer hours worked, but there's no effect on work hours when a spouse is lost.
You can see in the lower table that this difference in coefficients is 
associated with a fairly low *p* value. There is only weak evidence of an
asymmetric effect for entering/leaving school.

### Asymmetric effects for generalized linear models

The downside to the first differences method is that it does not generalize
to non-continuous dependent variables — you can't run a logit model with a 
differenced binary outcome. Allison (2019) showed that you can do a modified
form for such situations. 

Instead of including the $x_{it}^+$ and $x_{it}^-$ as predictors, you instead
create new variables $z_{it}^+$ and $z_{it}^-$ that are the cumulative sum
of all differences prior to time $t$. 

$$
z_{it}^+ = \sum_{s = 1}^{t}{x_{is}^+} \\
z_{it}^- = \sum_{s = 1}^{t}{x_{is}^-} \\
$$

Note that at $t = 1$, both are set to 0. I'll leave the details as to *why*
this works to the manuscript, but he shows that we're left with the following
equation:

$$
y_{it} = \mu_t + \beta^+ z_{it}^+ + \beta^-z_{it}^- + \alpha_i + \epsilon_{it}
$$

So we can treat this like a fixed effects model in which we just need to address
the $\alpha_i$. For situations like this that call for a conditional logit, as
Allison used in his paper, another option is the GEE with logit link. 

Let's try with the `teen` data, which also appears in Allison (2019). Here our
outcome variable is `pov`, poverty, and there's a new predictor, `mother`, an
indicator for whether the respondent has ever had any children.

```{r message = TRUE}
model <- asym_gee(pov ~ mother + spouse + inschool + hours, data = teen, family = binomial(link = "logit"), 
                  use.wave = TRUE, wave.factor = TRUE)
summary(model)
```

The results are broadly similar in terms of coefficient estimates to those 
obtained by Allison. Unlike Allison, we do not have good evidence of an 
asymmetric effect in the case of `spouse` but we do have one in the case of
`hours`. Note that `mother` never goes down so the negative version of this
variable is dropped from the model with a message. To match Allison, I also
used `use.wave` to include the wave variable and `wave.factor` to make it 
a factor variable.

# References

Allison, P. D. (2009). Fixed effects regression models. 
Thousand Oaks, CA: SAGE Publications. 
https://doi.org/10.4135/9781412993869.d33

Allison, P. D. (2019). Asymmetric fixed-effects models for panel data. 
*Socius*, *5*, 1–12. https://doi.org/10.1177/2378023119826441

Bell, A., & Jones, K. (2015). Explaining fixed effects: Random effects 
modeling of time-series cross-sectional and panel data. 
*Political Science Research and Methods*, *3*, 133–153.
https://doi.org/10.1017/psrm.2014.7

Giesselmann, M., & Schmidt-Catran, A. W. (2020). Interactions in fixed effects
regression models. *Sociological Methods & Research*, 1–28.
https://doi.org/10.1177/0049124120914934

McNeish, D. (2019). Effect partitioning in cross-sectionally clustered data
without multilevel models. *Multivariate Behavioral Research*, Advance online
publication. https://doi.org/10.1080/00273171.2019.1602504