--- title: "Lab 6" author: "9.59 Lab in Psycholinguistics" --- ```{r setup, include=FALSE, cache=FALSE} source("_lab_setup.R") ``` ```{r} library(tidyverse) library(broom) ``` ## Multiple predictors review ```{r, message=FALSE} data_source <- "http://web.mit.edu/psycholinglab/data/" rts <- read_csv(file.path(data_source, "rts.csv")) ggplot(rts) + geom_histogram(aes(x = RTlexdec), bins = 60, color = "white") ``` Now remember when we looked at the histogram, we saw that the RTs aren't quite normally distributed in that there were two peaks. We first tried a linear model with just WrittenFrequency as the predictor. ```{r} ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec)) + geom_point(aes(colour = AgeSubject), alpha = 0.1) + geom_smooth(method = "lm") ``` This does an okay job, but we see there are two groups here. Remember we are evaluating our model by seeing how close the fitted values are to the real data points, and we can get a lot closer if we have different lines for different age. ```{r} ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) + geom_point(alpha = 0.1) + geom_smooth(method = "lm") ``` So we ran a regression model with multiple predictors, just adding them together: ```{r} rts$AgeSubject <- factor(rts$AgeSubject) contrasts(rts$AgeSubject) <- c(-0.5, 0.5) freq_age_lm <- lm(RTlexdec ~ WrittenFrequency + AgeSubject, data = rts) summary(freq_age_lm) ``` ### __Try it yourself...__ - Recreate the expected values of RTlexdec in each age condition for a WrittenFrequency value of 6 from the estimated parameters above. - Double check your values using `augment()`. ```{r, indent=" "} freq_age_results <- tidy(freq_age_lm) intercept <- filter(freq_age_results, term == "(Intercept)")$estimate slope_age <- filter(freq_age_results, term == "AgeSubject1")$estimate slope_freq <- filter(freq_age_results, term == "WrittenFrequency")$estimate old_code <- contrasts(rts$AgeSubject)["old", ] young_code <- contrasts(rts$AgeSubject)["young", ] freq <- 6 # old RTs intercept + freq * slope_freq + old_code * slope_age # young RTs intercept + freq * slope_freq + young_code * slope_age test_data <- data_frame(WrittenFrequency = c(6, 6), AgeSubject = c("old", "young")) augment(freq_age_lm, newdata = test_data) ``` What do our fitted values look like? ```{r} freq_age_aug <- augment(freq_age_lm) ggplot(freq_age_aug, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) + geom_point(shape = 1) + geom_point(aes(y = .fitted, shape = AgeSubject), colour = "blue", size = 2) ``` So you can see that you can have multiple continuous predictors, multiple categorical predictors, or a mix of the two in your regression models. This is part of what makes them so useful. ## Model comparison Let's consider the case of two real-valued predictors. Let's look at the variable FrequencyInitialDiphoneWord which is the frequency of the first two phonemes in a word. Let's say you think that frequency is what determines RT, but then someone else thinks initial diphone frequency is what matters. Let's evaluate that claim. ```{r} diphone_lm <- lm(RTlexdec ~ FrequencyInitialDiphoneWord, data = rts) summary(diphone_lm) ``` We see that initial diphone frequency does predict RT. But what if we include both predictors, initial diphone frequency and written frequency? ```{r} freq_diphone_lm <- lm(RTlexdec ~ WrittenFrequency + FrequencyInitialDiphoneWord, data = rts) summary(freq_diphone_lm) ``` What does this mean? 
When you have two linear predictors and you look for the coefficients that minimize the squared distance to the data points, the best coefficients estimate the effect of variable A while holding variable B constant, and the effect of variable B while holding variable A constant. So this output says, approximately: while holding WrittenFrequency constant, initial diphone frequency has the effect given by its coefficient, which is not significant. And controlling for initial diphone frequency -- holding it constant -- the estimated slope for WrittenFrequency is basically the same slope we found before. This offers some evidence that WrittenFrequency is what matters here, not initial diphone frequency.

You often find yourself in the position where multiple things predict patterns in your dataset, and you want to figure out which one is more basic, or which one causes the others, or which one is real and which one is confounded. This kind of test offers some evidence about that question, but there are lots and lots of caveats to interpreting the p-value this way, and there are better tests to do. The principal thing that makes it hard to interpret is that the controlling is mutual: this is the effect of written frequency after holding initial diphone frequency constant, and vice versa. Basically, you can trust this kind of test for two variables that are not highly correlated. These two variables aren't highly correlated, so this kind of hypothesis testing makes sense.

```{r}
cor(rts$WrittenFrequency, rts$FrequencyInitialDiphoneWord)
```

## Interactions

So far in all these models we've entered multiple predictors that exert their effects somewhat independently, but what if things are more complicated and the predictors actually interact? For example, what if for young people, more frequent words had faster RTs, but for old people, it was the opposite? This isn't the case in our data, but let's make it the case for illustration purposes.

```{r}
rts_mod <- rts %>%
  mutate(RTfake = if_else(AgeSubject == "old", 14 - RTlexdec, RTlexdec))

ggplot(rts_mod, aes(x = WrittenFrequency, y = RTfake)) +
  geom_point(aes(colour = AgeSubject), alpha = 0.2) +
  geom_smooth(method = "lm")
```

If we analyze this the way we did before:

```{r}
mod_lm <- lm(RTfake ~ WrittenFrequency + AgeSubject, data = rts_mod)
summary(mod_lm)
```

We see no effect of WrittenFrequency, which is clearly wrong; the effect is just being obscured by the fact that it goes in opposite directions for the two age groups. So how can we capture the fact that the effect of WrittenFrequency depends on the value of AgeSubject?

$$
\text{RTlexdec} = \beta_0 + \beta_1 \cdot \text{WrittenFrequency} + \beta_2 \cdot \text{AgeSubject} + \beta_3 \cdot \text{WrittenFrequency} \cdot \text{AgeSubject} + \mathcal{N}(0, \sigma)
$$

This multiplicative term is called an __interaction__.

```{r}
mod_int_lm <- lm(RTfake ~ WrittenFrequency * AgeSubject, data = rts_mod)
summary(mod_int_lm)
```

So what this is saying is that there is no __main effect__ of WrittenFrequency, but there is a main effect of AgeSubject (younger people are faster) and the __interaction__ between AgeSubject and WrittenFrequency is significant. The significance of the interaction term means that the slope for the old group is significantly different from the slope for the young group.
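To see what the interaction is capturing, we can redraw the earlier scatter plot with a separate fitted line per age group (this just reuses `geom_smooth` within each group as a visual check; it is not the interaction model itself):

```{r}
ggplot(rts_mod, aes(x = WrittenFrequency, y = RTfake, colour = AgeSubject)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm")
```

The two slopes go in opposite directions, which is exactly what the significant interaction term reflects.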
A note about the coding: if we go back to the dummy coding and rerun the model...

```{r}
contrasts(rts_mod$AgeSubject)
contrasts(rts_mod$AgeSubject) <- c(0, 1)
contrasts(rts_mod$AgeSubject)

mod_int_lm_2 <- lm(RTfake ~ WrittenFrequency * AgeSubject, data = rts_mod)
summary(mod_int_lm_2)
```

What you're getting here is a significant effect of WrittenFrequency, because the coefficient now reflects only the effect of WrittenFrequency for AgeSubject = 0 (i.e. old): this is a __simple effect__. When we had the effects coding, WrittenFrequency was being measured across both levels of AgeSubject, so there was no main effect of WrittenFrequency. Everything else about the model looks about the same, but if you don't know what coding scheme you used, you can't interpret the effects. People misunderstand this ALL THE TIME.

## Logistic regression

So far we've dealt with data where the dependent variable is some continuous quantity; in this case we've looked at RTs. But often in linguistics we want to answer a categorical question. For example, we may want to see what predicts people's tendency to say "yes" or "no", as in the typical comprehension question. In cases where the possible values of the dependent variable are 0 or 1, it's no longer appropriate to use linear regression, because linear regression produces values all over the number line.

For this part, let's imagine that we care about slow vs. fast RTs more coarsely.

```{r}
rts_binary <- rts %>%
  mutate(rt_binary = if_else(RTlexdec > 6.5, 1, 0))

ggplot(rts_binary, aes(x = WrittenFrequency, y = rt_binary)) +
  geom_point(shape = 1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
```

You can see that all the values are 0s and 1s, which makes it hard to see a pattern. Let's bin the x-axis and get means within each bin just to get an idea...

```{r}
rts_binned <- rts_binary %>%
  mutate(freq_bin = cut(WrittenFrequency, 20)) %>%
  group_by(freq_bin) %>%
  summarise(mean_rt = mean(rt_binary))

ggplot(rts_binned, aes(x = freq_bin, y = mean_rt)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```

So we see a familiar relationship: the higher the written frequency, the closer the mean of rt_binary is to 0 (i.e. the more likely a fast RT). But rt_binary is a binary variable (1 or 0), so if we want it to be our outcome, the normal distribution is not appropriate. The errors are not normally distributed around the expected values -- it's all 0s and 1s. We're no longer interested in predicting number values, but rather in __classifying__ our data as 0 or 1. Our generative process should be something that gets a probability that something is in class 1, then flips a coin with that probability to determine its class. Instead of a normal distribution, let's use a __Bernoulli__ distribution. A Bernoulli distribution takes one parameter, called $p$, and returns 1 with probability $p$ and 0 with probability $1 - p$.

```{r}
rbernoulli(n = 10, p = 0.5)
```

So we can still use the concepts from linear regression, where we have an intercept $\beta_0$ and slope $\beta_1$, to figure out the probability of $y$ being 1 or 0 given a certain $x$, and then submit that probability to the Bernoulli distribution.

$$ p = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + ... $$

$$ y \sim \text{Bernoulli}(p) $$

Okay, but the problem is that a probability has to be between 0 and 1, and there's no guarantee that this expression will do that. In our expression so far, $p$ could be any real number from negative infinity to infinity.
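To see the problem concretely, here is a quick check (the extreme frequency value of 30 is made up for illustration; it's far outside the range of the data): a straight line fit to 0/1 data produces fitted "probabilities" outside $[0, 1]$ once you move far enough along the x-axis.

```{r}
# fit a plain linear model to the binary outcome
naive_lm <- lm(rt_binary ~ WrittenFrequency, data = rts_binary)

# with any nonzero slope, the fitted line must eventually leave [0, 1];
# a (hypothetical) frequency of 30 should be enough to push it outside
predict(naive_lm, newdata = tibble(WrittenFrequency = c(0, 30)))
```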
So let's run it through a function that takes any real number and squashes it into the space between 0 and 1, producing values like .9999 or .97 or .0001. We want the values coming out of our model to be really close to 1 or really close to 0, which is similar to our real data. Then if the value is less than .5 we count it as a prediction of 0, and if it's greater than .5 we count it as a prediction of 1. That function is the logistic function. In particular, it maps 0 to .5.

```{r, fig.width=4, fig.height=3}
logistic <- function(x) {
  exp(x) / (1 + exp(x))
}

xs <- runif(10000, -10, 10)
qplot(xs, logistic(xs))
```

Let's try out a few different possible $\beta_0$ and $\beta_1$ values to get a sense of how the logistic function changes as a function of them:

```{r, fig.width=4, fig.height=3}
qplot(xs, logistic( 1 + 0.5 * xs))
qplot(xs, logistic( 3 + 0.5 * xs))
qplot(xs, logistic( 3 + 3 * xs))
qplot(xs, logistic( 1 + 3 * xs))
qplot(xs, logistic( 3 - 0.5 * xs))
qplot(xs, logistic( 0.1 - 0.5 * xs))
qplot(xs, logistic( 0.1 - 5 * xs))
qplot(xs, logistic(-1 - 0.5 * xs))
```

Now we'll say:

$$ p = \text{logistic}( \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + ... ) $$

So let's run a logistic regression. The syntax is very similar to linear regression:

```{r}
freq_glm <- glm(rt_binary ~ WrittenFrequency,
                data = rts_binary,
                family = binomial(link = "logit"))
summary(freq_glm)
```

```{r, fig.width=4, fig.height=3}
freq_glm_tidy <- tidy(freq_glm)
intercept <- filter(freq_glm_tidy, term == "(Intercept)")$estimate
slope <- filter(freq_glm_tidy, term == "WrittenFrequency")$estimate

newxs <- runif(10000, 0, 10)
freq_logistic <- logistic(intercept + slope * newxs)
qplot(newxs, freq_logistic)
```

At the top we have some information about the residuals (same as with linear regression), but here they don't need to be normally distributed. At the very bottom there are some descriptions of how well the model fits the data. One test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e. a null model). The test statistic is the difference between the null deviance and the residual deviance. It is chi-square distributed, with degrees of freedom equal to the difference in degrees of freedom between the current and the null model (i.e. the number of predictor variables in the model). You could compute it, but it's not super informative. AIC is similar: it's useful for comparing models. The number of Fisher scoring iterations is a measure of how much work it took to find the best estimates.

Now the coefficients are somewhat harder to interpret than in linear regression:

* The intercept corresponds to the log odds of being in the slow RT group (rt_binary = 1) when WrittenFrequency is 0. In other words, the odds of being slow for words with frequency 0 are pretty high. In this case this is not super useful information; in general, the intercept is often less interesting in logistic regression.
* Earlier, we could say that the slope corresponded to the change in $y$ for a 1-unit change in $x$. Here it's the same idea, but the change happens before being squashed by the logistic function, so it's a little harder to map onto meaningful numbers (a helper for going back and forth between the two scales is sketched below).
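Since we are about to convert between probabilities and the model's scale, it helps to name the inverse of the logistic function: the logit, or log-odds, function. This small helper is our own addition (the lab code below computes the same quantity inline), included just to make the relationship explicit:

```{r}
# logit maps a probability in (0, 1) back to the real line; it is the
# inverse of the logistic function defined above
logit <- function(p) {
  log(p / (1 - p))
}

# sanity check: logit(logistic(x)) should give back x
logit(logistic(c(-2, 0, 1)))
```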
Let's see what happens if we pull out some plausible values of frequency.

```{r}
# probability of slow given frequency = 5
logistic(intercept + slope * 5)

# probability of slow given frequency = 6
logistic(intercept + slope * 6)

predicted <- augment(freq_glm,
                     newdata = tibble(WrittenFrequency = c(5, 6)),
                     type.predict = "response")
```

So the probability of a slow RT is 0.63 when WrittenFrequency is 5, and 0.55 when WrittenFrequency is 6 (these are probabilities, not log odds, because we ran the linear predictor through the logistic function). Now we can reconstruct the slope by converting these probabilities back to log odds and taking the difference for a 1-unit increase in WrittenFrequency.

```{r}
log_odds_5 <- log(predicted$.fitted[1] / (1 - predicted$.fitted[1]))
log_odds_6 <- log(predicted$.fitted[2] / (1 - predicted$.fitted[2]))

diff_log_odds <- log_odds_6 - log_odds_5
diff_log_odds
```

You can still take a larger slope to indicate a larger effect, but now instead of a 1-unit change in $x$ equaling a 1-unit change in $y$, a 1-unit change in WrittenFrequency corresponds to a -0.32 change in the log odds of rt_binary being 1, i.e. of a slow reaction time. It's pretty hard to imagine what it means for something to decrease in log-odds space. By exponentiating, we can think about the odds ratio (odds for $x = 6$ / odds for $x = 5$).

```{r}
exp(diff_log_odds)
```

This says that the odds of an RT being categorized as slow are multiplied by 0.73 (a multiplier less than 1 means the odds decrease) with every additional unit of WrittenFrequency. Because the relationship between $p(x)$ and $x$ is not linear, we can't easily say anything about the change in proportions for a unit change in $x$.

What about categorical predictors? You can use the same dummy coding or deviation coding approach as before:

```{r}
contrasts(rts_binary$AgeSubject)

age_glm <- glm(rt_binary ~ AgeSubject,
               data = rts_binary,
               family = binomial(link = "logit"))
summary(age_glm)
```

```{r}
age_glm_tidy <- tidy(age_glm)
intercept <- filter(age_glm_tidy, term == "(Intercept)")$estimate
slope <- filter(age_glm_tidy, term == "AgeSubject1")$estimate

old_code <- contrasts(rts_binary$AgeSubject)["old", ]
young_code <- contrasts(rts_binary$AgeSubject)["young", ]

# probability of slow given old
logistic(intercept + slope * old_code)

# probability of slow given young
logistic(intercept + slope * young_code)
```

```{r}
ggplot(rts_binary, aes(x = WrittenFrequency, y = rt_binary, colour = AgeSubject)) +
  geom_point(shape = 1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
```

```{r}
freq_age_glm <- glm(rt_binary ~ AgeSubject * WrittenFrequency,
                    data = rts_binary,
                    family = binomial(link = "logit"))
summary(freq_age_glm)
```

## Linear mixed-effects models

In all these models we have been pretending that the data points in our scatter plot are generated independently, as if each one came from a separate call to some generative process like `rnorm`. But that's not how experiments usually work... Usually each subject sees a bunch of items in one of several conditions, and the same items are seen by many subjects (sometimes there are different lists, but sometimes all subjects see all items). There will be some correlation between data points within subjects, because maybe some subjects are sleepy and tend to respond slowly, or others just drank a bunch of coffee and are responding super fast. This correlation is entirely missed by our regression model. We're analyzing our data as if it were between-subjects, and consequently we're losing statistical power. Same thing with the items: we know now that words have properties that affect how people respond to them, so subjects' responses will be correlated for the same items.
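To make this concrete, here is a minimal simulation (all the numbers are invented) of the kind of data an experiment produces: each subject gets their own baseline shift, so two observations from the same subject are more alike than two observations from different subjects.

```{r}
set.seed(1)
n_subj <- 10
n_items <- 20

# each subject's personal baseline shift, drawn from a normal distribution
subj_shift <- rnorm(n_subj, mean = 0, sd = 0.5)

sim <- expand.grid(subject = 1:n_subj, item = 1:n_items) %>%
  mutate(rt = 6 + subj_shift[subject] + rnorm(n(), sd = 0.2))

# the subject means spread out much more than the trial-level noise alone
# would predict -- that's the within-subject correlation
sim %>%
  group_by(subject) %>%
  summarise(mean_rt = mean(rt))
```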
Let's look at this in the `lexdec` dataset:

```{r}
library(languageR)

ggplot(lexdec, aes(x = Trial, y = RT)) +
  geom_point(shape = 1)

lexdec_subject <- lexdec %>%
  group_by(Subject) %>%
  summarise(mean_rt = mean(RT))

ggplot(lexdec, aes(x = Subject, y = RT)) +
  geom_point(shape = 1, colour = "grey") +
  geom_point(aes(y = mean_rt), data = lexdec_subject, size = 3)
```

You can see that the different subjects have different average RTs, and they also vary in how they are affected by the predictor of interest:

```{r, fig.width=8, fig.height=8}
ggplot(lexdec, aes(x = Trial, y = RT)) +
  facet_wrap(~Subject) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm")
```

Same thing if we look at items:

```{r, fig.width=8, fig.height=6}
lexdec_items <- lexdec %>%
  group_by(Word) %>%
  summarise(mean_rt = mean(RT))

ggplot(lexdec_items, aes(x = Word, y = mean_rt)) +
  geom_text(aes(label = Word)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```

```{r, fig.width=10, fig.height=12}
ggplot(lexdec, aes(x = Trial, y = RT)) +
  facet_wrap(~Word) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm")
```

We need to change the generative process for our data so that it generates data points that are correlated within subject and within item.

```{r}
trial_lm <- lm(RT ~ Trial, data = lexdec)
summary(trial_lm)
```

Previously we had:

$$ y_i \sim \mathcal{N}(\mu_i, \sigma) $$

$$ \mu_i = \beta_0 + \beta_1 \cdot x_{1_i} + \beta_2 \cdot x_{2_i} + \beta_3 \cdot x_{1_i} \cdot x_{2_i} $$

Now let's incorporate the fact that each subject and each item has its own mean.

$$ y_i \sim \mathcal{N}(\mu_i, \sigma) $$

$$ \mu_i = \beta_0 + \gamma_{s_i} + \gamma_{j_i} + ... $$

So the intercept for a particular data point will not just be $\beta_0$, but rather $\beta_0$ plus $\gamma_{s_i}$ for the particular subject, plus $\gamma_{j_i}$ for the particular item. Now we have incorporated different intercepts for different subjects and items into our model.

We need to take one more step. While we know there is variation between subjects and items, we don't think that variation is without bound: we think most subject means will be around the same value, and likewise for most item means. So instead of letting these offsets be arbitrary numbers, let's draw them from a normal distribution:

$$ y_i \sim \mathcal{N}(\mu_i, \sigma) $$

$$ \mu_i = \beta_0 + \gamma_{s_i} + \gamma_{j_i} + ... $$

$$ \gamma_{s_i} \sim \mathcal{N}(0, \sigma_s) $$

$$ \gamma_{j_i} \sim \mathcal{N}(0, \sigma_j) $$

This says the value of $\gamma_{s_i}$ is probably near 0, with values further from 0 becoming less probable, as determined by the normal distribution. $\gamma_{s_i}$ is the extent to which the intercept for a given subject differs from the mean intercept across all subjects. Then we generate a $\gamma_{j_i}$ for each item in the same way. Finally, we generate our data from a normal distribution around the resulting mean.

### Random intercepts

```{r}
library(lme4)
library(lmerTest)

subject_lmer <- lmer(RT ~ Trial + (1 | Subject), data = lexdec)
summary(subject_lmer)

subject_item_lmer <- lmer(RT ~ Trial + (1 | Subject) + (1 | Word), data = lexdec)
summary(subject_item_lmer)
```

So in addition to the fixed effects, you see random effects! Fixed effects are generally the things we care about / that we are testing.
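The "Random effects" block of the summary reports the estimated $\sigma_s$ and $\sigma_j$ (plus the residual $\sigma$). As a small sketch of how to pull those out on their own, `lme4` provides `VarCorr()`:

```{r}
# estimated standard deviations of the subject and item intercept
# distributions, plus the residual standard deviation
VarCorr(subject_item_lmer)
```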
We can also inspect the estimated intercept offset for each individual subject and word with `ranef()`:

```{r}
ranef(subject_item_lmer)
```

Plotting the fitted values, you can see each subject's predictions shifted up or down relative to the predictions from the fixed effects alone (in red):

```{r, fig.width=7}
lmer_augmented <- augment(subject_item_lmer)

ggplot(lmer_augmented, aes(x = Trial)) +
  geom_point(aes(y = .fitted, colour = Subject), shape = 1) +
  geom_point(aes(y = .fixed), colour = "red")
```

```{r, fig.width=10, fig.height=8}
ggplot(lmer_augmented, aes(x = Trial)) +
  facet_wrap(~Subject) +
  geom_point(aes(y = .fitted, colour = Subject), shape = 1) +
  geom_point(aes(y = .fixed), colour = "red")
```

This kind of model is called a __mixed effects regression with random intercepts__. Our old coefficients are called __fixed effects__ because they are fixed in value across all data points. Our new coefficients $\gamma$ are called __random effects__ because they are drawn for each subject and item randomly from a normal distribution.

### Random slopes

You may find that not only do subjects and items have baseline differences, they are also differentially affected by your experimental manipulation.

```{r}
ggplot(lexdec, aes(x = Trial, y = RT)) +
  geom_smooth(aes(colour = Word), method = "lm", se = FALSE, size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_discrete(guide = FALSE)

ggplot(lexdec, aes(x = Trial, y = RT)) +
  geom_smooth(aes(colour = Subject), method = "lm", se = FALSE, size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_discrete(guide = FALSE)
```

In these cases you can run a __mixed effects regression with random slopes__. Here is the syntax for adding slopes:

```{r}
slopes_lmer <- lmer(RT ~ Trial + (Trial | Subject) + (Trial | Word), data = lexdec)
```

Sometimes `lmer` will fail to find a good set of parameter values for you. The first things to try in these cases are to double-check that your model formula is correct and to scale any continuous predictors.

```{r}
lexdec_scaled <- lexdec %>%
  mutate(Trial = scale(Trial))

slopes_lmer <- lmer(RT ~ Trial + (Trial | Subject) + (Trial | Word),
                    data = lexdec_scaled)
summary(slopes_lmer)
```

There are also [some other tricks to try](https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html). If you can't get the model to converge, it often means that you don't have enough data to fit all the parameters you are asking for. This is not great, but there's a semi-reasonable way to deal with it (or at least, it's currently the best approach we've got): you take terms out of the random effects structure in stepwise fashion until the model converges.

You can also model binary dependent variables with mixed-effects models, using `glmer()` with the `family = "binomial"` argument (see the sketch at the end of this section).

Great visualization of mixed effects modeling: http://mfviz.com/hierarchical-models/

More information on `glmm` models: http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-specification
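As a sketch of that last point, here is what a logistic mixed-effects model could look like on the `lexdec` data. The binary outcome (a cutoff of 6.4 on log RT) is invented purely for illustration:

```{r}
# hypothetical binary outcome: was this response on the slow side?
lexdec_binary <- lexdec %>%
  mutate(slow = if_else(RT > 6.4, 1, 0))

# random intercepts for subjects and words, binomial family as with glm()
slow_glmer <- glmer(slow ~ Trial + (1 | Subject) + (1 | Word),
                    data = lexdec_binary, family = "binomial")
summary(slow_glmer)
```

The fixed-effect coefficients are interpreted just as in ordinary logistic regression (in log odds), and the random effects play the same role as in `lmer()`.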