---
title: "Lab 5"
author: "9.59 Lab in Psycholinguistics"
---

```{r setup, include=FALSE, cache=FALSE}
source("_lab_setup.R")
```

```{r}
library(tidyverse)
```

## Multiple comparisons

One thing that's come up recently in the literature is the issue of multiple comparisons. Here's a way researchers sometimes plan studies: I want to see if group A differs from group B in cognitive ability X, but I don't know exactly how to measure X. I'll collect response time, and accuracy, and time spent on response, and also, since none of those might work out, I'll collect some measures of IQ and favorite dog breed. Then I'll see which of those yields significant differences!

Do you see the problem here? We know that you can get p-values < 0.05 when the null hypothesis is true, so what happens when we run lots of tests?

```{r}
experiment <- function() {
  sample1 <- rnorm(50, 20, 10)
  sample2 <- rnorm(50, 20, 10)
  p_value <- t.test(sample1, sample2)$p.value
  return(p_value)
}

tests <- function(num_test) {
  p_values <- replicate(num_test, experiment())
  num_sig <- sum(p_values < 0.05)
  return(num_sig)
}

mult_comp <- tibble(num_test = 1:100) %>%
  mutate(num_sig = map_dbl(num_test, tests))

ggplot(mult_comp, aes(x = num_test, y = num_sig)) +
  geom_point() +
  geom_smooth(method = "lm")
```

As you can see, as you run more and more tests, the number of significant p-values increases. So if you throw enough stuff into your experiment, you're pretty much bound to find a p-value less than 0.05.

The lesson isn't that you should just have lots of comparisons in your experiments. The lesson is that the threshold of 5% is no longer appropriate when you have multiple comparisons. There are ways to correct the alpha level (Bonferroni, FDR, etc.), shown in the sketch after the links below, but they shouldn't be used as a one-size-fits-all solution.

- [See Simmons et al. (2011)](http://journals.sagepub.com/doi/pdf/10.1177/0956797611417632)
- [Reproducibility Project](http://science.sciencemag.org/content/349/6251/aac4716)
- [Relevant xkcd](https://imgs.xkcd.com/comics/significant.png)
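As a quick illustration, base R's `p.adjust()` implements several of these corrections. A minimal sketch, reusing the `experiment()` function from above:

```{r}
# 20 experiments in which the null hypothesis is true
p_values <- replicate(20, experiment())

# Uncorrected: each test has a 5% false positive rate
sum(p_values < 0.05)

# Bonferroni: multiplies each p-value by the number of tests (capped at 1)
sum(p.adjust(p_values, method = "bonferroni") < 0.05)

# Benjamini-Hochberg: controls the false discovery rate instead
sum(p.adjust(p_values, method = "BH") < 0.05)
```

With a correction applied, the number of significant results should drop back to (near) zero even as the number of tests grows.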
### __Try it yourself...__

1. Simulate how power changes across sample sizes for a two-sample t-test.
2. Plot your results (sample size on the x axis, power on the y axis).
3. Increase or decrease your effect size and re-run your simulation/plot. How do sample size and effect size affect power?

```{r, indent= "    ", eval=FALSE}
experiment <- function(group_size) {
  # sample data of group_size from each of two groups with different means
  # return the p-value of a t-test comparing them
}

power <- function(group_size) {
  # run experiment() 100 times
  # return the proportion of experiments with significant p-values
}

power_levels <- tibble(group_size = 2:200) %>%
  mutate(sample_size = group_size * 2,
         power = map_dbl(group_size, power))

ggplot(power_levels) # add plot structure
```

```{r, indent= "    "}
experiment <- function(group_size) {
  sample1 <- rnorm(group_size, 20, 10)
  sample2 <- rnorm(group_size, 25, 10)
  p_value <- t.test(sample1, sample2)$p.value
  return(p_value)
}

power <- function(group_size) {
  p_values <- replicate(100, experiment(group_size))
  prop_sig <- sum(p_values < 0.05) / 100
  return(prop_sig)
}

power_levels <- tibble(group_size = 2:200) %>%
  mutate(sample_size = group_size * 2,
         power = map_dbl(group_size, power))

ggplot(power_levels, aes(x = sample_size, y = power)) +
  geom_hline(yintercept = 0.8, colour = "grey", linetype = "dotted") +
  geom_point() +
  geom_smooth()
```

## Regression

So far we've looked at the kinds of data where you have samples from groups and you want to know whether they were generated by the same generative process or by two separate ones. But a lot of questions in psycholinguistics can't really be answered this way. For instance, you can't use a t-test to look at whether lexical decision times increase as word frequency increases. This is a bivariate relationship between two numerical variables.

```{r}
data_source <- "http://web.mit.edu/psycholinglab/data/"
rts <- read_csv(file.path(data_source, "rts.csv"))

ggplot(rts) +
  geom_histogram(aes(x = RTlexdec), bins = 60, color = "white")

ggplot(rts) +
  geom_point(aes(x = WrittenFrequency, y = RTlexdec))
```

In the past we've looked at this kind of question using a correlation, which tells us about the strength of the relationship between two variables. You can go beyond that and test whether this relationship is statistically significant, i.e., different from 0 or not, using `cor.test()`.

```{r}
cor.test(rts$RTlexdec, rts$WrittenFrequency)
```

This looks similar to the output we got from the t-test because what it's doing is taking the Pearson's coefficient, _r_ = -0.43, and finding the corresponding t-statistic (specific formula: $\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$), _t_ = -32.63, based on a t-distribution with df = $n - 2$. For this particular t-statistic, the probability of it occurring given _r_ = 0 is very, very small. So we can conclude that we have a significant correlation.
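You can convince yourself of that formula by computing the t-statistic by hand and comparing it to the `cor.test()` output above:

```{r}
# t-statistic from the correlation coefficient: t = r * sqrt(n - 2) / sqrt(1 - r^2)
r <- cor(rts$RTlexdec, rts$WrittenFrequency)
n <- nrow(rts)
r * sqrt(n - 2) / sqrt(1 - r^2)  # should match the t reported by cor.test()
```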
Note that, in general, a bigger _r_ value is more likely to be significant, but you can have a big _r_ value that's not significant if you have a very small sample size. You can also have very tiny _r_ values that are significant if you have a big enough sample size.

This is very useful, but it's telling us how the two variables relate to each other without making any hypotheses about the direction of that relationship. Often in experiments we have more specific questions, because we have manipulated some variable and we're looking at how this affects another variable, or we hypothesize that one variable is driving a change in another variable. Correlation says nothing about that. This is where linear regression comes in. For two continuous variables, linear regression and correlation are almost the same, but you'll see that regression is actually much more general because you can use it with categorical predictors, multiple predictors, etc.

We've actually been using regression already when we use `geom_smooth()` to draw a line to fit the data...

```{r}
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec)) +
  geom_point() +
  geom_smooth(method = "lm")
```

If you remember from high school math class, the formula for a line is $y = ax + b$. Linear regression uses essentially the same equation:

$$ y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i $$

The $\beta_0$ is the value of $y$ where the line crosses $x = 0$: the __intercept__. The $\beta_1$ is the change in $y$ that corresponds to a unit change in $x$: the __slope__. The line being drawn by `geom_smooth()` has an intercept and a slope parameter, but it also has another key component: __error__.

```{r}
freq_lm <- lm(RTlexdec ~ WrittenFrequency, data = rts)
summary(freq_lm)
```

There's a lot of information here, so let's focus first on the coefficients.

```{r}
library(broom)
tidy(freq_lm, conf.int = TRUE)
```

First, let's look at the estimates: the values here are our intercept and slope defining the best fit line. Those estimates are compared to a t-distribution in the same way we've done for previous tests, and you obtain a p-value. So here this suggests that:

1. The intercept is significantly different from 0, so words with a WrittenFrequency of 0 have a predicted RTlexdec value of 6.74.
2. The slope is significantly different from 0, so for every unit increase in WrittenFrequency, you expect a decrease of 0.04 units of RTlexdec.

So our linear model is:

$$ \text{RTlexdec} = 6.74 - 0.04 \cdot \text{WrittenFrequency} + \mathcal{N}(0, \sigma)$$

In other words, for every value of WrittenFrequency we can find the value of RTlexdec by multiplying the WrittenFrequency by 0.04, subtracting that from 6.74, and then drawing from a normal distribution centered on that point. Why the normal distribution? That's the error...

```{r}
freq_aug <- augment(freq_lm)
head(freq_aug)

ggplot(freq_aug, aes(x = WrittenFrequency, y = RTlexdec)) +
  geom_point() +
  geom_point(aes(y = .fitted), color = "blue")
```

So clearly the real values don't fall exactly on the line like the predicted values do. The line is not a perfect fit, and it couldn't be -- there is no line that would fit every single one of those datapoints. The distances between the real datapoints and this line are what's called error.

```{r}
errors <- freq_aug$RTlexdec - freq_aug$.fitted
residuals <- freq_aug$.resid
all.equal(errors, residuals)

ggplot(freq_aug, aes(sample = .resid)) +
  stat_qq()

ggplot(freq_aug, aes(x = WrittenFrequency, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")
```

The residuals are normally distributed, which is good -- this is the mark of an appropriately applied linear regression model. We're also not seeing any huge differences in the spread of the residuals for different values of $x$ (homoscedasticity).

The residuals are the part of the data that we have failed to explain with our model, so we want them to be minimized. In fact, minimizing those residuals is pretty much how the best fit line is picked. (We won't get into all the details here, but it's the sum of the squared distances between each point and the line that is being minimized: http://students.brown.edu/seeing-theory/regression/index.html#first)

http://shinyapps.org/showapp.php?app=http://87.106.45.173:3838/felix/lmfit&by=Felix%20Sch%C3%B6nbrodt&title=Find-a-fit!&shorttitle=Find-a-fit
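In the single-predictor case, the least-squares line even has a closed-form solution. A minimal sketch, checking it against the coefficients `lm()` found:

```{r}
# Least squares for one predictor:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
x <- rts$WrittenFrequency
y <- rts$RTlexdec
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
c(intercept, slope)  # should match the estimates from lm()
coef(freq_lm)
```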
The spread of the residuals appears in the model summary as the "residual standard error". (The `sd()` below is close but not identical, since the residual standard error divides by the residual degrees of freedom, $n - 2$, rather than $n - 1$.)

```{r}
sd(freq_aug$.resid)
summary(freq_lm)
```

So once you've found this optimal line and you have those predicted/fitted values, you can assess whether the line is actually explaining a lot of the variation in the data or not. This is the same as asking whether variation in $x$ explains variation in $y$ well or not. You can do this by seeing how close the fitted values are to the actual $y$ values.

```{r}
(cor(freq_aug$.fitted, freq_aug$RTlexdec)) ^ 2
```

This is the $r^2$ ("Multiple R-squared") in the summary. It is often interpreted as the proportion of variance that your model explains. So frequency here explains 18.9% of the observed variation in RTs.

The F-test at the very bottom is saying that $r^2$ is significantly different from 0, i.e. the model is explaining a non-zero portion of the variance.

```{r}
freq_lm_null <- lm(RTlexdec ~ 1, data = rts)

rss_null <- sum(residuals(freq_lm_null) ^ 2)
rss <- sum(residuals(freq_lm) ^ 2)

n <- nrow(rts)
p_null <- 1  # parameters in the null model (intercept only)
p <- 2       # parameters in our model (intercept + slope)

# F = (reduction in RSS per extra parameter) / (residual variance per residual df)
f_value <- ((rss_null - rss) / (p - p_null)) / (rss / (n - p))
pf(f_value, p - p_null, n - p, lower.tail = FALSE)
```

This is not a particularly useful value for our purposes; it'll usually be significant...

### __Try it yourself...__

1. Model the relationship between RTlexdec and WrittenSpokenFrequencyRatio, plot it, and summarize in words what the result means.
2. Now look at the relationship between RTnaming and WrittenSpokenFrequencyRatio. Do you notice anything different?

```{r, indent= "    "}
ratio_lm <- lm(RTlexdec ~ WrittenSpokenFrequencyRatio, data = rts)
ratio_tidy <- tidy(ratio_lm)
ratio_aug <- augment(ratio_lm)
summary(ratio_lm)

ggplot(ratio_aug, aes(sample = .resid)) +
  stat_qq()

ggplot(ratio_aug, aes(x = WrittenSpokenFrequencyRatio, y = RTlexdec)) +
  geom_point(color = "grey") +
  geom_abline(intercept = filter(ratio_tidy, term == "(Intercept)")$estimate,
              slope = filter(ratio_tidy, term == "WrittenSpokenFrequencyRatio")$estimate,
              colour = "blue") +
  geom_smooth(method = lm, colour = "red")
```

```{r, indent= "    "}
naming_lm <- lm(RTnaming ~ WrittenSpokenFrequencyRatio, data = rts)
naming_aug <- augment(naming_lm)
summary(naming_lm)

ggplot(naming_aug, aes(sample = .resid)) +
  stat_qq()

ggplot(naming_aug, aes(x = WrittenSpokenFrequencyRatio, y = RTnaming)) +
  geom_point(color = "grey") +
  geom_smooth(method = lm, colour = "red")
```

Here we looked at one continuous predictor and a continuous outcome, but regression is much more general than that. So we're going to look at a few other ways you can use it.

### Categorical predictors

Let's say we want to look at the effect of being in one of two (or more) groups on some response. For example, we can look at RTlexdec as a function of the age of the subjects.

```{r}
rts_age <- rts %>%
  group_by(AgeSubject) %>%
  summarise(mean_rt = mean(RTlexdec))

ggplot(rts, aes(x = AgeSubject, y = RTlexdec)) +
  geom_violin() +
  geom_point(aes(y = mean_rt), data = rts_age, size = 3)
```

```{r}
age_lm <- lm(RTlexdec ~ AgeSubject, data = rts)
summary(age_lm)
```

This looks very familiar, but if we go back to our line equation, something is weird.

$$ \text{RTlexdec} = 6.66 - 0.22 \cdot \text{AgeSubject} + \mathcal{N}(0, \sigma)$$

How do you multiply AgeSubject, which is either the word "old" or the word "young", by a number? It turns out R automatically turns "old" and "young" into 0s and 1s. But you can't see this until you convert it into a factor...

```{r}
rts$AgeSubject <- factor(rts$AgeSubject)
contrasts(rts$AgeSubject)
```

It is treating "old" as 0 and "young" as 1, and it did this alphabetically by default. This is called **dummy coding**.
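You can see the numeric coding R actually uses by looking at the design matrix it builds for the regression:

```{r}
# Each row gets an intercept column and a 0/1 dummy column for AgeSubject
head(model.matrix(RTlexdec ~ AgeSubject, data = rts))
```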
So how do we read the output of our model?

```{r}
age_tidy <- tidy(age_lm)
age_tidy
```

The intercept corresponds to the $y$ value when the dummy coded variable is 0, so the intercept is the average RTlexdec for the old subjects.

```{r}
rts_age
```

AgeSubjectyoung is the coefficient that you add when the dummy coded variable is 1. So the average RTlexdec for young subjects is 0.22 lower than the intercept.

```{r}
intercept <- filter(age_tidy, term == "(Intercept)")$estimate
slope <- filter(age_tidy, term == "AgeSubjectyoung")$estimate
old_code <- contrasts(rts$AgeSubject)["old", ]
young_code <- contrasts(rts$AgeSubject)["young", ]

# old RTs
intercept + slope * old_code

# young RTs
intercept + slope * young_code
```

### Dummy coding vs. effects coding

Another option is to do what is sometimes called **effects coding**. Instead of the default dummy coding that R assigns, you can change the contrast codes to something that is more meaningful to you.

```{r}
contrasts(rts$AgeSubject) <- c(-0.5, 0.5)
contrasts(rts$AgeSubject)
```

So now the indicator variable for AgeSubject is -0.5 for "old" and 0.5 for "young". How will this affect our regression?

```{r}
age_lm_2 <- lm(RTlexdec ~ AgeSubject, data = rts)
age_tidy_2 <- tidy(age_lm_2)
summary(age_lm_2)
```

As you can see, the slope is exactly the same -- it's still telling us that going from old to young (a 1 unit change) means subtracting 0.22. But note that the intercept is different. Now the intercept is the grand mean (across all conditions). What if we rewrite our formulas from earlier:

```{r}
intercept <- filter(age_tidy_2, term == "(Intercept)")$estimate
slope <- filter(age_tidy_2, term == "AgeSubject1")$estimate
old_code <- contrasts(rts$AgeSubject)["old", ]
young_code <- contrasts(rts$AgeSubject)["young", ]

# old RTs
intercept + slope * old_code

# young RTs
intercept + slope * young_code
```

We still get the exact same thing. In this case, where there's just one predictor, the choice of contrast coding is entirely a matter of preference. In multiple regression, where you have many predictors potentially interacting, the right contrast coding can make a huge difference for how you interpret what's going on in your model, specifically whether you are looking at simple effects or main effects. There are many different coding schemes for testing various hypotheses; for more examples, see https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/ and the sketch below.
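Base R also ships several ready-made coding schemes; here's a quick look at a few of them for a two-level factor:

```{r}
contr.treatment(2)  # the default dummy coding: 0 and 1
contr.sum(2)        # sum ("deviation") coding: 1 and -1
contr.helmert(2)    # Helmert coding: -1 and 1 for two levels
```

Any of these matrices can be assigned with `contrasts()`, exactly as we did above with `c(-0.5, 0.5)`.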
### Multiple Predictors

Now remember that when we looked at the histogram, we saw the RTs aren't quite normally distributed: there were two peaks. And now we know that young and old subjects respond differently. Let's incorporate that into our regression. First, some plots:

```{r}
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) +
  geom_point(alpha = 0.1)
```

We see there are two groups here. Remember, we are evaluating our model by seeing how close the fitted values are to the real data points, and we can get a lot closer if we have different lines for the different age groups.

```{r}
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")
```

Let's make a regression model with multiple predictors:

$$ \text{RTlexdec} = \beta_0 + \beta_1 \cdot \text{WrittenFrequency} + \beta_2 \cdot \text{AgeSubject} + \mathcal{N}(0, \sigma)$$

```{r}
freq_age_lm <- lm(RTlexdec ~ WrittenFrequency + AgeSubject, data = rts)
summary(freq_age_lm)
```

* Now we have the intercept, which is the mean of RTlexdec when WrittenFrequency = 0, averaged across the AgeSubject levels (because of the effects coding).
* The effect of WrittenFrequency (-0.037) is pretty much the same as before, which makes sense because we were seeing the same effect of WrittenFrequency in both groups.
* The effect of age is also very similar in size to when it was the sole predictor in the model.

Note that the effects of both Age and Frequency are significant, suggesting that they are both independently contributing to explaining the variance in RTlexdec. Critically, the $r^2$ is much larger, suggesting we're doing a much better job of fitting the data than with the previous model. All that extra variance is now being explained by adding the Age factor.

> Linear model:
> $$ y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \ldots + \beta_k \cdot x_{ik} + \epsilon_i $$
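One way to confirm that adding AgeSubject genuinely improves the fit is to compare the two nested models directly. A minimal sketch using `anova()`, which runs an F-test on the reduction in residual sum of squares (the same logic as the by-hand F-test earlier):

```{r}
# Does the two-predictor model explain significantly more variance
# than the frequency-only model?
anova(freq_lm, freq_age_lm)
```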