--- title: "Lab 5" author: "9.59 Lab in Psycholinguistics" --- ```{r setup, include=FALSE, cache=FALSE} source("_lab_setup.R") ``` ## Lab 4 review Recall that last lab we learned about hypothesis testing, using _t-values_ and _p-values_. The basic logic of null hypothesis testing is that we choose some uninteresting null hypothesis, like the difference between two group being 0. Then we show that, under this uninteresting hypothesis, the data that we have are very unlikely. If the probability of our data under the null hypothesis is below some threshold (chosen by convention to be 0.05), we say that we have a _significant effect_. The idea is that the observed difference leads us to conclude, from the sheer unlikeliness of our data under the null hypothesis, that the null hypothesis is false, and something else must be true. To figure out this probability, we convert our mean into a t-statistic: $\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$, which is distributed according to a t-distribution with $n-1$ degrees of freedom. Then we get the probability of a t-distribution generating our t-value or greater, or -t-value or lesser. This is called a t-test and run with `t.test()`. We also talked about power -- if there is a real effect, how likely are you to detect it? Underpowered studies are likely to be waste of time and resources (and your sanity). To review these concepts, we're going to run some simulations seeing how power is affected by sample size and effect size. 1. Simulate how power changes across sample sizes for a two-sample t-test. 2. Plot your results (sample size on the x axis, power on the y axis). How does sample size affect power? 3. Abstract the effect size in your experiment into an argument. Simulate and plot how power changes across effect sizes. How does effect size affect power? 
```{r, indent= " ", eval=FALSE} experiment <- function(group_size) { # sample data of group size from each of two groups with different means # return p-value of t-test comparing them } power <- function(group_size) { # run experiment 100 times # return the percentage of experiments that have significant p-values } power_n <- tibble(group_size = 2:200) %>% mutate(power = map_dbl(group_size, power)) ggplot(power_n) # add plot structure power_d <- tibble(d = (1:200)/100) %>% mutate(power = map_dbl(d, ~power(20, .))) ggplot(power_d) # add plot structure ``` ```{r, indent= " "} library(tidyverse) experiment <- function(group_size) { sample1 <- rnorm(group_size, 0, 1) sample2 <- rnorm(group_size, 0.5, 1) p_value <- t.test(sample1, sample2)$p.value return(p_value) } power <- function(group_size) { p_values <- replicate(100, experiment(group_size)) num_sig <- sum(p_values < 0.05) / 100 return(num_sig) } power_n <- tibble(group_size = 2:200) %>% mutate(power = map_dbl(group_size, power)) ggplot(power_n, aes(x = group_size, y = power)) + geom_hline(yintercept = 0.8, colour = "grey", linetype = "dotted") + geom_point() + geom_smooth() ``` ```{r, indent= " "} experiment <- function(group_size, d) { sample1 <- rnorm(group_size, 0, 1) sample2 <- rnorm(group_size, d, 1) p_value <- t.test(sample1, sample2)$p.value return(p_value) } power <- function(group_size, d, alpha) { p_values <- replicate(100, experiment(group_size, d)) num_sig <- sum(p_values < alpha) / 100 return(num_sig) } power_d <- tibble(d = (1:200)/100) %>% mutate(power = map_dbl(d, ~power(20, ., 0.05))) ggplot(power_d, aes(x = d, y = power)) + geom_hline(yintercept = 0.8, colour = "grey", linetype = "dotted") + geom_point() + geom_smooth() ``` More relevant simulations: http://rpsychologist.com/d3/NHST/ So what conclusions can we draw about how we should design experiments as researchers in psycholinguistics? - can't do too much about $\alpha$ because of convention and because it's a reasonable convention that affords a good balance between Type I and Type II error rates - you can increase your sample size - you can try to get good measurements of your effect, with sensitive designs, so the effect size is bigger / the variability is smaller ## Multiple comparisons One thing that's come up recently in the literature is the issue of multiple comparisons. Here's a way researchers sometimes plan studies: I want to see if group A differs from group B in cognitive ability X, but I don't know exactly how to measure X. I'll collect response time, and accuracy, and time spent on response, and also, since none of those might work out, I'll collect some measures of IQ and favorite dog breed. Then I'll see which of those yields significant differences! Do you see the problem here? We know that you can get p-values < 0.05 when the null hypothesis is true, so what happens when we run lots of tests? ```{r} experiment <- function() { sample1 <- rnorm(50, 0, 1) sample2 <- rnorm(50, 0, 1) p_value <- t.test(sample1, sample2)$p.value return(p_value) } tests <- function(num_test) { p_values <- replicate(num_test, experiment()) num_sig <- sum(p_values < 0.05) return(num_sig) } mult_comp <- tibble(num_test = 1:100) %>% mutate(num_sig = map_dbl(num_test, tests)) ggplot(mult_comp, aes(x = num_test, y = num_sig)) + geom_point() + geom_smooth(method = "lm") ``` As you can see, as you run more and more tests, the number of significant p-values increases. So if you throw in enough stuff in your experiment, you're pretty much bound to find a p-value less than 0.05. 
The lesson isn't that you should just have lots of comparisons in your experiments. The lesson is that the threshold of 5% is no longer appropriate when you have multiple comparisons. There are ways to correct the alpha level (Bonferroni, FDR, etc.), but they shouldn't be used as a one-size-fits-all solution.

- [See Simmons et al. (2011)](http://journals.sagepub.com/doi/pdf/10.1177/0956797611417632)
- [Reproducibility Project](http://science.sciencemag.org/content/349/6251/aac4716)
- [Relevant xkcd](https://imgs.xkcd.com/comics/significant.png)

## Linear regression

### Continuous predictors

So far we've looked at the types of data where you have samples from two groups and you want to know whether they were generated by the same generative process or by two separate ones. But a lot of questions in psycholinguistics can't really be answered in this way. For instance, you can't use a t-test to look at whether lexical decision times increase as word frequency increases. This is a bivariate relationship between 2 numerical variables.

```{r}
data_source <- "http://web.mit.edu/psycholinglab/www/data"
rts <- read_csv(file.path(data_source, "rts.csv"))

ggplot(rts) +
  geom_histogram(aes(x = RTlexdec), bins = 60, color = "white")

ggplot(rts) +
  geom_point(aes(x = WrittenFrequency, y = RTlexdec))
```

So in the past we've looked at this using a correlation, which tells us about the strength of the relationship between 2 variables. And you can go beyond that and show whether this relationship is statistically significant, i.e., different from 0 or not, using `cor.test()`.

```{r}
cor.test(rts$RTlexdec, rts$WrittenFrequency)
```

So this looks similar to the output we got from the t-test because what it's doing is taking the Pearson's coefficient, _r_ = -0.43, and finding a corresponding t-statistic (specific formula: $\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$), _t_ = -32.63, based on a t-distribution with df = N-2. For this particular t-statistic, the probability of it occurring given an _r_ of 0 is very, very small. So we can conclude that we have a significant correlation. Note that, in general, a bigger _r_ value is more likely to be significant, but you can have a big _r_ value that's not significant if you have a very small sample size. You can also have very tiny _r_ values that are significant if you have a big enough sample size.

This is very useful, but it's telling us about how the 2 variables relate to each other without making any hypotheses about the direction of that relationship. Often in experiments we have more specific questions because we have manipulated some variable and we're looking at how this affects another variable, or we hypothesize that one variable is driving a change in another variable. Correlation says nothing about that.

This is where linear regression comes in. For two continuous variables, linear regression and correlation are almost the same, but you'll see that regression is actually much more general because you can use it with categorical predictors, multiple predictors, etc.

We've actually been using regression already when we use `geom_smooth()` to draw a line to fit the data...

```{r}
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec)) +
  geom_point() +
  geom_smooth(method = "lm")
```

If you remember from high school math class, the formula for a line is $y = ax + b$. To do linear regression, we'll use just this:

$$ y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i $$

The $\beta_0$ is the point at which the line intersects with $x = 0$: the __intercept__.
The $\beta_1$ is the change in $y$ that corresponds to a unit change in $x$: the __slope__.

The line that is being drawn by `geom_smooth()` has an intercept and slope parameter, but it also has another key component: __error__.

```{r}
freq_lm <- lm(RTlexdec ~ WrittenFrequency, data = rts)
summary(freq_lm)
```

There's a lot of information here, so let's focus first on the coefficients.

```{r}
library(broom)
tidy(freq_lm, conf.int = TRUE)
```

First, let's look at the estimate: the values here are our intercept and slope defining the best fit line. Those estimates are being compared to a t-distribution in the same way that we've done for previous tests, and you obtain a p-value. So here this suggests that:

1. The intercept value is significantly different from 0, so words with 0 WrittenFrequency have a predicted RTlexdec value of 6.74.
2. The slope is significantly different from 0, so for every unit increase in WrittenFrequency, you expect a decrease of 0.04 units of RTlexdec.

So our linear model is:

$$ \text{RTlexdec} = 6.74 - 0.04 \cdot \text{WrittenFrequency} + \mathcal{N}(0, \sigma)$$

In other words, for every value of WrittenFrequency we can find the value of RTlexdec by taking the WrittenFrequency, multiplying it by 0.04, subtracting that from 6.74, and then drawing from a normal distribution centered on that point. Why the normal distribution? That's the error...

```{r}
freq_aug <- augment(freq_lm)
head(freq_aug)

ggplot(freq_aug, aes(x = WrittenFrequency, y = RTlexdec)) +
  geom_point() +
  geom_point(aes(y = .fitted), color = "blue")
```

So clearly the real values don't fall exactly on the line like the predicted values. The line is not a perfect fit, and it couldn't be -- there is no line that would fit every single one of those datapoints. The distances between the real datapoints and this line are what's called error.

```{r}
errors <- freq_aug$RTlexdec - freq_aug$.fitted
residuals <- freq_aug$.resid
all.equal(errors, residuals)

ggplot(freq_aug, aes(sample = .resid)) +
  stat_qq()

ggplot(freq_aug, aes(x = WrittenFrequency, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")
```

The residuals are normally distributed, which is good -- this is the mark of an appropriately applied linear regression model. We're also not seeing any huge differences in the residuals for different values of x (homoscedasticity).

The residuals are the part of the data that we have failed to explain with our model. So we want them to be minimized. In fact, minimizing those residuals is pretty much how the best fit line is picked. (I don't think we need to get into all the detail about this, but the sum of the squared distances between each point and the line is what is being minimized: https://seeing-theory.brown.edu/regression-analysis/index.html#section1)

https://tellmi.psy.lmu.de/felix/lmfit/

The standard deviation of the residuals appears in the summary as the "Residual standard error" (up to a small degrees-of-freedom correction).

```{r}
sd(freq_aug$.resid)
summary(freq_lm)
```

So once you've found this optimal line and you have those predicted/fitted values, you can assess whether this line is actually explaining a lot of the variation in the data or not. This is the same as asking whether variation in $x$ explains variation in $y$ well or not. You can do this by seeing how close the fitted values are to the actual $y$ values.

```{r}
(cor(freq_aug$.fitted, freq_aug$RTlexdec)) ^ 2
```

This squared correlation is the "Multiple R-squared" in the summary, and it is often interpreted as the proportion of variation which your model explains. So frequency here explains 18.9% of the observed variation in RTs.
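As a quick check that this is the same quantity `summary()` reports, here's a minimal sketch computing it directly from the residuals: $r^2 = 1 - \frac{SS_{resid}}{SS_{total}}$, the share of the total variation around the mean that the model accounts for.

```{r}
ss_resid <- sum(freq_aug$.resid ^ 2)                               # unexplained variation
ss_total <- sum((freq_aug$RTlexdec - mean(freq_aug$RTlexdec)) ^ 2) # total variation around the mean

1 - ss_resid / ss_total
summary(freq_lm)$r.squared
```

The squared correlation above, this ratio, and `summary(freq_lm)$r.squared` should all agree.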
The F-test at the very bottom is saying that $r^2$ is significantly different from 0, i.e. the model is explaining a non-zero portion of the variance. ```{r} freq_lm_null <- lm(RTlexdec ~ 1, data = rts) rss_null <- sum(residuals(freq_lm_null) ^ 2) rss <- sum(residuals(freq_lm) ^ 2) n <- nrow(rts) p_null <- 1 p <- 2 f_value <- ((rss_null - rss) / (p - p_null)) / (rss / (n - p - 1)) pf(f_value, p - p_null, n - p, lower.tail = FALSE) ``` This is not a particularly useful value for our purposes, it'll usually be significant... ### __Try it yourself...__ 1. Model the relationship between RTlexdec and WrittenSpokenFrequencyRatio, plot it (using `geom_abline()`), and summarize in words what the result means. 2. Now look at the relationship between RTnaming and WrittenSpokenFrequencyRatio. Do you notice anything different? ```{r, indent=" "} ratio_lm <- lm(RTlexdec ~ WrittenSpokenFrequencyRatio, data = rts) ratio_tidy <- tidy(ratio_lm) ratio_aug <- augment(ratio_lm) summary(ratio_lm) ggplot(ratio_aug, aes(sample = .resid)) + stat_qq() ggplot(ratio_aug, aes(x = WrittenSpokenFrequencyRatio, y = RTlexdec)) + geom_point(color = "grey") + geom_abline(intercept = filter(ratio_tidy, term == "(Intercept)")$estimate, slope = filter(ratio_tidy, term == "WrittenSpokenFrequencyRatio")$estimate, colour = "blue") + geom_smooth(method = lm, colour = "red") ``` ```{r, indent=" "} naming_lm <- lm(RTnaming ~ WrittenSpokenFrequencyRatio, data = rts) naming_aug <- augment(naming_lm) summary(naming_lm) ggplot(naming_aug, aes(sample = .resid)) + stat_qq() ggplot(naming_aug, aes(x = WrittenSpokenFrequencyRatio, y = RTnaming)) + geom_point(color = "grey") + geom_smooth(method = lm, colour = "red") ``` Here we looked at one continuous predictor and a continuous outcome, but regression is much more general than that. So we're going to look at a few other ways you can use it. ### Categorical predictors Let's say we want to look at the effect of being in one of 2 (or more) groups on some response. For example, we can look at RTlexdec as a function of the age of the subjects. ```{r} rts_age <- rts %>% group_by(AgeSubject) %>% summarise(mean_rt = mean(RTlexdec)) ggplot(rts, aes(x = AgeSubject, y = RTlexdec)) + geom_violin() + geom_point(aes(y = mean_rt), data = rts_age, size = 3) ``` ```{r} age_lm <- lm(RTlexdec ~ AgeSubject, data = rts) summary(age_lm) ``` This looks very familiar but if we go back to our line equation, something is weird. $$ \text{RTlexdec} = 6.66 - 0.22 \cdot \text{AgeSubject} + \mathcal{N}(0, \sigma)$$ How do you multiply AgeSubject, which is either the word "old" or the word "young", by a number? Turns out R automatically turns "old" and "young" into 0s and 1s. But you can't see this until you convert it into a factor... ```{r} rts <- rts %>% mutate(AgeSubject = factor(AgeSubject)) levels(rts$AgeSubject) contrasts(rts$AgeSubject) ``` It is treating "old" as 0 and "young" as 1 and it just did this alphabetically by default. This is called **dummy coding**. So how do we read the output of our model? ```{r} age_tidy <- tidy(age_lm) age_tidy ``` The Intercept corresponds to the y value when the dummy coded variable is 0, so the intercept is the average RTlexdec for old people. ```{r} rts_age ``` AgeSubjectyoung is the coefficient that you add when the dummy coded variable is 1. So the average RTlexdec for young people is -0.22 less than the intercept. 
```{r}
intercept <- filter(age_tidy, term == "(Intercept)")$estimate
slope <- filter(age_tidy, term == "AgeSubjectyoung")$estimate
old_code <- contrasts(rts$AgeSubject)["old", ]
young_code <- contrasts(rts$AgeSubject)["young", ]

# old RTs
intercept + slope * old_code

# young RTs
intercept + slope * young_code
```

### Dummy coding vs. effects coding

Another option is to do what is sometimes called **effects coding**. Instead of the default dummy coding that R assigns, you can change the contrast codes to something that is more meaningful to you.

```{r}
contr.treatment(2)
contr.sum(2)

contrasts(rts$AgeSubject) <- contr.sum
contrasts(rts$AgeSubject)
```

So now the indicator variable for AgeSubject is 1 for "old" and -1 for "young". How will this affect our regression?

```{r}
age_lm_2 <- lm(RTlexdec ~ AgeSubject, data = rts)
age_tidy_2 <- tidy(age_lm_2)
summary(age_lm_2)
```

The model is telling us the same story in a different parameterization: going from old (coded 1) to young (coded -1) is now a two-unit change in the indicator, so the coefficient is half the old/young difference, but that difference itself is still the same 0.22. Note, though, that the intercept is different. Now the intercept is the grand mean (across all conditions). What if we rewrite our formulas from earlier:

```{r}
intercept <- filter(age_tidy_2, term == "(Intercept)")$estimate
slope <- filter(age_tidy_2, term == "AgeSubject1")$estimate
old_code <- contrasts(rts$AgeSubject)["old", ]
young_code <- contrasts(rts$AgeSubject)["young", ]

# old RTs
intercept + slope * old_code

# young RTs
intercept + slope * young_code
```

We still get the exact same predicted group means. In this case, where there's just one predictor, the choice of contrast coding is entirely a matter of preference. In multiple regression, where you have many predictors potentially interacting, the right contrast coding can make a huge difference for how you interpret what's going on in your model, specifically whether you are looking at simple effects or main effects. There are many different coding schemes for testing various hypotheses; more examples: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

### Multiple predictors

Now remember when we looked at the histogram, we saw that the RTs aren't quite normally distributed in that there were two peaks. And now we know that young and old subjects respond differently. Let's incorporate that into our regression. First some plots:

```{r}
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) +
  geom_point(alpha = 0.1)
```

We see there are two groups here. Remember we are evaluating our model by seeing how close the fitted values are to the real data points, and we can get a lot closer if we have different lines for the different age groups.

```{r}
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")
```

Let's make a regression model with multiple predictors:

$$ \text{RTlexdec} = \beta_0 + \beta_1 \cdot \text{WrittenFrequency} + \beta_2 \cdot \text{AgeSubject} + \mathcal{N}(0, \sigma)$$

```{r}
freq_age_lm <- lm(RTlexdec ~ WrittenFrequency + AgeSubject, data = rts)
summary(freq_age_lm)
```

* So now we have the intercept, which is the mean of RTlexdec for WrittenFrequency = 0, averaged across AgeSubject (because of the effects coding).
* The effect of WrittenFrequency (-0.037) is pretty much the same as before, which makes sense because we were seeing the same effect of WrittenFrequency in both groups.
* The effect of age is very similar in size to when it was the sole predictor in the model as well.
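One way to see what this additive model is claiming: because there is no interaction term, it fits two parallel lines, one per age group, that share a slope and differ only in their intercepts. Here's a small sketch reconstructing those lines from the fitted coefficients (it assumes the sum coding set above, so the age term is named `AgeSubject1`):

```{r}
freq_age_tidy <- tidy(freq_age_lm)
b0 <- filter(freq_age_tidy, term == "(Intercept)")$estimate
b_freq <- filter(freq_age_tidy, term == "WrittenFrequency")$estimate
b_age <- filter(freq_age_tidy, term == "AgeSubject1")$estimate

old_code <- contrasts(rts$AgeSubject)["old", ]
young_code <- contrasts(rts$AgeSubject)["young", ]

# one line per group: same slope, intercept shifted by the age effect
ggplot(rts, aes(x = WrittenFrequency, y = RTlexdec, colour = AgeSubject)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = b0 + b_age * old_code, slope = b_freq) +
  geom_abline(intercept = b0 + b_age * young_code, slope = b_freq)
```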
Note that both the effects of Age and Frequency are significant, suggesting that they are both independently contributing to explaining the variance in RTlexdec. Critically, the $r^2$ is much larger, suggesting we're doing a much better job of fitting the data than with the previous model. All that extra variance is now being explained by adding the Age factor.

> Linear model:
> $$ y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \ldots + \beta_k \cdot x_{ik} + \epsilon_i $$

### Interactions

So far in all these models we've entered multiple predictors that can exert their effects somewhat independently, but what if things are more complicated and they actually interact? For example, what if for young people, more frequent words had faster RTs, but for old people, it was the opposite? This isn't the case in our data, but let's make it the case for illustration purposes.

```{r}
rts_mod <- rts %>%
  mutate(RTfake = if_else(AgeSubject == "old", 14 - RTlexdec, RTlexdec))

ggplot(rts_mod, aes(x = WrittenFrequency, y = RTfake, colour = AgeSubject)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm")
```

If we analyze this the way we did before:

```{r}
mod_lm <- lm(RTfake ~ WrittenFrequency + AgeSubject, data = rts_mod)
summary(mod_lm)
```

We see no effect of WrittenFrequency, which is clearly wrong -- it's just being obscured by the fact that the effect goes in opposite directions for the two age groups. So how can we include the fact that the effect of WrittenFrequency varies depending on the value of AgeSubject?

$$ \text{RTfake} = \beta_0 + \beta_1 \cdot \text{WrittenFrequency} + \beta_2 \cdot \text{AgeSubject} + \beta_3 \cdot \text{WrittenFrequency} \cdot \text{AgeSubject} + \mathcal{N}(0, \sigma) $$

This multiplicative term is called an __interaction__.

```{r}
mod_int_lm <- lm(RTfake ~ WrittenFrequency * AgeSubject, data = rts_mod)
summary(mod_int_lm)
```

So what this is saying is that there is no __main effect__ of WrittenFrequency, but there is a main effect of AgeSubject (younger people are faster), and the __interaction__ between AgeSubject and WrittenFrequency is significant. The significance of the interaction term here means that the slope on top is significantly different from the slope on the bottom.

A note about the coding: if we go back to the dummy coding and rerun the model...

```{r}
contrasts(rts_mod$AgeSubject)
contrasts(rts_mod$AgeSubject) <- c(0, 1)
contrasts(rts_mod$AgeSubject)

mod_int_lm_2 <- lm(RTfake ~ WrittenFrequency * AgeSubject, data = rts_mod)
summary(mod_int_lm_2)
```

What you're getting here is a significant effect of WrittenFrequency because it's looking only at the effect of WrittenFrequency for AgeSubject = 0 (i.e. old) -- this is a __simple effect__. When we had the effects coding, WrittenFrequency was being measured across both levels of Age, so there was no main effect of WrittenFrequency. Everything else about the model looks about the same, but if you don't know what coding scheme you used, you can't interpret the effects. People misunderstand this ALL THE TIME.

### Model comparison

Let's consider the case of two real-valued predictors. Let's look at the variable FrequencyInitialDiphoneWord, which is the frequency of the first two phonemes in a word. Let's say you think that frequency is what determines RT, but then someone else thinks initial diphone frequency is what matters. Let's evaluate that claim.
```{r}
diphone_lm <- lm(RTlexdec ~ FrequencyInitialDiphoneWord, data = rts)
summary(diphone_lm)
```

We see that initial diphone frequency does predict RT. But what if we include both predictors, initial diphone frequency and written frequency?

```{r}
freq_diphone_lm <- lm(RTlexdec ~ WrittenFrequency + FrequencyInitialDiphoneWord, data = rts)
summary(freq_diphone_lm)
```

What does this mean? When you have two linear predictors, and try to find coefficients that minimize the squared distance to the datapoints, then to get the best coefficients, you estimate the effect of variable a while holding variable b constant, and the effect of variable b while holding variable a constant. So this means, approximately: while holding WrittenFrequency constant, initial diphone frequency has this effect on RTs, which is not significant. And controlling for initial diphone frequency -- holding it constant -- we estimate this is the slope for WrittenFrequency, basically the same slope as we found before. This offers some evidence that WrittenFrequency is what matters here, not initial diphone frequency.

You often find yourself in the position where there are multiple things that predict patterns in your dataset, and you want to figure out which one is more basic, or which one causes the others, or which one is real and which one is confounded. This kind of test offers some evidence about that question, but there are lots and lots of caveats to interpreting this p-value this way. There are better tests to do (one option, comparing nested models directly with `anova()`, is sketched at the end of this lab). The principal thing that makes this hard to interpret is that the controlling is mutual here: this is the effect of written frequency after holding initial diphone frequency constant, and vice versa. Basically, you can trust this kind of test for two variables that are not highly correlated. These two variables aren't highly correlated, so this kind of hypothesis testing makes sense.

```{r}
cor(rts$WrittenFrequency, rts$FrequencyInitialDiphoneWord)
```

### __Try it yourself__...

- Using the `rts.csv` dataset, pick a dependent variable (`RTlexdec` or `RTnaming`), a continuous predictor (e.g. `WrittenFrequency`), and a categorical predictor (e.g. `AgeSubject`) (see `?languageR::english` for info on the variables).
- Fit a linear model predicting your dependent variable from both of your predictors and their interaction.
- Using the estimated parameters from your model, recreate the expected values of your dependent variable for a given value of your continuous predictor (e.g. `WrittenFrequency`=6) and each possible value of your categorical predictor (e.g. `old` and `young` for `AgeSubject`).
- Double check your values using `augment()`.
```{r, indent=" "}
contrasts(rts$AgeSubject) <- contr.sum

freq_age_lm <- lm(RTlexdec ~ WrittenFrequency * AgeSubject, data = rts)
freq_age_results <- tidy(freq_age_lm)

intercept <- filter(freq_age_results, term == "(Intercept)")$estimate
slope_age <- filter(freq_age_results, term == "AgeSubject1")$estimate
slope_freq <- filter(freq_age_results, term == "WrittenFrequency")$estimate
int_freq_age <- filter(freq_age_results, term == "WrittenFrequency:AgeSubject1")$estimate

old_code <- contrasts(rts$AgeSubject)["old", ]
young_code <- contrasts(rts$AgeSubject)["young", ]
freq <- 6

# old RTs
intercept + freq * slope_freq + old_code * slope_age + freq * old_code * int_freq_age

# young RTs
intercept + freq * slope_freq + young_code * slope_age + freq * young_code * int_freq_age

test_data <- tibble(WrittenFrequency = c(6, 6),
                    AgeSubject = c("old", "young"))
augment(freq_age_lm, newdata = test_data)
```
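Finally, coming back to the model comparison question from the WrittenFrequency vs. FrequencyInitialDiphoneWord example above: a more direct way to ask whether adding a predictor buys you anything is to compare nested models with an F-test via `anova()`. This is just a sketch of that idea using the models fit earlier (`freq_lm`, `diphone_lm`, and `freq_diphone_lm`); it tests whether the larger model reduces the residual variance more than you'd expect by chance.

```{r}
# does adding WrittenFrequency improve on initial diphone frequency alone?
anova(diphone_lm, freq_diphone_lm)

# does adding initial diphone frequency improve on WrittenFrequency alone?
anova(freq_lm, freq_diphone_lm)
```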