---
title: "Lab 2"
---

```{r setup, include=FALSE, cache=FALSE}
source("_lab_setup.R")
```

## Lab 1 review

Last week we started learning how to manipulate and visualize data. Let's re-explore the in-lab lexical decision task from last week.

### __Try it yourself...__

How does RT depend on word length? Is this relationship different for real and fake words? Plot each word's mean RT against its length, separated by real/fake, with an `lm` smoother:

1. Load the tidyverse and read in the data with `read_csv()`.
1. Add columns for word_type and word_length.
1. Get the mean RT for each word.
1. Plot!

```{r, indent = " ", fig.width=8}
library(tidyverse)

rt_data <- read_csv("http://web.mit.edu/psycholinglab/www/data/in_lab_rts_2020.csv")

words <- unique(rt_data$word)
real_words <- words[c(1, 3, 5, 7, 8, 10, 12, 15, 17, 18, 21, 22)]

rt_words <- rt_data %>%
  mutate(word_length = str_length(word),
         word_type = if_else(word %in% real_words, "Real", "Fake")) %>%
  group_by(word, word_type, word_length) %>%
  summarise(word_rt = mean(rt)) %>%
  ungroup() %>%
  mutate(word_type = fct_relevel(word_type, "Real", "Fake"))

ggplot(rt_words, aes(x = word_length, y = word_rt)) +
  geom_text(aes(label = word)) +
  geom_smooth(method = "lm") +
  facet_wrap(vars(word_type))
```

## Statistics

### Summary statistics

So we've been using `mean()` and `sd()` already, but let's think some more about what they actually represent and why they're important.

```{r}
rt_data <- read_csv("http://web.mit.edu/psycholinglab/www/data/in_lab_rts_2020.csv")

rt_data <- rt_data %>%
  filter(rt < 10, rt > 0.01)
```

```{r, fig.height=7}
ggplot(rt_data) +
  geom_density(aes(x = rt), fill = "blue", alpha = 0.2) +
  facet_wrap(~subject)
```

You can see that even in this tiny sample there is quite a bit of variability in how RTs are distributed. And you can sort of see that if you were to try to characterize these distributions, two big pieces of information would be central tendency and spread (there are also things like skew and kurtosis, but they are much less used).

The arithmetic mean and the median tell us about the central tendency of the data. What about the spread of the data? A first thing that might come to mind is the mean distance from the mean. Let's try this out by generating a bunch of random numbers in R from a normal distribution using `rnorm()`.

```{r}
random_nums <- rnorm(10000)
qplot(random_nums, geom = "histogram", bins = 60)
```

We see our typical Gaussian/bell curve: most numbers are near zero, and they thin out as you get farther away from zero. We see the mean is zero, or pretty close:

```{r}
mean(random_nums)
```

Let's look at the mean absolute difference from the mean:

```{r}
mean(abs(random_nums - mean(random_nums)))
```

The absolute value throws away the information about whether a point is to the left or to the right of the mean. This quantity is called the mean absolute deviation. But it turns out that a similar measure has much nicer mathematical properties down the line. Let's instead look at the average SQUARED distance to the mean.

```{r}
sum((random_nums - mean(random_nums)) ^ 2) / length(random_nums)
```

This is called the VARIANCE. The square root of the variance is called the STANDARD DEVIATION, or SD, and it has a lot of nice properties.
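Why are squared distances "nicer" mathematically? Here is one quick illustration (a sketch added for this write-up, not part of the original lab): for two independent variables, the average squared distances from the mean simply add, while the mean absolute deviations do not.

```{r}
a <- rnorm(10000, 0, 2)   # normal with sigma = 2, so variance 4
b <- rnorm(10000, 0, 3)   # normal with sigma = 3, so variance 9

# average squared distance from the mean for a plus that for b,
# compared to that of their sum: both are close to 13
sum((a - mean(a)) ^ 2) / length(a) + sum((b - mean(b)) ^ 2) / length(b)
sum(((a + b) - mean(a + b)) ^ 2) / length(a)

# the mean absolute deviation does not add up like this:
mean(abs(a - mean(a))) + mean(abs(b - mean(b)))   # about 4.0
mean(abs((a + b) - mean(a + b)))                  # about 2.9
```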
One nice property you can check directly: for data drawn from a standard normal distribution, the SD is (close to) 1.

```{r}
sqrt(sum((random_nums - mean(random_nums)) ^ 2) / length(random_nums))
```

We can get the variance and standard deviation using `var()` and `sd()`:

```{r}
var(random_nums)
sd(random_nums)
```

Note that the numbers you get out of these functions are SLIGHTLY different from the ones you got out of the formulas above. We'll go into why that is later today. Here is the formula you should use instead; it gives you the same numbers as `var()` and `sd()`:

```{r}
sum((random_nums - mean(random_nums)) ^ 2) / (length(random_nums) - 1)
sqrt(sum((random_nums - mean(random_nums)) ^ 2) / (length(random_nums) - 1))
```

Now we have measures of central tendency and spread. Remember that we have data that look like this:

```{r, fig.height=7}
ggplot(rt_data) +
  geom_density(aes(x = rt), fill = "blue", alpha = 0.2) +
  facet_wrap(~subject)
```

For a high-variance subject, a point that is some distance from the mean might be equivalent to a point that is a much smaller distance from the mean for a low-variance subject. We might have similar effects here, just on different scales. We can use the mean and standard deviation to roughly put all the data on the same scale. This is called z-scoring.

### Writing functions

Before I show you how to compute z-scores, we'll make a slight detour to learn about writing functions in R. You've seen me use a lot of functions that are either part of base R or part of Hadley Wickham's tidyverse. But let's say you want to do something for which no one has yet written a function. What can you do? Make your own function!

Let's make a silly function, which takes an argument x and returns x + 1.

```{r}
add_1 <- function(x) {
  return(x + 1)
}

add_1(7)
```

You can put any series of expressions you want in a function.

```{r}
add_x_to_a <- function(x) {
  a <- c(1, 2, 3)
  return(a + x)
}

add_x_to_a(7)
```

Functions can take arguments of any type and return values of any type. Within functions (or independently) you can use control structures like an IF STATEMENT. This lets you do something IF some expression is TRUE (similar to `if_else()`).

```{r}
more_than_50 <- function(x) {
  if (x > 50) {
    return("x is greater than 50")
  }
}

more_than_50(100)
more_than_50(20)
```

We can add another clause to this statement, to do something else when a different condition holds:

```{r}
compared_to_50 <- function(x) {
  if (x > 50) {
    return("x is greater than 50")
  } else if (x < 50) {
    return("x is less than 50")
  }
}

compared_to_50(100)
compared_to_50(20)
```

We can chain together lots of if statements, and finish with an `else` that handles every remaining case...

```{r}
compared_to_50 <- function(x) {
  if (x > 50) {
    return("x is greater than 50")
  } else if (x < 50) {
    return("x is less than 50")
  } else {
    return("x is 50")
  }
}

compared_to_50(100)
compared_to_50(20)
compared_to_50(50)
```

### __Try it yourself...__

1. Write a function that takes a vector of numbers as an argument and returns that vector's mean and standard deviation. (Note that in R, functions can only return one object.)
2. Use your function to get the mean and sd of `random_nums`.

```{r, indent = " "}
summary_stats <- function(x) {
  mean_x <- mean(x)
  sd_x <- sd(x)
  stats <- c("mean" = mean_x, "sd" = sd_x)
  return(stats)
}

summary_stats(random_nums)
```

### Z-scoring

```{r}
z_score <- function(x) {
  return((x - mean(x)) / sd(x))
}
```

This says: first center your values around the mean, then divide by the SD to stretch or shrink your scale in a standardized way. You're normalizing your scale here so that the units are standard deviations: if your z-score is 1, that means the original datapoint was one standard deviation above the mean.
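For example (a toy illustration with made-up numbers, added here): two vectors on very different scales get exactly the same z-scores.

```{r}
# each value is expressed as "how many SDs above or below the mean"
z_score(c(1, 3, 5, 7, 9))
z_score(c(100, 300, 500, 700, 900))  # 100x the scale, identical z-scores
```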
The idea is that distances in terms of SD are comparable across groups that differ in mean and variance, in a way that raw values are not.

```{r}
rt_data <- rt_data %>%
  mutate(z_rt = z_score(rt))

mean(rt_data$rt)
mean(rt_data$z_rt)
sd(rt_data$rt)
sd(rt_data$z_rt)

ggplot(rt_data) +
  geom_histogram(aes(x = rt), bins = 60, color = "white")

ggplot(rt_data) +
  geom_histogram(aes(x = z_rt), bins = 60, color = "white")

rt_data %>%
  group_by(subject) %>%
  summarise(mean_rt = mean(rt),
            mean_z_rt = mean(z_rt))
```

You can see that the z-scores are centered on 0. Note that z-scoring is a linear transformation: it rescales the data but does not change the shape of the distribution, so it does not get rid of outliers.

In general, for the normal distribution we expect about 68% of the datapoints to be within 1 standard deviation of the mean. So if sd(x) = 10, we expect about 68% of the datapoints to be between mean - 10 and mean + 10. We expect about 95% of the data to be within 2 standard deviations of the mean, and about 99.7% of the data to be within 3 standard deviations of the mean. These numbers may seem strange and arbitrary right now, but they'll make more sense as we learn more about the Normal.

### The normal distribution

The normal distribution is parameterized by the mean and the standard deviation. When we talk about the parameters of the distribution we'll call these __$\mu$__ and __$\sigma$__, rather than mean and sd. These are the parameters you would use to generate a specific distribution (e.g., with `rnorm`).

```{r}
# generate 1000 random numbers from a normal distribution with mu 0 and sigma 1
random_nums <- rnorm(1000, 0, 1)
qplot(random_nums, geom = "histogram")
```

We'll use mean and sd to refer to the actual values computed from data that are the result of such a generative process.

```{r}
mean(random_nums)
sd(random_nums)
```

We see the mean is close to $\mu$ and the sd is close to $\sigma$, but they are not identical. The `mean()` function provides an ESTIMATE of $\mu$, and the `sd()` function provides an ESTIMATE of $\sigma$. (This is why we have to use the corrected sd formula, the one that divides by $n - 1$; otherwise it would not be a good estimate of $\sigma$.)

Let's look at what happens when we change the parameters of the Normal.

```{r}
qplot(rnorm(1000, 50, 1), geom = "histogram")
mean(rnorm(1000, 50, 1))

qplot(rnorm(1000, 0, 100), geom = "histogram")
sd(rnorm(1000, 0, 100))
```

The normal distribution represents a __generative process for data__. Here it is an R function with two parameters. Say we have lots of datapoints that we know were generated using this function. Now we want to know: what were the values of the two parameters, $\mu$ and $\sigma$, that were passed into the function to generate these data? We can estimate $\mu$ using `mean()`, and $\sigma$ using `sd()`.

In general, this is a powerful way to think about statistics, and about science in general: we know there is some function that generated the data in front of us, and we want to guess which parameters were passed into that function, or even what the function itself was. Recovering these things is the goal of all inference. In the case of normally distributed data, all you need to recover is these two numbers, and you will have a function that can simulate your data.

Let's see how this works with real data.

```{r}
rts <- read_csv("http://web.mit.edu/psycholinglab/www/data/rts.csv")
```

```{r}
ggplot(rts, aes(x = RTlexdec)) +
  geom_histogram()
```

This is somewhat normally distributed.
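As a rough check of that (a small addition, not part of the original lab analysis): for a roughly symmetric, normal-ish distribution, the mean and the median should be close to each other.

```{r}
# if RTlexdec is roughly symmetric, these two numbers should be close
mean(rts$RTlexdec)
median(rts$RTlexdec)
```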
Let's pretend the data were generated by `rnorm`, and try to figure out what the arguments to that function were.

```{r}
mean(rts$RTlexdec)  # this is an estimate for mu
sd(rts$RTlexdec)    # this is an estimate for sigma
```

Let's now draw a bunch of samples from the normal distribution with those parameters, and see if it matches our real data.

```{r}
fake_rts <- rnorm(nrow(rts), mean(rts$RTlexdec), sd(rts$RTlexdec))
qplot(fake_rts, geom = "histogram")
```

A QQ-plot plots the quantiles of 2 distributions against one another and allows a direct comparison. If they are the same distribution, the points will lie along the line $y = x$. By default, `geom_qq()` compares the sample you give it to the standard normal distribution.

```{r, fig.height = 5}
ggplot(rts, aes(sample = RTlexdec)) +
  geom_qq() +
  geom_qq_line()
```

```{r, fig.height = 5}
ggplot(rts, aes(sample = fake_rts)) +
  geom_qq() +
  geom_qq_line()
```

### Correlation

Going back to our scatterplots of RTs and word lengths: there was some evidence of a relationship that we could see, but what if we want to quantify it? The easiest tool for that is the correlation coefficient.

* The correlation coefficient measures the strength of linear association between two variables.
* It is always between -1 and 1, inclusive, where
    - -1 indicates a perfect negative relationship
    - 0 indicates no relationship
    - +1 indicates a perfect positive relationship

Pearson's correlation coefficient:

$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \overline{x}}{s_x}\right) \times \left(\frac{y_i - \overline{y}}{s_y}\right)$

Note that what's being multiplied is the z-scored x and y.

```{r}
rt_data <- rt_data %>%
  mutate(word_length = str_length(word))

word_length_scaled <- (rt_data$word_length - mean(rt_data$word_length)) / sd(rt_data$word_length)
rt_scaled <- (rt_data$rt - mean(rt_data$rt)) / sd(rt_data$rt)

pearson_r1 <- sum(word_length_scaled * rt_scaled) / (nrow(rt_data) - 1)
pearson_r2 <- cor(rt_data$word_length, rt_data$rt)
```

Okay, but what does it mean to multiply all the values together like that? Here is an example of uncorrelated variables.

```{r}
x <- rnorm(1000)
y <- rnorm(1000)
xy <- tibble(x, y)

ggplot(xy, aes(x, y)) +
  geom_point(shape = 1) +
  coord_equal()

sum(scale(x) * scale(y)) / (length(x) - 1)
cor(x, y)
```

On the other hand:

```{r}
z <- x
xz <- tibble(x, z)

ggplot(xz, aes(x, z)) +
  geom_point(shape = 1) +
  coord_equal()

sum(scale(x) * scale(z)) / (length(x) - 1)
```

To get a better sense of what different correlation coefficients represent, check out the awesome visualization by Kristoffer Magnusson: http://rpsychologist.com/d3/correlation/.

* try different correlations and see the effect of moving outliers
* try different sample sizes and see the effect of outliers

To sharpen your intuitions about correlation (read: have fun with stats!), play Guess the Correlation: http://guessthecorrelation.com/.
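To tie this back to the Lab 1 question about real vs. fake words, here is one possible follow-up (a sketch; it assumes the `rt_data` and `real_words` objects from the earlier chunks are still in your workspace): compute the word length/RT correlation separately for each word type.

```{r}
# Pearson's r between word length and RT, separately for real and fake words
rt_data %>%
  mutate(word_type = if_else(word %in% real_words, "Real", "Fake")) %>%
  group_by(word_type) %>%
  summarise(r = cor(word_length, rt))
```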
## Data wrangling (`tidyr`)

```{r create_files, include=FALSE, eval=FALSE}
library(languageR)

beginningReaders %>%
  select(Word, OrthLength, LogFrequency, LogFamilySize, ProportionOfErrors) %>%
  gather(key = metric, value = v,
         c(OrthLength, LogFrequency, LogFamilySize, ProportionOfErrors)) %>%
  distinct() %>%
  write_csv("word_data.csv")

beginningReaders %>%
  select(Subject, ReadingScore) %>%
  distinct() %>%
  write_csv("subject_data.csv")

beginningReaders %>%
  select(Word, Subject, Trial, LogRT) %>%
  unite(col = trial_rt, c(Trial, LogRT), sep = "t") %>%
  spread(key = Word, value = trial_rt) %>%
  write_csv("logRT_data.csv")
```

So far we've mostly dealt with data that's neatly organized into a format that is easy for us to manipulate with things like `filter()` and `mutate()`. In general, this kind of tidy format makes things easier: one row for every trial for each subject. Often data will not be organized so neatly from the get-go, and you will have to do some serious data wrangling to get it into the format you want. In particular, you'll see that Amazon Mechanical Turk output is often in wide format. For these purposes you will need to know functions like `pivot_longer()`, `pivot_wider()`, and `separate()`.

First let's set up our workspace and read in some data files. These contain data from a study of lexical decision latencies in beginning readers (8-year-old Dutch children). This is available as the `beginningReaders` dataset from `languageR`, but here I've split it up into separate datasets so that we can see how we could put it back together.

```{r}
data_source <- "http://web.mit.edu/psycholinglab/www/data/"

word_data <- read_csv(file.path(data_source, "word_data.csv"))
subject_data <- read_csv(file.path(data_source, "subject_data.csv"))
logRT_data <- read_csv(file.path(data_source, "logRT_data.csv"))
```

Let's look at these data:

```{r}
word_data
```

Here we see that we have a list of words and a series of metrics for each word, each with an associated value.

```{r}
subject_data
```

Here we have a reading score for each subject.

```{r}
logRT_data
```

Here we have 1 column for subject and then a column for each word they saw. The cells contain a weird value or NA. The weird value is a combination of the trial number and the log response time (separated by the "t"). So ultimately what we want is 1 big dataset with 1 subject column, 1 trial number column, 1 word column, and then 1 column for each of the remaining variables: LogRT, ReadingScore, OrthLength, LogFrequency, etc.

### Tidying

So let's start by making each individual dataframe tidier.

We want `word_data` to go from 1 "metric" column to a column for each of the 4 metrics. So you can think of this as making the data *wider*.

```{r}
word_data_wide <- word_data %>%
  pivot_wider(names_from = metric, values_from = v)
```

`subject_data` is very simple, so we can leave it for now.

For `logRT_data`, we want to do the opposite: we want a column for words and columns for trial numbers and logRTs. So we want to make the data *longer*, turning all the individual word columns into 2 columns: the word and the weird value that's currently in the cells.

```{r}
logRT_data_long <- logRT_data %>%
  pivot_longer(names_to = "Word", values_to = "weird_value", cols = -Subject)
```

Okay, now we want to *separate* this weird value into 2 columns, Trial and LogRT.

```{r}
logRT_data_tidy <- logRT_data_long %>%
  separate(col = weird_value, into = c("Trial", "LogRT"), sep = "t") %>%
  mutate(LogRT = as.numeric(LogRT))
```

### Joins

So now we're ready to join all 3 of these dataframes together.
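Before we join the real data, here is a toy sketch of how `left_join()` matches rows on a key column (the mini data frames below are made up purely for illustration):

```{r}
toy_scores <- tibble(Subject = c("S1", "S2", "S3"),
                     ReadingScore = c(45, 52, 38))
toy_trials <- tibble(Subject = c("S1", "S1", "S2"),
                     LogRT = c(6.8, 7.1, 6.5))

# every row of toy_trials is kept and gets its subject's score copied in;
# S3 is dropped because it has no rows in toy_trials
left_join(toy_trials, toy_scores, by = "Subject")
```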
First, let's combine the logRTs with the subjects' reading scores. We want each subject's reading score repeated across all of that subject's rows, because the reading score is the same no matter which trial it is.

```{r}
subject_plus_logRT_data <- left_join(x = logRT_data_tidy, y = subject_data,
                                     by = "Subject")
subject_plus_logRT_data
```

Note that there is only 1 "Subject" column in the resulting data file.

Now let's combine this new dataframe with `word_data`. Here we expect repetition within a word, since a word's LogFrequency (for example) will not differ depending on which subject is reading it.

```{r}
all_data <- right_join(x = subject_plus_logRT_data, y = word_data_wide,
                       by = "Word")
all_data
```

Note there are some NA values because some words weren't seen by all of the subjects. We can get rid of those. But be very careful with this: a lot of NAs in your data where you weren't expecting them are often a sign that you made an error somewhere in your data wrangling.

```{r}
all_data <- na.omit(all_data)
```

Now if you want, you can compare this to the first 8 columns of the `beginningReaders` dataset from `languageR`.

So now that we have some data in a tidy form, we can start looking for patterns...

```{r}
all_data %>%
  group_by(ReadingScore) %>%
  summarise(mean_logRT = mean(LogRT)) %>%
  ggplot(aes(x = ReadingScore, y = mean_logRT)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm")
```

So it looks like there is a relationship between subjects' reading scores and RTs.

```{r}
all_data %>%
  group_by(LogFamilySize) %>%
  summarise(mean_logRT = mean(LogRT)) %>%
  ggplot(aes(x = LogFamilySize, y = mean_logRT)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm")
```

### Creating data frames

There are several different ways to create dataframes from scratch. `tibble()` does it by column, while `tribble()` does it by row.

```{r}
tibble(subject = c("Alice", "Bob", "Charlie"),
       height = c(66, 63, 72))

tribble(~subject, ~height,
        "Alice", 66,
        "Bob", 63,
        "Charlie", 72)
```

### __Try it yourself...__

1. Read in the data we collected in lab on the first day (`in_lab_rts_2020.csv`) as `rt_data`.
2. Make a new dataset `rt_data_wide`, which (a) doesn't contain the "time" column and (b) has one column for each word. HINT: you may want to first `unite()` 2 columns.
3. Now take `rt_data_wide` and make a new dataset, `rt_data_long`, which has a column for word and a column for trial.
4. Create a new dataframe `word_types` that has a row for every word in `rt_data` and columns for the word and for whether the word is real or fake.
5. Join `rt_data` and `word_types` together into `rt_data_coded`.
6. How many real and fake words were there in the experiment? How many real and fake trials were there?
```{r, indent = " ", eval=FALSE, echo=FALSE}
rt_data <- read_csv(file.path(data_source, "in_lab_rts_2020.csv"))

rt_data_wide <- rt_data %>%
  select(-time) %>%
  unite(col = trialinfo, c(trial, word), sep = "_") %>%
  pivot_wider(names_from = trialinfo, values_from = rt)

rt_data_long <- rt_data_wide %>%
  pivot_longer(names_to = "trialinfo", values_to = "rt", cols = -subject) %>%
  separate(col = trialinfo, into = c("trial", "word"), sep = "_")

word_types <- tribble(
  ~word, ~word_type,
  "book", "real",
  "noosin", "fake",
  "eat", "real",
  "goamboozle", "fake",
  "condition", "real",
  "xdqww", "fake",
  "word", "real",
  "retire", "real",
  "feffer", "fake",
  "fly", "real",
  "qqqwqw", "fake",
  "coat", "real",
  "condensationatee", "fake",
  "sporm", "fake",
  "art", "real",
  "goam", "fake",
  "gold", "real",
  "three", "real",
  "seefer", "fake",
  "sqw", "fake",
  "encyclopedia", "real",
  "understandable", "real"
)

rt_data_coded <- left_join(rt_data, word_types)

rt_data_coded %>%
  group_by(word_type) %>%
  summarise(n_trials = n())

rt_data_coded %>%
  select(word, word_type) %>%
  distinct() %>%
  group_by(word_type) %>%
  summarise(n_words = n())
```