---
title: "Lab 3"
---

```{r setup, include=FALSE, cache=FALSE}
source("_lab_setup.R")
```


```{r create_files, include=FALSE, eval=FALSE}
library(languageR)

beginningReaders %>%
  select(Word, OrthLength, LogFrequency, LogFamilySize, ProportionOfErrors) %>%
  gather(key = metric, value = v,
         c(OrthLength, LogFrequency, LogFamilySize, ProportionOfErrors)) %>%
  distinct() %>%
  write_csv("word_data.csv")

beginningReaders %>%
  select(Subject, ReadingScore) %>%
  distinct() %>%
  write_csv("subject_data.csv")

beginningReaders %>%
  select(Word, Subject, Trial, LogRT) %>%
  unite(col = trial_rt, c(Trial, LogRT), sep = "t") %>%
  spread(key = Word, value = trial_rt) %>%
  write_csv("logRT_data.csv")
```

## Data wrangling (`tidyr`)

So far we've mostly dealt with data that's neatly organized into a format that is easy for us to manipulate with things like `filter()` and `mutate()`. In general, long format makes things easier: one row for every trial for each subject.Often data will not be organized so neatly from the get-go and you will have to do some serious data wrangling to get it into the format you want. In particular, you'll see that Amazon Mechanical Turk output is often in wide format. For these purposes you will need to know functions like `spread()`, `gather()`, and `separate()`.

First let's set up our workspace and read in some data files. These contain data from a study of lexical decision latencies in beginning readers (8 year-old Dutch children). This is available as the `beginningReaders` dataset from `languageR` but here I've split it up into separate datasets so that we can see how we could put it back together. 


```{r}
library(tidyverse)

data_source <- "http://web.mit.edu/psycholinglab/data/"
word_data <- read_csv(file.path(data_source, "word_data.csv"))
subject_data <- read_csv(file.path(data_source, "subject_data.csv"))
logRT_data <- read_csv(file.path(data_source, "logRT_data.csv"))
```

Let's look at these data:
```{r}
glimpse(word_data)
```

Here we see that we have a list of words and a series of metrics for each word with an associated value. 

```{r}
glimpse(subject_data)
```

Here we have a reading score for each subject. 

```{r}
glimpse(logRT_data)
```

Here we have 1 column for subject and then a column for each word they saw. The cells contain a weird value or NA. The weird value is a combination of the trial number and log response time (separated by the "t").

So ultimately what we want is 1 big dataset where we have 1 subject column, 1 trial number column, 1 word column and then 1 column for each of value, logRT, OrthLenght, LogFrequency, etc...

### Tidying

So let's start by making each individual dataframe tidier.

We want `word_data` to go from 1 "metric" column to a column for each of the 4 metrics. So you can think of this as *spreading* out the data.

```{r}
word_data_wide <- word_data %>%
  spread(key = metric, value = v)
```

`subject_data` is very simple so we can leave it for now.

For `logRT_data`, we want to do the opposite: we want to have a column for words and for trial numbers and logRTs. So we want to *gather* the data from all the individual word columns into 2 columns the word and that weird value that's currently in the cells.

```{r}
logRT_data_long <- logRT_data %>%
  gather(key = Word, value = weird_value, -Subject)
```

Okay, now we want to *separate* this weird value into 2 columns, Trial and LogRT

```{r}
logRT_data_tidy <- logRT_data_long %>%
  separate(col = weird_value, into = c("Trial", "LogRT"), sep = "t") %>%
  mutate(LogRT = as.numeric(LogRT))
```

### Joins

So now we're ready to join all 3 of these dataframes together. 

First, let's combine the logRTs with the subjects' reading scores. So we want there to be the same reading score across multiple rows because the reading score is the same for a subject no matter which trial.

```{r}
subject_plus_logRT_data <- left_join(x = logRT_data_tidy,
                                     y = subject_data,
                                     by = "Subject")
glimpse(subject_plus_logRT_data)
```

Note that there is only 1 "Subject" column in the resulting data file.

Now let's combine this new dataframe with the word_data. Here we expect repetition within a word but not a subject since the words LogFrequency (for example) will not differ depending on which subject is reading it.

```{r}
all_data <- right_join(x = subject_plus_logRT_data,
                       y = word_data_wide,
                       by = "Word",
                       na.omit = TRUE)
glimpse(all_data)
```

Note there are some NA values because we have some words that weren't experienced by all the subjects. We can get rid of those. But be very careful with this. A lot of NAs in your data where you weren't expecting them are often a sign that you made an error somewhere in your data wrangling.

```{r}
all_data <- na.omit(all_data)
```

Now if you want you can compare this to the first 8 columns of the `beginningReaders` dataset from `languageR`.

So now that we have some data in a tidy form we can start looking for patterns...

```{r}
all_data %>%
  group_by(ReadingScore) %>%
  summarise(mean_logRT = mean(LogRT)) %>%
  ggplot(aes(x = ReadingScore, y = mean_logRT)) +
    geom_point(shape = 1) +
    geom_smooth(method = "lm")
```

So it looks like there is a relationship between subjects' reading scores and RTs

```{r}
all_data %>%
  group_by(LogFamilySize) %>%
  summarise(mean_logRT = mean(LogRT)) %>%
  ggplot(aes(x = LogFamilySize, y = mean_logRT)) +
    geom_point(shape = 1)  +
    geom_smooth(method = "lm")
```

### __Try it yourself...__

1. Read in the data we collected in lab on the first day (`in_lab_rts_2018.csv`).
2. Make a new dataset `rt_data_wide`, which a) doesn't contain the "time" column and b) in which there's one column for each word. 
HINT: you may want to first unite() 2 columns.
3. Now take `rt_data_wide` and make a new dataset, `rt_data_long`, which has a column for word and a column for trial.

```{r, indent = "    "}
rt_data <- read_csv(file.path(data_source, "in_lab_rts_2018.csv"))

rt_data_wide <- rt_data %>%
  select(-time) %>%
  unite(col = trialinfo, c(trial, word), sep = "_") %>%
  spread(key = trialinfo, value = rt)

rt_data_long <- rt_data_wide %>%
  gather(key = trialinfo, value = rt, -subject) %>%
  separate(col = trialinfo, into = c("trial", "word"), sep = "_")
```


## Statistics, continued

### Normal distribution review

```{r, message=FALSE}
rts <- read_csv(file.path(data_source, "rts.csv"))

rts <- rts %>%
  mutate(RTlexdec_z = (RTlexdec - mean(RTlexdec)) / sd(RTlexdec)) %>%
  arrange(RTlexdec_z)
```

If we order all the z-scores from smallest to biggest, what z-score value do we expect 97.5^th^ percentile.

```{r}
rts$RTlexdec_z[0.975 * nrow(rts)]
qnorm(0.975)
```

And conversely we can get the proportion of values below a particular score:

```{r}
sum(rts$RTlexdec_z < 1.96) / nrow(rts)
pnorm(1.96)
```

R gives us this function `pnorm()`, which says the probability that a number drawn from the standard normal distribution is this value or lower.

### __Try it yourself...__

What is the probability that a number drawn from a standard normal is smaller than 3 and larger than -3?

```{r}
pnorm(3) - pnorm(-3)
```

### Central Limit Theorem

The amazing thing about the normal distribution is how often it comes up. Possibly the most fundamental concept in statistics is the Central Limit Theorem and it explains why we see normal distributions everywhere when we are sampling and summarizing data.

The Central Limit Theorem tells us the following: Given any distribution of some data (including very non-normal distributions), if you take samples of several data points from that distribution and get __means of those samples__, those means will be __normally distributed__. This is also true of sds, sums, medians, etc. In addition, the larger the number of data points in your samples, the more the distribution of sample means will be more tightly centered around $\mu$. 

Let's look at an example to see why this is true.

```{r}
hist(rts$NounFrequency)
mean(rts$NounFrequency)
```

Let's draw a bunch of subsamples of the data, and take the mean of each subsample. Then we can look at the distribution over means.

```{r}
sample_mean <- function(sample_size) {
  resampled_rows <- sample_n(rts, sample_size, replace = TRUE)
  return(mean(resampled_rows$NounFrequency))
}

sample_means <- function(num_samples, sample_size) {
  replicate(num_samples, sample_mean(sample_size))
}
```

I'm estimating what the distribution of NounFrequency sample means looks like by looking at some number `num_samples` of samples of size `size_of_sample` drawn with replacement from my dataset.
 
Now let's look at a few different distributions to get a sense of how the estimation is affected by the size of the samples and the amount of resampling.

```{r}
hist(sample_means(10, 3))
hist(sample_means(100, 3))
hist(sample_means(1000, 3))
hist(sample_means(100, 100))
hist(sample_means(1000, 100))
hist(sample_means(1000, 1000))
```

```{r, fig.height = 5}
qqnorm(sample_means(1000, 3))
qqnorm(sample_means(100, 1000))
```

So we can think of data collection during experiments as samples from that underlying distribution. Our goal is to find out its parameters, usually we're most interested in the location (e.g. mean) but not always. You can see that the size of our sample is going to have an effect on how close we are to the true mean. When we "collected data" from 3 participants we got some pretty out there sample means and so our estimate was way off. But when we "collected" larger samples we were approximating the true mean much more closely.

Try out different distributions and statistics: http://onlinestatbook.com/stat_sim/sampling_dist/.