--- title: "Lab 3" --- ```{r setup, include=FALSE, cache=FALSE} source("_lab_setup.R") ``` ## Lab 2 review 1. Read in the data we collected in lab on the first day (`in_lab_rts_2020.csv`) as `rt_data`. 1. Make a new dataset `rt_data_wide`, which a) doesn't contain the "time" column and b) in which there's one column for each word. 1. Now take `rt_data_wide` and make a new dataset, `rt_data_long`, which has a column for word and a column for trial. HINT: the argument `names_sep` might be helpful. 1. Create a new dataframe `word_types` that has a row for every word in `rt_data` and columns for the word and for indicating whether the word is real or fake. 1. Join `rt_data_long` and `word_types` together into `rt_data_coded`. 1. How many real and fake words were there in the experiment? How many real and fake trials were there? ```{r, indent = " "} library(tidyverse) data_source <- "http://web.mit.edu/psycholinglab/www/data/" rt_data <- read_csv(file.path(data_source, "in_lab_rts_2020.csv")) rt_data_wide <- rt_data %>% select(-time) %>% pivot_wider(names_from = c(trial, word), values_from = rt) rt_data_long <- rt_data_wide %>% pivot_longer(names_to = c("trial", "word"), names_sep = "_", values_to = "rt", cols = -subject) words <- unique(rt_data$word) word_types <- tribble( ~word, ~word_type, "book", "real", "noosin", "fake", "eat", "real", "goamboozle", "fake", "condition", "real", "xdqww", "fake", "word", "real", "retire", "real", "feffer", "fake", "fly", "real", "qqqwqw", "fake", "coat", "real", "condensationatee", "fake", "sporm", "fake", "art", "real", "goam", "fake", "gold", "real", "three", "real", "seefer", "fake", "sqw", "fake", "encyclopedia", "real", "understandable", "real" ) rt_data_coded <- left_join(rt_data_long, word_types) rt_data_coded %>% group_by(word_type) %>% summarise(n_trials = n()) rt_data_coded %>% distinct(word, word_type) %>% group_by(word_type) %>% summarise(n_words = n()) ``` ## Normal distribution ### Normal distribution review ```{r, echo=FALSE} knitr::include_graphics("images/normal.gif") ``` ```{r, message=FALSE} data_source <- "http://web.mit.edu/psycholinglab/www/data" rts <- read_csv(file.path(data_source, "rts.csv")) rts <- rts %>% mutate(RTlexdec_z = (RTlexdec - mean(RTlexdec)) / sd(RTlexdec)) %>% arrange(RTlexdec_z) ggplot(rts, aes(x = RTlexdec_z)) + geom_histogram(color = "white", bins = 60) ``` If we order all the z-scores from smallest to biggest, what z-score value do we expect 97.5^th^ percentile. ```{r} rts$RTlexdec_z[0.975 * nrow(rts)] qnorm(0.975) ``` And conversely we can get the proportion of values below a particular score: ```{r} sum(rts$RTlexdec_z < 1.96) / nrow(rts) pnorm(1.96) ``` R gives us this function `pnorm()`, which says the probability that a number drawn from the standard normal distribution is this value or lower. ### __Try it yourself...__ What is the probability that a number drawn from a standard normal is smaller than 3 and larger than -3? ```{r} pnorm(3) - pnorm(-3) ``` ### Central Limit Theorem The amazing thing about the normal distribution is how often it comes up. Possibly the most fundamental concept in statistics is the Central Limit Theorem and it explains why we see normal distributions everywhere when we are sampling and summarizing data. The Central Limit Theorem tells us the following: Given any distribution of some data (including very non-normal distributions), if you take samples of several data points from that distribution and get __means of those samples__, those means will be __normally distributed__. This is also true of sds, sums, medians, etc. In addition, the larger the number of data points in your samples, the more the distribution of sample means will be more tightly centered around $\mu$. Let's look at an example to see why this is true. ```{r} ggplot(rts, aes(x = NounFrequency)) + geom_histogram() mean(rts$NounFrequency) ``` Let's draw a bunch of subsamples of the data, and take the mean of each subsample. Then we can look at the distribution over means. ```{r} mean(sample(rts$NounFrequency, length(rts$NounFrequency), replace = TRUE)) replicate(10, mean(sample(rts$NounFrequency, length(rts$NounFrequency), replace = TRUE))) sample_mean <- function(values, sample_size) { samples <- sample(values, sample_size, replace = TRUE) return(mean(samples)) } sample_means <- function(values, num_samples, sample_size) { replicate(num_samples, sample_mean(values, sample_size)) } ``` I'm estimating what the distribution of NounFrequency sample means looks like by looking at some number `num_samples` of samples of size `size_of_sample` drawn with replacement from my dataset. Now let's look at a few different distributions to get a sense of how the estimation is affected by the size of the samples and the amount of resampling. ```{r} qplot(sample_means(rts$NounFrequency, num_samples = 10, sample_size = 3), geom = "histogram") qplot(sample_means(rts$NounFrequency, num_samples = 100, sample_size = 3), geom = "histogram") qplot(sample_means(rts$NounFrequency, num_samples = 1000, sample_size = 3), geom = "histogram") qplot(sample_means(rts$NounFrequency, num_samples = 100, sample_size = 100), geom = "histogram") qplot(sample_means(rts$NounFrequency, num_samples = 1000, sample_size = 100), geom = "histogram") qplot(sample_means(rts$NounFrequency, num_samples = 1000, sample_size = 1000), geom = "histogram") ``` ```{r, fig.height = 5} ggplot(NULL, aes(sample = sample_means(rts$NounFrequency, 1000, 3))) + geom_qq() + geom_qq_line() ggplot(NULL, aes(sample = sample_means(rts$NounFrequency, 100, 1000))) + geom_qq() + geom_qq_line() ``` So we can think of data collection during experiments as samples from that underlying distribution. Our goal is to find out its parameters, usually we're most interested in the location (e.g. mean) but not always. You can see that the size of our sample is going to have an effect on how close we are to the true mean. When we "collected data" from 3 participants we got some pretty out there sample means and so our estimate was way off. But when we "collected" larger samples we were approximating the true mean much more closely. Try out different distributions and statistics: http://onlinestatbook.com/stat_sim/sampling_dist/. ## Parameter estimation ### Confidence intervals Even if what we really care about is the mean of a sample (for example, what is the average height of humans?), we still need to know how spread out that distribution of the sample means is. In particular we want to know how good our estimate of $\mu$ is. Are we in this case? ```{r} means_100 <- sample_means(rts$NounFrequency, 100, 100) qplot(means_100, geom = "histogram") + lims(x = c(200, 1200)) sd(means_100) ``` Or this case? ```{r} means_10000 <- sample_means(rts$NounFrequency, 100, 10000) qplot(means_10000, geom = "histogram") + lims(x = c(200, 1200)) sd(means_10000) ``` As you can see, the spread of the distribution of sample means is much smaller for the second case and the estimate of the mean is closer to the true mean. So the standard deviation of the distribution of sample means, $\sigma_x$, is inversely related to the size of the sample. When you run an experiment, you can get the standard deviation of that sample, $s$, but you don't know $\sigma_x$, unless you run say 100 experiments. The best you can do with one experiment is to estimate $\sigma_x$ with the *standard error*, $SE = \frac{s}{\sqrt{n}}$. This is going to vary from sample to sample and this extra level of uncertainty is why we use the t-distribution for most inferential stats because when you use a small sample size, the estimates are further off. So the t-distribution is introduced as a correction to the normal that places more weight on the tails of the distribution. We'll come back to this. If you are getting confused between $s$ and $\sigma$ that's because it is a bit confusing. In practice when you're calculating the standard error of the data in your experiment, you will take the `sd()` of your sample values and divide it by the `sqrt()` of your sample size, n. Or you can bootstrap (see below). Let's say we collected some data d (note: when you want to be able to generate random numbers but then reproduce the calculations exactly you can use `set.seed(x)` where `x` is a number of your choosing): ```{r} set.seed(959) d <- tibble(value = rnorm(100, mean = 20, sd = 5)) length(d$value) mean(d$value) std_error <- function(x) sd(x) / sqrt(length(x)) std_error(d$value) ggplot(d) + geom_histogram(aes(x = value), color = "white") + geom_vline(xintercept = 20, linetype = "dotted") ``` Since $SE$ is an estimate of the SD of the distribution of the sample means, you can use it to identify the amount of data that fall between the mean and the mean + 1 SE or the mean - 1 SE. If you created an interval centered around the sample mean with 1 SE on each side, 68% of the data should fall in that interval, and about 95% lies between mean - 2 SE and mean + 2 SE. This is what people call a 95% Confidence Interval. ```{r} ci_upper <- mean(d$value) + qnorm(0.975) * std_error(d$value) ci_upper ci_lower <- mean(d$value) - qnorm(0.975) * std_error(d$value) ci_lower ``` So the interpretation is that we can have 95% *confidence* that the true mean of `d` lies between `r round(ci_lower, 2)` and `r round(ci_upper, 2)`. We can compare this to what we would get if we just resampled from data `d` and got a distribution of sample means and looked at the interval centered at the mean that would contain 95% of the values. ```{r} d_means <- tibble(mean_value = sample_means(d$value, 1000, nrow(d))) mean(d$value) + qnorm(0.975) * sd(d_means$mean_value) ci_upper_boot <- quantile(d_means$mean_value, 0.975) ci_upper_boot mean(d$value) + qnorm(0.975) * sd(d_means$mean_value) ci_lower_boot <- quantile(d_means$mean_value, 0.025) ci_lower_boot ggplot(d_means) + geom_histogram(aes(x = mean_value), color = "white") + geom_vline(xintercept = ci_upper_boot, color = "red") + geom_vline(xintercept = ci_lower_boot, color = "red") + geom_vline(xintercept = 20, linetype = "dotted") ``` ```{r} ggplot(d_means) + geom_histogram(aes(x = mean_value), color = "white") + geom_vline(xintercept = ci_upper_boot, color = "red") + geom_vline(xintercept = ci_lower_boot, color = "red") + geom_vline(xintercept = ci_upper, color = "blue") + geom_vline(xintercept = ci_lower, color = "blue") + geom_vline(xintercept = 20, linetype = "dotted") ``` So the way to interpret these confidence intervals is that you can think of a researcher running experiments over and over to get at the true mean of some distribution of response times, let's say. Every time the researcher runs an experiment, i.e. takes a sample form the distribution, she gets a mean and an SE and computes a 95% CI. *In the long run*, 95% of those CIs will contain the true mean of the population. See also: http://rpsychologist.com/d3/CI/ http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf ### Error bars I mentioned before when we were talking about data visualization, that often we want to provide some information about the spread of the data as well as the average. Now that we know how to compute CIs we can add them to our plots in the form of error bars. ```{r} category_rts <- rts %>% group_by(WordCategory) %>% summarise(mean_rt = mean(RTlexdec), se_rt = sd(RTlexdec) / sqrt(n())) ggplot(category_rts, aes(x = WordCategory, y = mean_rt)) + geom_point() ggplot(category_rts, aes(x = WordCategory, y = mean_rt)) + geom_pointrange(aes(ymin = mean_rt - se_rt * 1.96, ymax = mean_rt + se_rt * 1.96)) ``` ### __Try it yourself...__ - Using the in-class RT data, make a plot with the mean RT for each subject and 95% CI error bars. ```{r, indent=" "} in_lab_rts <- read_csv(file.path(data_source, "in_lab_rts_2020.csv")) subject_rts <- in_lab_rts %>% group_by(subject) %>% summarise(mean_rt = mean(rt), se_rt = sd(rt) / sqrt(n())) ggplot(subject_rts, aes(x = subject, y = mean_rt)) + geom_pointrange(aes(ymin = mean_rt - se_rt * 1.96, ymax = mean_rt + se_rt * 1.96)) ``` ### The t-distribution We used a fairly large sample size in the previous example. What if we have a smaller sample size? ```{r} set.seed(43) d_5 <- tibble(value = rnorm(5, mean = 20, sd = 5)) length(d_5$value) mean(d_5$value) std_error(d_5$value) ggplot(d_5) + geom_histogram(aes(x = value), color = "white") + geom_vline(xintercept = 20, linetype = "dotted") ci_upper_5 <- mean(d_5$value) + qnorm(0.975) * std_error(d_5$value) ci_upper_5 ci_lower_5 <- mean(d_5$value) - qnorm(0.975) * std_error(d_5$value) ci_lower_5 ``` ```{r} d_5_means <- tibble(mean_value = sample_means(d_5$value, 1000, nrow(d_5))) mean(d_5$value) + qnorm(0.975) * sd(d_5_means$mean_value) ci_upper_boot_5 <- quantile(d_5$value, 0.975) ci_upper_boot_5 mean(d_5$value) - qnorm(0.975) * sd(d_5_means$mean_value) ci_lower_boot_5 <- quantile(d_5$value, 0.025) ci_lower_boot_5 ggplot(d_5_means) + geom_histogram(aes(x = mean_value), color = "white", binwidth = 0.1) + geom_vline(xintercept = ci_upper_boot_5, color = "red") + geom_vline(xintercept = ci_lower_boot_5, color = "red") + geom_vline(xintercept = 20, linetype = "dotted") ``` ```{r} ggplot(d_5_means) + geom_histogram(aes(x = mean_value), color = "white", binwidth = 0.1) + geom_vline(xintercept = ci_upper_boot_5, color = "red") + geom_vline(xintercept = ci_lower_boot_5, color = "red") + geom_vline(xintercept = ci_upper_5, color = "blue") + geom_vline(xintercept = ci_lower_5, color = "blue") + geom_vline(xintercept = 20, linetype = "dotted") ``` When you use a small sample size, the estimates are further off. The sd is going to vary more with each sample, so the t-distribution is introduced as a correction to the normal that places more weight on the tails of the distribution. This t-distribution is formally defined by the degrees of freedom (sample size minus 1). ```{r} t_ci_upper_5 <- mean(d_5$value) + qt(0.975, df = 4) * std_error(d_5$value) t_ci_upper_5 t_ci_lower_5 <- mean(d_5$value) - qt(0.975, df = 4) * std_error(d_5$value) t_ci_lower_5 ``` ```{r} ggplot(d_5_means) + geom_histogram(aes(x = mean_value), color = "white") + geom_vline(xintercept = ci_upper_boot_5, color = "red") + geom_vline(xintercept = ci_lower_boot_5, color = "red") + geom_vline(xintercept = ci_upper_5, color = "blue") + geom_vline(xintercept = ci_lower_5, color = "blue") + geom_vline(xintercept = t_ci_upper_5, color = "blue", linetype = "dotted") + geom_vline(xintercept = t_ci_lower_5, color = "blue", linetype = "dotted") ``` You can see that the interval is wider which is good because it reflects our greater uncertainty about the estimates. The t-distribution becomes more like the normal distribution in shape as sample size increases (see http://rpsychologist.com/d3/tdist/ for a visualization). R has a series of functions for the t-distribution, same as for the normal: ```{r} ggplot(tibble(value = rt(1000, df = 9)), aes(x = value)) + geom_histogram() ``` We get the probability of values of x or lower under a t-distribution using `pt`: ```{r} pt(2, df = 9) ``` We can get the value such that 97.5% of values drawn from a t-distribution are less than that value -- the quantile -- with `qt`: ```{r} qt(0.975, df = 9) ``` ### Bootstrapping We're interested in the weight ratings of two different groups of nouns, animals and plants, and in whether the rating of the two groups are significantly different. We can look at the distribution of each group's ratings. ```{r} library(languageR) ggplot(weightRatings, aes(x = Rating)) + facet_wrap(~Class) + geom_histogram() ``` They look pretty different, but how to do we actually compare them? A natural starting point is the mean of the two groups. ```{r} weightRatings %>% group_by(Class) %>% summarise(mean_rating = mean(Rating)) ``` This doesn't let us conclude anything -- how far apart do the means have to be to count as same/different? If we took another sample from the same population, how far away would its mean be? So we want some way of knowing how good our sample mean is at estimating the actual true population mean. If we could take a lot of samples from the underlying population, we could get a pretty good idea. But all we have to work with is the one sample we collected. How can we use that one sample to simulate having many samples? One way to do that is to resample it with replacement and take the mean (or whatever statistic of interest) of the resample. Repeat this many times to get a distribution of resample means (the sampling distribution of the statistic), and use that distribution to derive a confidence interval. Let's try this for one group: ```{r} animals <- weightRatings %>% filter(Class == "animal") mean(animals$Rating) resample <- sample(animals$Rating, replace = TRUE) mean(resample) resample_means <- replicate(1000, mean(sample(animals$Rating, replace = TRUE))) animal_means <- tibble(mean_rating = resample_means) animal_lower <- quantile(resample_means, 0.025) animal_upper <- quantile(resample_means, 0.975) ggplot(animal_means, aes(x = mean_rating)) + geom_histogram(colour = "white") + geom_vline(xintercept = animal_lower, linetype = "dotted") + geom_vline(xintercept = animal_upper, linetype = "dotted") ``` Now let's set this up in a more general way: ```{r} resample_mean <- function(values) { samples <- sample(values, replace = TRUE) return(mean(samples)) } resample_means <- function(values, num_samples) { replicate(num_samples, resample_mean(values)) } ci_lower <- function(values) quantile(values, 0.025) ci_upper <- function(values) quantile(values, 0.975) ``` What are the CIs of our data? ```{r} weight_summary <- weightRatings %>% group_by(Class) %>% summarise(mean_rating = mean(Rating), ci_lower_mean = ci_lower(resample_means(Rating, 1000)), ci_upper_mean = ci_upper(resample_means(Rating, 1000))) weight_summary ``` ```{r} ggplot(weight_summary, aes(x = Class)) + geom_pointrange(aes(y = mean_rating, ymin = ci_lower_mean, ymax = ci_upper_mean)) ```