---
title: "Lab 1"
---

```{r setup, include=FALSE, cache=FALSE}
source("_lab_setup.R")
```

In this lab, we'll cover some of the basic ways of interacting with R and then pretty quickly switch to data wrangling and data visualization.

I recommend using RStudio, https://support.rstudio.com, because it comes with lots of great resources (e.g., Help --> Cheatsheets).

Another great resource for many of the functions we'll use: http://r4ds.had.co.nz/.

## Calculator

First of all, R can be used as a calculator:

```{r}
1 + 1
2 * 3
3 ^ 2
5 %% 3  # modulo: the remainder of 5 divided by 3
```

It's a good idea not to work directly in the console but to have a script where you first write your commands and then execute them in the console.

Once you open a new R script there are a few useful things to note:

+ Use `#` to comment lines so they don't get executed
+ You can send a line directly from the script to the console with `command + return` (on mac) or `ctrl + enter` (windows)

## Variables

Values can be stored as variables:

```{r}
a <- 2
a

b <- 2 + 2
b

c <- b + 3
c
```

When you're defining variables, try to use meaningful names like `average_of_ratings` rather than `variable1`.

## Functions

A function is an object that takes some arguments and returns a value.

```{r}
log(4)
print("hello world")
help(log)
?log
```

## Vectors

If you want to store more than one thing in a variable, you may want to make a vector. To do this you will use the function `c()`, which combines values:

```{r}
v1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
v1
```

HINT: If you are ever unsure about how a function works or what arguments it takes, typing `?[FUNCTION NAME GOES HERE]` will open up a help file (e.g., `?c`).

Now that we have some values stored in a vector, we may want to access those values. We can do this by using their position in the vector, or **index**.

```{r}
v1[1]         # the first element
v1[5]         # the fifth element
v1[-1]        # everything except the first element
v1[c(2, 7)]   # the second and seventh elements
v1[-c(2, 7)]  # everything except the second and seventh elements
v1[1:3]       # the first three elements
v1[-(1:3)]    # everything except the first three elements
```

You may also want to get some overall information about the vector:

+ what types of values are in `v1`?

    ```{r, indent = "    "}
    str(v1)
    summary(v1)
    ```

+ how long is `v1`?

    ```{r, indent = "    "}
    length(v1)
    ```

+ what is the average of all these values?

    ```{r, indent = "    "}
    mean(v1)
    ```

+ what is the standard deviation of all these values?

    ```{r, indent = "    "}
    sd(v1)
    ```

Earlier we created `v1` by simply listing all the elements in it and using `c()`, but if you have lots of values, this is very tedious. There are some functions that can help make vectors more efficiently:

```{r}
v2 <- (1:10)
v2

v3 <- rep(x = 1, times = 10)
v3

v4 <- rep(1:2, 5)
v4

v5 <- seq(from = 1, to = 20, by = 2)
v5
```

Note that for `v4`, I didn't include the names of the arguments, but R figures out which is which by their order.

You can also apply operations to all elements of the vector simultaneously.

```{r}
v1 + 1
v1 * 100
```

You can also do pair-wise operations on 2 vectors.

```{r}
v1 + v2
```

### Characters

So far we've looked at **numeric** variables and vectors, but variables and vectors can also hold **strings**:

```{r}
name <- "Rachel"
name

friends <- c("Rachel", "Ross", "Joey", "Monica", "Chandler", "Phoebe")
friends
str(friends)
```

You can even store numbers as strings (and sometimes data you load from a file will be stored this way, so watch out for that):

```{r}
some_numbers <- c("2", "3", "4")
```

...but you can't manipulate them as numbers:

```{r, eval = FALSE}
some_numbers + 1
```

So you might want to convert the strings into numbers first using `as.numeric()`:

```{r}
some_numbers <- as.numeric(some_numbers)
some_numbers + 1
```

### NA

Another important datatype is NA. Say I'm storing people's heights in inches in a vector, but I don't have data on the third person.

```{r}
heights <- c(72, 70, NA, 64)
str(heights)
```

Even though it's composed of letters, `NA` is not a string. In this case it's numeric, and it represents a missing or invalid value.
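To make that concrete, here is a small sketch in base R. The vector name `heights_demo` is made up for this illustration. It shows that a bare `NA` is logical, that it takes on the type of the vector it sits in, and that it is not the same thing as the string `"NA"`:

```{r}
# Hypothetical example vector, used only for this illustration
heights_demo <- c(72, 70, NA, 64)

class(NA)            # a bare NA is logical
class(heights_demo)  # inside a numeric vector, the whole vector stays numeric
is.na(heights_demo)  # TRUE only at the position of the missing value
is.character(NA)     # FALSE: NA is not the string "NA"
```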
You can still perform operations on the `heights` vector:

```{r}
heights + 1
heights * 2
```

If you had an NA in a vector of strings, its datatype would be character.

```{r}
friends <- c("Rachel", NA, "Joey", "Monica", "Chandler", "Phoebe")
str(friends)
```

If you have NA in your vector and want to use a function on it, this can complicate things:

```{r}
mean(heights)
```

To avoid returning NA, you may want to just throw out the NA values using `na.omit()` and work with what's left.

```{r}
heights_no_NA <- na.omit(heights)
mean(heights_no_NA)
```

Alternatively, many functions have a built-in argument `na.rm` that you can use to tell the function what to do about NA values. So you can do the previous step in 1 line, working directly on the original vector:

```{r}
mean(heights, na.rm = TRUE)
```

It can also be useful to know ahead of time if a vector contains NA and where those values are:

```{r}
is.na(heights)
which(is.na(heights))
```

### Booleans

This brings us to another important datatype: booleans. They are `TRUE` or `FALSE`, or `T` or `F` for short.

Here are some expressions that return boolean values:

```{r}
1 < 100
500 == 500            # for equality testing, use double-equals!
1 == 2 | 2 == 2       # OR
1 == 1 & 100 == 100   # AND
1 == 1 & 100 == 101   # AND
```

### __Try it yourself...__

1. Make a vector, "tens", of all the multiples of 10 up to 200.
2. Find the indices of the numbers divisible by 3.

    ```{r, indent = "    "}
    tens <- seq(from = 10, to = 200, by = 10)
    tens
    which(tens %% 3 == 0)
    ```

## Dataframes

Most data you will work with in R will be in a dataframe format. Dataframes are basically tables where each column is a vector with a name.

Dataframes can combine character vectors, numeric vectors, logical (boolean) vectors, etc. This is really useful for data from experiments, where you may want one column to contain the name of the condition (a string) and another column to contain response times (a number).

Let's read in some data!
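Before we do, here is a minimal sketch of the idea that a dataframe is a set of named vectors of the same length. It uses the base R function `data.frame()`. The name `toy_data`, the column names, and the values are all invented for illustration; they are not the lab data.

```{r}
# Made-up toy data: one character, one numeric, and one logical column
toy_data <- data.frame(
  condition = c("real", "fake", "real"),
  rt = c(0.85, 1.20, 0.92),
  correct = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE  # keep the strings as character vectors
)

toy_data
str(toy_data)  # each column keeps its own datatype
```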
**But first a digression...**

One of the best things about R is that it is open-source. Lots of R users who find that some functionality is missing from base R (which is what we've been using so far) write their own functions and then share them with the R community. Often they'll write whole **packages** of functions that greatly enhance the capabilities of base R.

In order for you to use those packages, they need to be installed on your computer and loaded in your current session. For current purposes, you will need the `tidyverse` package, which you can install with this simple command:

```{r, eval=FALSE}
install.packages("tidyverse")
```

When the installation is done, load the library of functions in the package with the following command:

```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(stringr)
```

Okay, digression over.

Let's read in your lexical decision data from earlier using a `tidyverse` function called `read_csv()`.

```{r}
data_source <- "http://web.mit.edu/psycholinglab/data/"
rt_data <- read_csv(file.path(data_source, "in_lab_rts_2018.csv"))
```

Our data is now stored as a dataframe.

The output message tells us what datatype `read_csv()` assigned to every column. It usually does a pretty good job of guessing the appropriate datatype, but on occasion you may have to correct it using a function like `as.numeric()` or `as.character()`.

Note that the path to the data can be any folder on your computer or online (a url). If you just put in the filename without the path, it will look for the file in the current working directory.

Also note that if you have column headers in your csv file, `read_csv()` will automatically name your columns accordingly and you won't have to specify `col_names=`.

At this point, it's a good idea to look at your data to make sure everything was loaded correctly. In RStudio, you can open up a viewing pane with the command `View(rt_data)` to see the data in spreadsheet form.
You can also use `summary()`, `str()` and `glimpse()`:

```{r}
glimpse(rt_data)
```

You can also extract just the names of the columns:

```{r}
names(rt_data)
```

Or just the first (or last) few rows of the dataframe:

```{r}
head(rt_data)
tail(rt_data)
```

Or look at the dimensions of your dataframe:

```{r}
dim(rt_data)
nrow(rt_data)
ncol(rt_data)
```

To access a specific column, row, or cell you can use **indexing** in much the same way you can with vectors (just now with 2 dimensions):

```{r}
rt_data[1, 2]                  # what's in the 1st row, 2nd column
rt_data[1, ]                   # the 1st row for all columns
rt_data[, 2]                   # all rows for the 2nd column
rt_data[, c(2, 5)]             # all rows for columns 2 and 5
rt_data[, c("subject", "rt")]  # all rows for the columns named subject and rt
```

Another easy way to extract a dataframe column is by using the `$` operator and the column name:

```{r}
head(rt_data$rt)
```

`rt_data$rt` is a (numeric) vector, so you can perform various operations on it (as we saw earlier):

```{r}
head(rt_data$rt * 2)
mean(rt_data$rt)
```

## Data manipulation (`dplyr`)

Often when you load data it's not yet in a convenient, "tidy" form. Data wrangling refers to the various cleaning and re-arranging steps between loading data and being able to visualize or analyze it. I'll start out by showing you a few of the most common things you might want to do with your data.

For example, in this dataset, we want to know how response times on the lexical decision task might differ depending on whether the item is a real word or a non-word, but this information is missing from our data.

Let's see what words were included in the experiment.

```{r}
words <- unique(rt_data$word)
words
```

Now let's make a vector containing only the real words:

```{r}
real_words <- words[c(1, 3, 5, 7, 8, 10, 12, 15, 17, 18, 21, 22)]
real_words
```

We can check if a value is present in a vector using the operator `%in%`.
```{r}
"cat" %in% c("cat", "dog", "horse")
c("cat", "ocelot") %in% c("cat", "dog", "horse")
```

### `mutate`

Now we can add a column to our dataframe, `rt_data`, that contains that condition information. We're going to do this using the `mutate()` function.

```{r}
rt_data <- mutate(rt_data, is_real = word %in% real_words)
head(rt_data)
```

`mutate()` is extremely useful anytime you want to add information to your data. For instance, the reaction times here appear to be in seconds, but maybe we want to look at them in milliseconds. Maybe we also want to code whether the word starts with the letter "b" and which words are longer than 6 letters. This can be done all at once.

```{r}
mutate(rt_data,
       rt_ms = rt * 1000,
       starts_b = str_sub(word, 1, 1) == "b",
       longer_than_6 = if_else(str_length(word) > 6, "long", "short"))
```

Note:

+ `str_sub()` extracts a subset of `word` starting at position 1 and ending at position 1 (i.e., just the first letter).
+ `if_else()` is a useful function which takes a logical comparison as its first argument, what to return if it is `TRUE` as the second argument, and what to return if it is `FALSE` as the third.

### `filter`

Now let's say we want to look at only a subset of the data. We can `filter()` it:

```{r}
filter(rt_data, str_length(word) > 6)
```

### `select`

If your dataframe is getting unruly, you can focus on a few key columns with `select()`:

```{r}
select(rt_data, subject, word, rt)
```

### `arrange`

You can also sort the dataframe by one of the columns:

```{r}
arrange(rt_data, rt)
```

### `group_by` and `summarise`

Most of the time when you have data, the ultimate goal is to summarize it in some way. For example, you may want to know the mean response time for each subject by type of word (real vs. fake).

```{r}
summarise(group_by(rt_data, subject, is_real), mean_rt = mean(rt))
```

As you can see, we often want to string `tidyverse` functions together, which can get difficult to read. The solution to this is...
### `%>%`

We can create a pipeline where the dataframe undergoes various transformations one after the other with the same functions, `mutate()`, `filter()`, etc., without having to repeat the name of the dataframe over and over, and with much more intuitive syntax.

```{r}
# this is the previous syntax
mutate(rt_data,
       rt_ms = rt * 1000,
       starts_b = str_sub(word, 1, 1) == "b",
       longer_than_6 = if_else(str_length(word) > 6, "long", "short"))

# this is the piping syntax
rt_data %>%
  mutate(rt_ms = rt * 1000,
         starts_b = str_sub(word, 1, 1) == "b",
         longer_than_6 = if_else(str_length(word) > 6, "long", "short"))
```

And we can keep adding functions to the pipeline very easily...

```{r}
rt_data %>%
  mutate(rt_ms = rt * 1000,
         starts_b = str_sub(word, 1, 1) == "b",
         longer_than_6 = if_else(str_length(word) > 6, "long", "short")) %>%
  filter(rt > 0.002)

rt_data %>%
  mutate(rt_ms = rt * 1000,
         starts_b = str_sub(word, 1, 1) == "b",
         longer_than_6 = if_else(str_length(word) > 6, "long", "short")) %>%
  filter(rt > 0.002) %>%
  group_by(subject, is_real) %>%
  summarise(mean_rt = mean(rt))
```

We can look just at conditions and add some summary stats:

```{r}
rt_data %>%
  group_by(is_real) %>%
  summarise(mean_rt = mean(rt),
            median_rt = median(rt),
            sd_rt = sd(rt))
```

It looks like average response time was longer for fake words.

### __Try it yourself...__

1. Add a new column to `rt_data` that codes whether the word ends with "n".
2. Get the means and counts for real and fake words split by whether they end in "n" or not.

    ```{r, indent = "    "}
    rt_data_n <- rt_data %>%
      mutate(ends_with_n = str_sub(word, -1, -1) == "n") %>%
      group_by(is_real, ends_with_n) %>%
      summarise(mean_rt = mean(rt),
                n_rt = n())
    rt_data_n
    ```

## Data visualization (`ggplot2`)

Looking at columns of numbers isn't really the best way to do data analysis. You could be tripped up by the placement of decimal points, or you might accidentally miss a big number.
It would be much better if we could PLOT these numbers so we can visually tell if anything stands out. If you've taken an introductory Psych or Neuro course you might know that a huge proportion of human cortex is devoted to visual information processing; we have hugely powerful abilities to process visual data. By plotting we can leverage that ability to get a fast sense of what is going on in our data.

Visualizing your data is an extremely important part of any data analysis. `tidyverse` contains a whole package of functions for plotting: `ggplot2`. I'll be showing you how to use these functions, but I'll also be trying to give you some intuitions about how researchers use visualization to get a better understanding of their data.

The first thing you might want to know is what your dependent variable, in this case the response time, looks like. In other words, how is it distributed?

### Histograms

```{r}
ggplot(data = rt_data) +
  geom_histogram(mapping = aes(x = rt))
```

`ggplot` syntax might seem a little unusual. You can think of it as first creating a plot coordinate system with `ggplot()`, and then you can add layers of information with `+`.

`ggplot(data = rt_data)` on its own would create an empty plot because you haven't told it anything about what variables you're interested in or how you want to look at them. This is where geometric objects, or "geoms", come in.

`geom_histogram()` is going to make this plot a histogram. A histogram has values of whatever variable you choose on the x axis and counts of those values on the y axis. The aesthetic mapping, `aes()`, lets us specify which variable, in this case `rt`, we want to know the distribution of.

We can also change visual aspects of the geom, like the number of bins (and therefore their width), depending on what will make the graph clearer and more informative.
```{r}
ggplot(rt_data) +
  geom_histogram(aes(x = rt), bins = 60)
```

__What can we learn from this histogram?__

In this histogram we can see that there are a lot of response times around 1 second and a few longer outlier response times. This is important to know when we analyze the response times, because certain descriptive statistics like the mean are very sensitive to outliers.

### Scatterplots

Let's say we are curious to see if people speed up or slow down over the course of doing the lexical decision task. So we want to plot response time against trial number. Let's put trial number on the x axis, rt on the y axis, and make a scatterplot.

For this we're going to use a different geom, `geom_point()`. Unlike `geom_histogram()`, it needs at least 2 aesthetics, x and y.

```{r}
ggplot(rt_data) +
  geom_point(aes(x = trial, y = rt))
```

This is a fair number of data points, so it's a little difficult to see what's going on. It might be useful to also show what the average across participants looks like at every trial. We can just add another layer to this same graph with `+`; in this case we'll use `geom_line()`.

```{r}
rt_by_trials <- rt_data %>%
  group_by(trial) %>%
  summarise(mean_rt = mean(rt))

ggplot() +
  geom_point(data = rt_data, aes(x = trial, y = rt), alpha = 0.2) +
  geom_line(data = rt_by_trials, aes(x = trial, y = mean_rt), color = "blue")
```

+ Because there were so many points and it was difficult to see, I made each point less opaque using `alpha = 0.2` as an argument to `geom_point()`, and I made the line connecting averages stand out by making it blue with `color = "blue"` as an argument to `geom_line()`.
+ Note that I'm plotting 2 different datasets within the same graph. This is easy to do because you can define data separately for each geom.

__What can we learn from this scatterplot + line graph?__

Is anything different happening in the first few trials? Does there seem to be a trend over the course of the trials?