Summary & Information

In early 2022, I saw a discussion in the MIT GSU Slack about how MIT was boasting about our high PhD stipends without considering that we have a higher cost of living than most PhD students.

I wanted to validate that claim with some data. There’s a website called phdstipends.com that has thousands of user-submitted PhD stipend datapoints. They also use an MIT resource to adjust those stipends by cost of living.

I pulled those datapoints (and you can re-run the scripts below to update them based on newly submitted data). There aren’t a ton of MIT stipends: around 60 since 2019, vs. around 4,000 total over the same period.

Based on that data, though, it looks like MIT stipends fall somewhere around the 80th percentile when adjusted for cost of living.

The code below is my own, but you are free to modify or use it however you see fit without prior agreement. Thanks for taking a look!

Gathering the data from phdstipends.com

This first bit is in Python 3.9.10; I like the requests module.

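import pandas as pd
import requests

full_stipends = []
reqlimit = 1000  # upper bound on the number of pages to request

# Gather data from phdstipends pages until the pages no longer return data
for i in range(reqlimit):
    response = requests.get(f'https://www.phdstipends.com/data/{i}', timeout=30)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    data = response.json()['data']
    if len(data) < 1:
        break
    full_stipends.extend(data)

# No column names are set here; the R code below refers to the columns as V2, V4, V5 and V6
pd.DataFrame(full_stipends).to_csv("stipends.csv")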
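
The loop above keeps requesting pages until one comes back empty. Here is a minimal sketch of the same pattern with the page fetcher passed in as a function, so it can be tried without touching the network. `collect_pages`, `fake_fetch` and the rows in `fake_pages` are hypothetical and not part of the original script.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], list], limit: int = 1000) -> List[list]:
    """Call fetch_page(0), fetch_page(1), ... and pool the rows until a page is empty."""
    rows: List[list] = []
    for page in range(limit):
        data = fetch_page(page)
        if not data:  # an empty page means there is nothing left to read
            break
        rows.extend(data)
    return rows


# Stand-in for the real request: three pages of made-up rows, then an empty page.
fake_pages = [[["a"], ["b"]], [["c"]], [["d"], ["e"]], []]


def fake_fetch(page: int) -> list:
    return fake_pages[page] if page < len(fake_pages) else []


rows = collect_pages(fake_fetch)
print(len(rows))
```

Against the real site, `fetch_page` could wrap the same `requests.get(...).json()['data']` call used in the script.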

Data Cleaning

Now I’m going to switch to R, just because I’m more comfortable with it.

First some imports:

library(data.table)
library(ggplot2)
library(readr)
library(grid)
library(gridExtra)
library(magrittr)
library(DT)
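
# fread names the columns V1, V2, ...; the ones used below are
# V2 (university), V4 (stipend), V5 (living wage ratio) and V6 (academic year)
stipends <- fread("stipends.csv")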

# How many stipends are there total?
nrow(stipends)
## [1] 11783
# Let's just focus on the last 3 years
stipends_last3 <- stipends[V6 %in% c("2021-2022", "2020-2021", "2019-2020")]
nrow(stipends_last3)
## [1] 4989
# Right now the stipend (V4) is a character column, let's make that numeric
stipends_last3$V4 <- parse_number(stipends_last3$V4)
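
`parse_number` pulls the first number out of a string and drops the characters around it. For anyone who would rather stay in Python, a rough standard-library equivalent might look like the sketch below. `parse_money` is a hypothetical helper, and it assumes the raw values look something like "$30,000"; it does not cover every format readr accepts.

```python
import re
from typing import Optional


def parse_money(text: str) -> Optional[float]:
    """Return the first number found in text, ignoring '$' and thousands commas."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None  # no digits at all, e.g. an empty or "n/a" cell
    return float(match.group(0).replace(",", ""))


print(parse_money("$30,000"))
print(parse_money("$1,234.50"))
print(parse_money("n/a"))
```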

Let’s take a look at the raw stipend distribution:

# What does the total distribution of stipends look like?
hist(stipends_last3$V4)

Yeah, there are some wild outliers in there. Let’s try to filter on the living wage ratio (V5), keeping values between 0.01 and 2.5.

stipends_last3_cleaner <- stipends_last3[V5 < 2.5 & V5 > 0.01]
nrow(stipends_last3_cleaner)
## [1] 4376
hist(stipends_last3_cleaner$V4)
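
To make the filter concrete, here is a small sketch of what the ratio and the window do. It assumes the living wage ratio is the stipend divided by the local annual living wage, which is my reading of the site and not something the scripts above verify. `living_wage_ratio`, `in_window` and the example numbers are hypothetical.

```python
def living_wage_ratio(stipend: float, living_wage: float) -> float:
    """Stipend divided by the local annual living wage (assumed definition)."""
    return stipend / living_wage


def in_window(ratio: float, low: float = 0.01, high: float = 2.5) -> bool:
    """Mirror the R filter above: keep ratios strictly between low and high."""
    return low < ratio < high


# Made-up (stipend, living wage) pairs, for illustration only.
examples = [(30000, 25000), (30, 25000), (90000, 25000)]
kept = [pair for pair in examples if in_window(living_wage_ratio(*pair))]
print(kept)
```

The second pair stands in for an obvious typo (a $30 stipend) and the third for an implausibly high entry; both fall outside the window.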

And let’s also get anything that looks like an MIT stipend:

mit_stipends_last3 <- stipends_last3_cleaner[like(V2, "MIT")]
nrow(mit_stipends_last3)
## [1] 64
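
One caveat: `like(V2, "MIT")` is a case-sensitive pattern match anywhere in the name, so it would also pick up any other school whose name happens to contain the capital letters "MIT". A stricter check could match "MIT" only as a standalone token. The sketch below is one way to do that in Python; `looks_like_mit` and the example names are made up for illustration, and I have not checked whether the real data contains any false matches.

```python
import re

# Match "MIT" as a standalone token, or the full institute name (assumed spellings).
MIT_PATTERN = re.compile(r"\bMIT\b|Massachusetts Institute of Technology")


def looks_like_mit(name: str) -> bool:
    return MIT_PATTERN.search(name) is not None


# Made-up name strings, for illustration only.
names = [
    "Massachusetts Institute of Technology (MIT)",
    "MIT",
    "SUMMIT College",
    "mit",
]
print([looks_like_mit(n) for n in names])
```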

Data preview and summary statistics

Now that we’ve got our happy data, let’s take a look.

datatable(mit_stipends_last3)
datatable(stipends_last3_cleaner)
summary(dplyr::select(mit_stipends_last3, c(V4, V5)))
##        V4              V5      
##  Min.   : 2999   Min.   :0.11  
##  1st Qu.:36996   1st Qu.:1.32  
##  Median :40596   Median :1.45  
##  Mean   :37704   Mean   :1.35  
##  3rd Qu.:42945   3rd Qu.:1.54  
##  Max.   :53000   Max.   :1.90
summary(dplyr::select(stipends_last3_cleaner, c(V4, V5)))
##        V4              V5       
##  Min.   :  350   Min.   :0.020  
##  1st Qu.:21695   1st Qu.:0.940  
##  Median :27897   Median :1.180  
##  Mean   :26976   Mean   :1.129  
##  3rd Qu.:32900   3rd Qu.:1.370  
##  Max.   :66000   Max.   :2.360
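
The summaries put the MIT median living wage ratio (1.45) above the overall third quartile (1.37). To turn that into a percentile, count the share of all entries at or below the MIT median. Here is a minimal sketch of that calculation; `percentile_rank` is a hypothetical helper and the ratios are made up, so the printed number is not a result from the real data.

```python
from statistics import median
from typing import Sequence


def percentile_rank(values: Sequence[float], x: float) -> float:
    """Percentage of values that are less than or equal to x."""
    if not values:
        raise ValueError("values must not be empty")
    return 100.0 * sum(v <= x for v in values) / len(values)


# Made-up living wage ratios, for illustration only (not the real data).
all_ratios = [0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.7]
mit_ratios = [1.2, 1.3, 1.5]

rank = percentile_rank(all_ratios, median(mit_ratios))
print(rank)
```

In R, the same idea should be expressible as `mean(stipends_last3_cleaner$V5 <= median(mit_stipends_last3$V5)) * 100`.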

Plots

p1 <- ggplot(stipends_last3_cleaner, aes(x = V5)) +
  geom_histogram(bins = 300) +
  theme_bw() +
  xlab("Stipend Living Wage Ratio \n(Stipend normalized by cost of living, see livingwage.mit.edu)") +
  geom_vline(xintercept = median(mit_stipends_last3$V5), col = "red", lwd = 1.1) +
  theme(text = element_text(size = 16))

p2 <- ggplot(stipends_last3_cleaner, aes(x = V4)) +
  geom_histogram(bins = 300) +
  theme_bw() +
  xlab("Stipend (USD)") +
  geom_vline(xintercept = median(mit_stipends_last3$V4), col = "red", lwd = 1.1) +
  theme(text = element_text(size = 16))

gridExtra::grid.arrange(p1, p2, top = textGrob("PhD Stipends vs MIT (red) for 2019-2022\n from phdstipends.com", gp = gpar(fontsize = 20, font = 3)))