In early 2022, I saw a discussion in the MIT GSU Slack about how MIT was boasting about our high PhD stipends without accounting for the fact that we face a higher cost of living than most PhD students.
I wanted to validate that claim with some data. There’s a website called phdstipends.com that has thousands of user-submitted PhD stipend datapoints, and it uses an MIT resource (livingwage.mit.edu) to adjust those stipends for cost of living.
I pulled those datapoints (and you can re-run the scripts below to update them with newly submitted data). There aren’t a ton of MIT stipends: around 60 since 2019, versus around 4,000 total over the same period.
But based on that data, it looks like MIT stipends fall somewhere around the 80th percentile when adjusted for cost of living.
The code below is my own (julianst@mit.edu), but you are free to modify or use it however you see fit without prior agreement. Thanks for taking a look!
This first bit is in Python 3.9.10; I like the requests module for pulling down the data.
import requests
import json
import pandas as pd

full_stipends = []
reqlimit = 1000

# Gather data from phdstipends.com pages until a page no longer returns data
for i in range(reqlimit):
    data = requests.get(f'https://www.phdstipends.com/data/{i}').json()['data']
    for entry in data:
        full_stipends.append(entry)
    if len(data) < 1:
        break

pd.DataFrame(full_stipends).to_csv("stipends.csv")
Now I’m going to switch to R, just because I’m more comfortable there.
First, some libraries:
library(data.table)
library(ggplot2)
library(readr)
library(grid)
library(gridExtra)
library(magrittr)
library(DT)
# Columns come in with default names (V1, V2, ...): V2 is the university,
# V4 the stipend in USD, V5 the living wage ratio, and V6 the academic year
stipends <- fread("stipends.csv")
# How many stipends are there total?
nrow(stipends)
## [1] 11783
# Let's just focus on the last 3 years
stipends_last3 <- stipends[V6 %in% c("2021-2022", "2020-2021", "2019-2020")]
nrow(stipends_last3)
## [1] 4989
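If you’re curious how those entries split across the three academic years, data.table’s .N gives a quick per-year count (a small aside; output omitted here):
# How many submissions per academic year?
stipends_last3[, .N, by = V6]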
# Right now the stipend column (V4) is a character; let's make it numeric
stipends_last3$V4 <- parse_number(stipends_last3$V4)
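parse_number() from readr strips currency symbols and grouping commas and returns the numeric value, for example (with a made-up amount):
readr::parse_number("$30,000")
## [1] 30000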
Let’s take a look at the raw stipend distribution:
# What does the total distribution of stipends look like?
hist(stipends_last3$V4)
Yeah, there are some wild outliers in there. Let’s filter on the living wage ratio (V5), keeping entries between 0.01 and 2.5.
stipends_last3_cleaner <- stipends_last3[V5 < 2.5 & V5 > 0.01]
nrow(stipends_last3_cleaner)
## [1] 4376
hist(stipends_last3_cleaner$V4)
And let’s also get anything that looks like an MIT stipend:
mit_stipends_last3 <- stipends_last3_cleaner[like(V2, "MIT")]
nrow(mit_stipends_last3)
## [1] 64
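Since like() does a pattern match on the university name, it’s worth a quick check that “MIT” isn’t also catching other institutions (a small sanity check; run it to see the matched names):
# Which university names actually matched the "MIT" pattern?
unique(mit_stipends_last3$V2)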
Now that we’ve got our happy data, let’s take a look.
datatable(mit_stipends_last3)
datatable(stipends_last3_cleaner)
summary(dplyr::select(mit_stipends_last3, c(V4, V5)))
## V4 V5
## Min. : 2999 Min. :0.11
## 1st Qu.:36996 1st Qu.:1.32
## Median :40596 Median :1.45
## Mean :37704 Mean :1.35
## 3rd Qu.:42945 3rd Qu.:1.54
## Max. :53000 Max. :1.90
summary(dplyr::select(stipends_last3_cleaner, c(V4, V5)))
## V4 V5
## Min. : 350 Min. :0.020
## 1st Qu.:21695 1st Qu.:0.940
## Median :27897 Median :1.180
## Mean :26976 Mean :1.129
## 3rd Qu.:32900 3rd Qu.:1.370
## Max. :66000 Max. :2.360
# p1: living-wage-adjusted stipend distribution, with the MIT median marked in red
p1 <- ggplot(stipends_last3_cleaner, aes(x = V5)) +
geom_histogram(bins = 300) +
theme_bw() +
xlab("Stipend Living Wage Ratio \n(Stipend normalized by cost of living, see livingwage.mit.edu)") +
geom_vline(xintercept = median(mit_stipends_last3$V5), col = "red", lwd = 1.1) +
theme(text = element_text(size = 16))
# p2: raw stipend distribution in USD, with the MIT median marked in red
p2 <- ggplot(stipends_last3_cleaner, aes(x = V4)) +
geom_histogram(bins = 300) +
theme_bw() +
xlab("Stipend (USD)") +
geom_vline(xintercept = median(mit_stipends_last3$V4), col = "red", lwd = 1.1) +
theme(text = element_text(size = 16))
gridExtra::grid.arrange(p1, p2, top = textGrob("PhD Stipends vs MIT (red) for 2019-2022\n from phdstipends.com", gp = gpar(fontsize = 20, font = 3)))
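Finally, to put a number on the “around the 80th percentile” claim from the intro, you can ask where the median MIT living wage ratio sits in the overall adjusted distribution (a quick sketch using the empirical CDF; the exact percentile will drift as new stipends are submitted):
# Percentile of the median MIT living wage ratio within the full adjusted distribution
ecdf(stipends_last3_cleaner$V5)(median(mit_stipends_last3$V5))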