Fun with Facebook


Here are some interesting things I've done with data collected from facebook.com. For those of you who don't know, facebook is "an online directory that connects people through social networks at schools."

Using aggregate facebook data for social analysis.

Facebook gives millions of college students the chance to create online digital profiles for themselves. Profiles can contain housing information, major, gender, birth date, political stance, sexual orientation, interests, and other things. Nowhere else is there such a readily accessible database of information on people as the one that already exists on facebook. Facebook may lack the rigor and depth found in administered surveys, but this is made up for by its ubiquity. When I collected this data, over 93% of current MIT freshmen held an account. I went about collecting and analyzing some of this data in aggregate to see what surprising (or totally unsurprising) trends would arise. I've only generated graphs for MIT!

These are from data collected in the summer, before most of the MIT class of 09 had registered.



One interesting part of the MIT housing system is that it's entirely up to the students to choose where they want to live. As any MIT student would know, this leads to very diverse and distinct cultures within the dorms. This can be seen in the facebook data.




(Major vs. Dorm; only the "interesting" ones were included)


Technical Details
The script was written in Perl and uses wget to make the HTTP requests. The User-Agent HTTP header is spoofed as Firefox. Data is output in CSV (comma-delimited) form, and Microsoft Excel was used to make the graphs. (If some gnuplot wizard would like to automate the bar graph creation process, let me know.)
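
As a rough sketch of the CSV step (in Python rather than the Perl the script was actually written in), a table of cross-tabulated counts could be written out like this. The dorm names, stance labels, and counts are made up for illustration, and write_crosstab_csv is a hypothetical helper, not part of the original script.

```python
import csv
import io


def write_crosstab_csv(counts, row_labels, col_labels, out):
    """Write a table of hit counts as comma-delimited rows.

    counts maps (row_label, col_label) -> number of hits; missing
    pairs are written as 0.
    """
    writer = csv.writer(out, lineterminator="\n")
    # Header row: an empty corner cell followed by the column labels.
    writer.writerow([""] + list(col_labels))
    for row in row_labels:
        writer.writerow([row] + [counts.get((row, col), 0) for col in col_labels])


# Made-up example data (not real facebook numbers).
dorms = ["Dorm A", "Dorm B"]
stances = ["Liberal", "Moderate", "Conservative"]
counts = {
    ("Dorm A", "Liberal"): 12,
    ("Dorm A", "Moderate"): 5,
    ("Dorm B", "Liberal"): 7,
    ("Dorm B", "Conservative"): 3,
}

buffer = io.StringIO()
write_crosstab_csv(counts, dorms, stances, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To get a file that opens directly in a spreadsheet, a file opened with open(path, "w", newline="") can be passed as out in place of the in-memory buffer.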


No longer possible...
Due to changes in facebook in October 2005, I can no longer gather data for schools I don't have an account for. Facebook has also developed a policy against automated scripting and is monitoring for violations. In short, I won't be able to do this again.

More recent data...

Here is the result of a more recent run of this on MIT. I've made some improvements.


There have been a few changes, mainly seniors getting quite a bit more liberal and everyone else going down a bit. I've changed the scale in this one, and given the number of inputs in parentheses.


Here's the graph including all living groups (with sufficient data).



Here's an updated Major vs Political as well.

RAW DATA FOR OTHER SCHOOLS


Here is the data for other schools. This includes arizona, artic, baylor, byu, caltech, case western, cornell, cua, duke, famu, georgetown, gmu, grinnel, hbu, KU, MIT, oxford, princeton, rice, sewanee, smith, smu, stanford, trinity, U-Miami, U-Mich, U-Penn, USA, USMA, U-Tulsa, vassar, wellesley, wichita, williams, harvard (as www), and yale.

Some brief perl-esque pseudo code...
[User Defined]
$cookie="Cookie: ...."; #An authenticated facebook cookie is needed
$school="...."; #Facebook used to allow searching of any school; now only the school of the account owner will work
$parameter1="...."; #The first parameter, which is crossed with the second parameter
$parameter2="...."; #Example: "house" for parameter1 and "political" for parameter2 to create the classic dorm vs. political stance
$restraints="...."; #This is passed with every request; for example, limit it to current students with "status=1", or to freshmen with "class=09"
[/User Defined]

@AllOptions1=ParseOptions($parameter1); #makes a wget request to "search.php?advanced=1" (passes the cookie along too)
@AllOptions2=ParseOptions($parameter2); #these two calls acquire all possible values of parameters 1 and 2 and store them appropriately

foreach $value1 (@AllOptions1) {
    foreach $value2 (@AllOptions2) {
        #The magic happens here: QueryFacebook performs a wget request for the passed argument and reads the number of results returned
        $hits=QueryFacebook("search?$parameter1=$value1&$parameter2=$value2&$restraints");
        StoreHits($hits); #this function records the $hits from the last request
    }
}

#Now we have the number of hits for each cross section. This data can be analyzed in a number of ways, depending on what exactly the two parameters are.
#Note: things like the number of students in the school and the number of respondents can be useful for analysis.
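
To make the note about respondents concrete, here is a small sketch (in Python, not the original Perl) that crosses two lists of options and turns the raw hit counts into per-row percentages. count_hits stands in for QueryFacebook and is only a stub that looks up made-up numbers, so nothing here makes a network request. crosstab and row_percentages are names invented for illustration.

```python
from itertools import product


def crosstab(options1, options2, count_hits):
    """Call count_hits for every (option1, option2) pair and collect the results."""
    return {(a, b): count_hits(a, b) for a, b in product(options1, options2)}


def row_percentages(counts, options1, options2):
    """Convert raw counts to percentages of each row's respondents.

    Returns (percentages, respondents), where respondents[a] is the
    total number of hits for option a across all of options2.
    """
    percentages = {}
    respondents = {}
    for a in options1:
        total = sum(counts[(a, b)] for b in options2)
        respondents[a] = total
        for b in options2:
            # A row with no respondents gets 0.0 rather than dividing by zero.
            percentages[(a, b)] = 100.0 * counts[(a, b)] / total if total else 0.0
    return percentages, respondents


# Made-up numbers standing in for real search results.
FAKE_HITS = {
    ("Dorm A", "Liberal"): 30, ("Dorm A", "Conservative"): 10,
    ("Dorm B", "Liberal"): 5, ("Dorm B", "Conservative"): 15,
}


def count_hits(house, political):
    """Stub for QueryFacebook: returns a canned count instead of querying."""
    return FAKE_HITS.get((house, political), 0)


houses = ["Dorm A", "Dorm B", "Dorm C"]
stances = ["Liberal", "Conservative"]
counts = crosstab(houses, stances, count_hits)
percentages, respondents = row_percentages(counts, houses, stances)

for house in houses:
    shares = ", ".join(f"{s}: {percentages[(house, s)]:.0f}%" for s in stances)
    print(f"{house} ({respondents[house]}): {shares}")
```

Printing the respondent count in parentheses next to each dorm makes it easy to spot rows that rest on too few profiles to mean much.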