Problems in Evaluating Four-Year Colleges
Recent proposals by the federal Spellings Commission to use standardized tests to evaluate four-year college educations have sparked controversy. [See, for example, “Spellings Commission on the Future of Higher Education Hints at National Standardized Testing for Universities,” MIT Faculty Newsletter, March/April 2008.] The Commission’s intent was to produce one or two numbers that purportedly would measure an institution’s “value added” in educating its students. The Commission regarded such numbers as potentially analogous to the EPA gas mileage figures on new car stickers: both attempt to make it easy for consumers to comparison-shop.
The National Association of State Universities and Land-Grant Colleges (NASULGC) tried to head off this effort by promoting a Voluntary System of Accountability (VSA), under which institutions administer one of three standardized tests to measure their contribution to their students’ improvement in critical thinking, analytic reasoning, problem solving, and written communication.
This enterprise, however, of trying to measure and then compare the common benefits of a college education among widely differing educational institutions through two-hour standardized tests reaffirms the truth of H. L. Mencken’s observation that "For every complex problem there is an answer that is clear, simple, and wrong."
These assessments are, at best, useless, and, at worst, subversive to the complex set of abilities that should inform undergraduate education.
The three tests are each about two hours long. They differ, however, in how they assess these abilities. The Collegiate Learning Assessment (CLA) of the Council for Aid to Education (CAE) is composed of three writing assignments, although one of them is broken up into short answers. The Educational Testing Service’s Measure of Academic Proficiency and Progress (MAPP) consists entirely of multiple-choice questions, while the ACT’s Collegiate Assessment of Academic Proficiency (CAAP) combines an argumentative essay with multiple-choice questions on “critical thinking.” The “value added” is a single number measured in one of two ways: either by giving the test in the same year to sample first-year and senior cohorts, or, longitudinally, by giving versions of the test to the same students when they enter and then again when they are seniors. In either case, higher scores by seniors are viewed as “objective” evidence of educational improvement: the greater the difference, the greater the institutional contribution to these general educational objectives.
Consider, for example, this sample essay prompt:
"Your college administration is considering whether or not there should be a physical education requirement for undergraduates. The administration has asked students for their views on the issue and has announced that its final decision will be based on how such a requirement would affect the overall educational mission of the college. Write a letter to the administration arguing whether or not there should be a physical education requirement for undergraduates at your college."
(Do not concern yourself with letter formatting; simply begin your letter, "Dear Administration.")
The CLA “Make An Argument” task asks students to respond in 45 minutes to a prompt such as the one given to me at a recent Web conference:
“Government funding would be better spent on preventing crime than in dealing with criminals after the fact.”
Directions (45 minutes): Present your perspective on the issue, using relevant reasons and/or examples to support your views.
Flawed Character of the Writing Tests
Although these exercises differ from each other, and both differ from the kind of “universal” writing prompt (e.g., “Is failure necessary for success?”) found on the SAT and other standardized tests, they are fundamentally unlike college writing (and thinking) in two very profound ways. First, all three exercises occur in the absence of relevant and necessary data. A real argument about the necessity of a physical education requirement depends on various kinds of relevant evidence. The prompt mentions the “overall educational mission of the college,” and most colleges and universities have mission statements, but how many undergraduates are familiar with them? Are students first supposed to define and argue for their institution’s educational mission? Personal experience is one kind of evidence, but it is not the primary type found in academic arguments. A real argument requires research and information. The same is true for the prompt on spending money to prevent crime. Indeed, college teaches students how to become information literate: knowing when external data is needed, how to access or create the data, how to evaluate it, and how to use it. Both of these assignments, on the other hand, provide neither data nor any time to think seriously about the topic.
Second, even if there were data, these two questions give students too little time to explore and think about the information meaningfully and to revise their arguments. No one can do that in 20, 25, or 45 minutes on a topic that they may never have seriously considered beforehand. Students do write timed essays on demand, but they are always writing on topics they have studied, read about, and discussed. Indeed, there are very few real-world situations in which someone, out of the blue, will ask someone else to write a coherent, well-reasoned argument on a topic the writer is not expected to have thought about previously.
MIT students, however, do well on this kind of essay. That is one of the reasons they are here. But doing well on these essays is, at best, largely unrelated to successful academic writing, and is, at worst, subversive to it. Over a year ago, I developed a set of strategies, based on both published training samples and essay-scoring machine algorithms, to hack the SAT Writing Essay. [Click here to view the entire set of strategies.] The key elements are to follow the rigid structure of the five-paragraph essay; fill up both pages of the test booklet; include lots of detail, even if it is made up or inaccurate; use lots of big words, especially substituting “plethora of” and “myriad number of” for “many”; and insert a famous quotation near the conclusion of the essay, even if it is irrelevant to the rest of the essay. I then trained several high school seniors to use these strategies. Since then, I have handed them out to scores of students, their parents, aunts, uncles, and friends of the family. The limited results I have seen are frightening. I know of no student who has used my strategies who has scored lower than the 92nd percentile. (See the article on this work, “Fooling the College Board,” in Inside Higher Ed.)
I classify these kinds of essays as data-free writing because although scorers are looking for specific detail, they are usually explicitly told to ignore any factual errors.
Our students know that the best way to succeed in writing these essays is simply to make up information as needed. Indeed, they have told me that in these timed writing situations, it is much more efficient to invent facts than to try to recall real, relevant information.
The Importance of Data-Rich Writing Skills
In contrast, the kind of writing that students perform in almost all their academic experiences at MIT is data-rich. Students learn how to explore a topic, gather information, identify a central assertion, and then use specific information to shape and refine their argument. This description is equally valid for laboratory reports, design documents, and essays on literature. All academic writing involves a design process. A document is an artifact intended for use. The writer discovers, selects, and assembles information for a specific audience to use for a specific purpose. As with any other design problem, the writer has to consider the external constraints imposed by the data and also determine various trade-offs, such as how much detail to include in explanations. These tests offer neither the context nor the time for such a design exercise.
The CLA does include an essay that it classifies as a “Performance Task,” claiming that it “combines skills in critical thinking, analytic reasoning, problem solving and written communication . . . based on real-life scenarios” while assessing the ability to use information effectively. In one sample prompt, a student is asked to assume the role of an assistant to the president of a high-tech company. [Council for Aid to Education. CLA: Collegiate Learning Assessment Brochure. 2008.] A sales manager has recommended that the company buy a particular plane to transport sales personnel to and from customers, but before the plane was purchased there was an accident involving that particular model. The student is provided with the following documents: 1) a newspaper report of the accident; 2) an FAA report on “in-flight breakups in single engine planes;” 3) e-mails from the president to the assistant and from the sales manager to the president; 4) charts displaying the performance of this particular line of planes; and 5) an article from a magazine for amateur pilots comparing this plane to others in its class.
Because the sample prompt just names the documents but does not display them, it is difficult to assess how much information is given to students. A description of another scenario, on crime reduction, lists the information given to a student as consisting of a newspaper article, a research report, crime statistics and tables, and an annotated bibliography. Academic writing, however, uses an annotated bibliography as a starting point for information gathering, not as an end point.
In essence, students are given a six- or eight-piece puzzle, and we are told that mastering it will tell us how well they can navigate the vast sea of information that surrounds us. These tests do not encourage students to learn how to obtain, assess, and use information appropriately; they teach them to formulaically manipulate prepackaged information bites. These kinds of essay tests are neither data-free nor data-rich, but data-lite.
The third essay type used by the CLA, “critique-an-argument,” is identical in format to a task used on both the GRE and GMAT. Students are given 30 minutes to evaluate an argument such as this example from the CLA brochure:
"The number of marriages that end in separation or divorce is growing steadily. A disproportionate number of them are from June weddings. Because June weddings are so culturally desirable, they are often preceded by long engagements as the couples wait until the summer months. The number of divorces increases with each passing year, and the latest statistics indicate that more than 1 out of 3 marriages will end in divorce. With the deck stacked against 'forever more' it is best to take every step possible from joining the pool of divorcees. Therefore, it is sage advice to young couples to shorten their engagements and choose a month other than June for a wedding."
This kind of question is easily coachable. Most of the logical problems in such passages involve some form of confusing correlation with causation. Moreover, I doubt that many of our entering students would have trouble identifying the logical fallacies in the passage; consequently, they would perform just as well as seniors.
Limited Value of Standardized Writing for Evaluating MIT Students
Lori Breslow, in the last issue of the Faculty Newsletter, stated that MIT would not participate in the VSA. MIT, however, is participating in a grant from the Fund for the Improvement of Post-Secondary Education (FIPSE) to assess the “validity” of these three tests by having 50 first-year and 50 fourth-year students take all three of them during this coming academic year. The results from each of these tests will be compared “to examine the extent to which disparate measurement tools recommended as part of the VSA can be used interchangeably, whether these tools are measuring similar or dissimilar outcomes or levels of achievement, and the role test format (e.g., multiple choice vs. open-ended/constructed response measures) plays in the correlation among measures.” [National Association of State Universities and Land-Grant Colleges. “Higher Education Consortium Awarded $2.4 million Grant from Department of Education for Project on Student Learning Assessment.” September 28, 2007.] Neither the students nor the Institute, however, will see the scores.
It seems to me that there is little benefit to anyone in MIT’s participation in this study.
The developers of tests such as the CLA claim that regressing scores against SAT scores can compensate for response bias because there is a linear relationship between CLA and SAT scores. [Klein, S., R. Benjamin, R. Shavelson, and R. Bolus. “The Collegiate Learning Assessment: Facts and Fantasies.” Draft 4/24/2007. Council for Aid to Education. p. 10.] MIT students’ SAT scores, however, cluster near the top of the scale, precisely where any such linear relationship is least likely to hold.
Similarly, given that almost all undergraduates entering MIT have “critical thinking” and “written communication” abilities that would probably score more than two standard deviations above the mean on these tests, a pre-test / post-test protocol would likely produce false negatives because of regression toward the mean. In addition, because the student volunteers are paid simply for taking the tests and have no incentive to do well, many MIT seniors might take the tests less seriously than entering first-year students and perform perfunctorily at best, creating even more opportunities for false negatives.
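The false-negative mechanism can be sketched with a small simulation (hypothetical numbers, not data from any of these tests): when a cohort is selected for high scores on a noisy measure, its mean on a second noisy measure drops even when no real change has occurred.

```python
import random

random.seed(1)

# Hypothetical illustration: abilities on a standardized scale
# (population mean 0, SD 1), measured twice with error (SEM = 0.5).
N = 100_000
SEM = 0.5
abilities = [random.gauss(0, 1) for _ in range(N)]
pre = [a + random.gauss(0, SEM) for a in abilities]
post = [a + random.gauss(0, SEM) for a in abilities]  # no true change at all

# Select a high-scoring cohort on the (noisy) pretest, roughly as a
# highly selective institution does at admission.
cohort = [i for i in range(N) if pre[i] >= 2.0]
mean_pre = sum(pre[i] for i in cohort) / len(cohort)
mean_post = sum(post[i] for i in cohort) / len(cohort)

print(f"cohort pretest mean:  {mean_pre:.2f}")
print(f"cohort posttest mean: {mean_post:.2f}")  # lower, despite zero true change
```

Even with identical underlying abilities on both occasions, the selected cohort's retest mean falls noticeably below its pretest mean, so a real educational gain can be masked entirely and reported as zero or negative "value added."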
The literature on these tests also ignores the extent of measurement error, which exacerbates regression toward the mean, especially in open-ended essay tests. Although essay tests are clearly more valid measures than multiple-choice tests, they possess an additional dimension of error. The ±30 points on the SAT indicate test/retest reliability: if a student takes the test again, there is a 67% probability that the second score will be within ±30 points of the first. With essay scoring, however, there is the additional issue of reader reliability. Even if two readers read each essay, how reliable is the total score? Or, to frame the question another way, if the essay is graded again, what is the probability that it will receive the same score? The best that scoring sessions can produce is a reliability of about 80%. When the essays are scored remotely and by a single reader, as the CLA essays are, the reliability is significantly lower. [Breland, H.M. & R.J. Jones. Remote Scoring of Essays. College Board Report No. 88-3, ETS Research Report No. 88-4. New York: College Entrance Examination Board, 1988.]
Scoring the essays by computer eliminates scoring unreliability but is highly dubious. Originally, two of the three essays were scored by ETS’s machine-scoring algorithm, E-rater. Last year, 90% of each of the three essays was graded by a single human reader (10% of each was double-scored). This coming year the analytic essay will be scored by E-rater. The major factors E-rater uses in scoring an essay include length, a four- or five-paragraph essay structure, and the frequency of infrequently used words. [Attali, Y. & J. Burstein. Automated Essay Scoring with E-rater® v.2.0. ETS Research Report 04-45. November 2005.] That is why an egregious plethora of malapropisms scores well on these tests.
Although MIT is participating in the part of the FIPSE grant that is attempting to validate the reductive assessment practices called for in the Spellings Commission Report, the same grant funds the Valid Assessment of Learning in Undergraduate Education (VALUE) initiative sponsored by the Association of American Colleges and Universities (AAC&U). This organization has already articulated a more complex set of Essential Learning Outcomes that include Inquiry and Analysis, Critical and Creative Thinking, Written and Oral Communication, Information Literacy, Teamwork and Problem Solving, Civic Knowledge and Engagement – both local and global, Intercultural Knowledge and Competence, and Ethical Reasoning.
These skills are meant to be practiced throughout the curriculum at increasing levels of difficulty. Rather than trying to measure these complex outcomes through a series of timed impromptu tests, this initiative treats assessment as primarily cumulative and plans to implement it largely through e-portfolios. [Association of American Colleges and Universities. “VALUE: Valid Assessment of Learning in Undergraduate Education.”] Such an approach will not produce a single number for national comparison. Instead, it will produce rich and abundant data that can be used to improve teaching to better meet these objectives. The University of Michigan and the Rose-Hulman Institute of Technology are among the diverse group of colleges and universities that have taken a leadership role in this project. I suggest that MIT join them.