This morning I read this thoughtful article by Alan Jacobs about automated essay grading by computers. I think most criticisms of algorithmic grading are overblown, but the article got me thinking that some of the objections deserve more careful consideration.
Claiming computers can grade as well as humans is a difficult point to defend, because as Professor Jacobs points out, “good” writing is rather ambiguous.
The most obviously evaded question is this: When students are robo-graded, the quality of their writing improves by what measure?
Some measures of good writing can be assessed by algorithms without much difficulty, and often the algorithms need not even be very sophisticated. Every sentence needs a subject and a verb. There should generally be a comma before a coordinating conjunction separating two independent clauses. These well-established, well-defined rules, which might perhaps be better described as the grammar and lexical semantics of a piece, can easily be graded by an algorithm. If an automated grader fails at these goals, that failure is because the algorithm needs to be improved, not because machine grading itself is fundamentally flawed.
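To make the point concrete, here is a toy heuristic, a few lines of Python I'm sketching for illustration, not the logic of any real grammar checker: it flags a sentence that appears to join two independent clauses with a bare coordinating conjunction and no comma. (The pronoun list and regex are my own simplification; real checkers parse the sentence.)

```python
import re

# Toy heuristic, not a real parser: look for subject ... conjunction ... subject
# with no comma in between, e.g. "I ran to the store but she stayed home."
CONJUNCTIONS = r"(?:and|but|or|so|yet)"
SUBJECTS = r"(?:I|you|he|she|it|we|they)\b"

def missing_comma_before_conjunction(sentence: str) -> bool:
    """Return True if the sentence looks like two independent clauses
    joined by a coordinating conjunction without a preceding comma."""
    pattern = re.compile(
        rf"\b{SUBJECTS}[^,]*\s{CONJUNCTIONS}\s+{SUBJECTS}",
        re.IGNORECASE,
    )
    return bool(pattern.search(sentence))

print(missing_comma_before_conjunction("I ran to the store but she stayed home."))   # True
print(missing_comma_before_conjunction("I ran to the store, but she stayed home."))  # False
```

A rule this crude produces false positives, of course, which is exactly the sense in which a failure indicts the particular algorithm rather than machine grading as such.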
Algorithmic grading enables a valuable feedback loop, and even the most ardent opponents of algorithmic grading would be hard-pressed not to concede that instant evaluation of sentence structure, grammar, and basic semantics would be helpful to novice writers. The important question centers on more advanced writing: How do you create an algorithm that, by necessity, evaluates something as abstract as a paper’s “quality” in a quantitative way? Does following an algorithm’s assessment of good writing stunt a writer’s development because machines lack human sensibility?
I think there is merit to this point, but not because of some lofty notion that following a machine’s opinion results in writing that is formulaic, mathematical, and cold. Yes, good writing is hard to define, but here’s a thought experiment: imagine giving a college freshman an essay prompt on a topic covered by an article in The Atlantic, then comparing the two pieces. My guess is that nearly everyone will agree the Atlantic article is better than the student’s paper, even if the reasons given for why vary widely. The qualities of good writing may be nebulous, but humans seem able to identify good writing even if they can’t give a complete description of what it entails.
Mr. Jacobs’s quote of Les Perelman in the Boston Globe explains the risk of giving high marks to papers that merely contain the correlates of quality, rather than actually being quality papers:
Robo-graders do not score by understanding meaning but almost solely by use of gross measures, especially length and the presence of pretentious language. The fallacy underlying this approach is confusing association with causation. A person makes the observation that many smart college professors wear tweed jackets and then believes that if she wears a tweed jacket, she will be a smart college professor.
I don’t fully agree with this claim. Like humans trying to describe why one article is of higher quality than another, many learning algorithms have good predictive power without incorporating a causal mechanism. I can predict with some confidence that there will be more car accidents when there are more snowplows on the road — I don’t need to know whether the roads are slick. An algorithm to predict good writing can likewise do well, even without identifying the causes of good writing, provided the algorithm is robust enough. That the current generation of computer grading systems can be tricked by verbosity and florid prose is a sign that the technology is still in its infancy, not that the concept is unworkable.
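Perelman's tweed-jacket fallacy is easy to reproduce in code. The sketch below is a deliberately naive "grader" of my own invention (the word list and scoring weights are made up) that scores only the gross measures he describes, length and pretentious vocabulary, and is fooled accordingly:

```python
# Deliberately naive robo-grader: it scores correlates of quality
# (word count, fancy vocabulary) rather than quality itself, so
# padded, pretentious prose outranks a clear, concise sentence.
PRETENTIOUS = {"utilize", "paradigm", "multifaceted", "plethora", "aforementioned"}

def naive_score(essay: str) -> int:
    words = essay.lower().split()
    length_score = len(words)                                  # longer is "better"
    fancy_score = sum(5 for w in words if w.strip(".,") in PRETENTIOUS)
    return length_score + fancy_score

concise = "Clear writing states its point and stops."
padded = ("The aforementioned multifaceted paradigm serves to utilize "
          "a veritable plethora of considerations in order to state its point.")

print(naive_score(concise) < naive_score(padded))  # True: verbosity wins
```

A more robust model would need features that don't reward padding, which is the engineering problem the current generation of graders hasn't solved.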
The subtler, more important objection to learning writing from a computer is often overlooked. Learning how to write from a computer is treacherous because writing is meant to be consumed by humans. Mr. Jacobs rightly points out this too often forgotten aspect of computer grading:
Perelman’s actual arguments belie Paul’s statement that “critics like Les Perelman of MIT claim that robo-graders can’t be as good as human graders, it’s because robo-graders lack human insight, human nuance, human judgment.” It’s perfectly clear even from these excerpts that Perelman’s point is not that the robo-graders are non-human, but that they reward bad writing and punish good.
I suspect, then, that with this automated grading we’re moving perilously close to a model that redefines good writing as “writing that our algorithms can recognize.”
The writing sections of standardized tests suffer from essentially the same problem. The formulaic, uninspired writing characteristic of these timed essays is meant to satisfy the algorithm, in this case the grading rubric, not the reader. This sort of algorithmic evaluation is the primary reason no one in higher education takes the results of writing sections seriously, and why the scores aren’t really considered beyond a crude metric for whether the applicant has a basic handle on the English language. Since the context demands everyone write to an algorithm anyway, I don’t really have any issues with eventually replacing all the graders of standardized tests with robots. In fact, the robots may be more reliable and impartial, not susceptible to variations caused by extraneous factors such as whether the grader missed lunch or got a good night’s sleep.
Outside of standardized testing, the consequences of computerized grading are more nuanced. Instant feedback is desirable, and it evidently also encourages students to write more. However, learning to write from a computer may change composition in unexpected ways. In the past, writing has evolved in lockstep with culture for the enhanced understanding of human readers. Hopefully using an algorithm to train a generation of writers won’t interrupt that relationship. More hazardous still, inexperienced writers may unintentionally adapt their writing to the algorithm. We already do this sort of adaptation every day on the web. Search engines have only recently begun to handle natural language queries like “Where is the nearest coffee shop?” effectively; until then we typed “Starbucks, San Francisco,” a query that would earn a puzzled stare from a real person.