Comparing beer styles using Chernoff faces

One of my professors recently gave us a sneak peek at a nice short course on data visualization, The Art and Science of Data Visualization Using R, that he’s teaching at JSM 2015. One of the more amusing visualizations was Chernoff faces:

Chernoff faces, invented by Herman Chernoff, display multivariate data in the shape of a human face. The individual parts, such as eyes, ears, mouth and nose represent values of the variables by their shape, size, placement and orientation. The idea behind using faces is that humans easily recognize faces and notice small changes without difficulty.

Later on I saw this post about multivariate beer on FlowingData which describes using beer flavors to represent county demographics. Here’s the reverse: Rather than using beer flavor to represent multivariate data, the faces are a multivariate data visualization of beer categories. The faces can give some idea of an unfamiliar beer’s profile based on more familiar beer categories.

As usual, the code is on GitHub.

Chernoff faces beer style comparison

Fix R syntax highlighting in vim

One little annoyance of editing R scripts with vim is that the syntax highlighting for newly created ‘.R’ files is all wrong.  The reason is that, by default, vim opens new files with a ‘.R’ extension as a ‘rexx’ filetype instead of an ‘r’ filetype.  Presumably this default was conceived back when R was not the enormously popular statistical programming language it is today.

If you save a ‘.R’ file with some code and some comments and then re-open it, vim is usually smart enough to detect the correct filetype, but a more elegant fix is to specify the filetype for new files with a ‘.R’ extension.  Putting this snippet into ~/.vim/filetype.vim corrects the syntax highlighting:

" Skip this file if filetype detection has already been loaded
if exists("did_load_filetypes")
  finish
endif
augroup filetypedetect
  au BufNewFile,BufRead *.r,*.R setf r
augroup END

Getting SVG Graphics to work in R on OS X

I was excited to learn that R supports creating SVG vector images out of the box.  SVG is great because 1) unlike raster graphics, SVG scales to any size without distortion, and 2) the format is XML-based, and in most modern browsers you can just embed the graphic right into your HTML.  That way you can easily share your R plots on the web; the graphic in this post is indeed SVG.  So you can imagine my dismay when I tried to generate an SVG plot and got this error:

x <- 1:10
y <- 1:10 + runif(10)

svg(filename="plot1.svg", width=5, height=5)

Warning messages:
1: In svg(filename = "plot1.svg", width = 5, height = 5) :
unable to load shared object '/Library/Frameworks/R.framework/Resources/library/grDevices/libs//':
dlopen(/Library/Frameworks/R.framework/Resources/library/grDevices/libs//, 6): Library not loaded: /opt/X11/lib/libXrender.1.dylib
Referenced from: /Library/Frameworks/R.framework/Resources/library/grDevices/libs//
Reason: image not found
2: In svg(filename = "plot1.png", width = 5, height = 5) :
failed to load cairo DLL

The problem is that OS X Mavericks does not ship with X11 support.  Confusingly, this error is actually expected behavior.  A less cryptic message probably could have saved me and you some googling, but at least the fix is easy.  Just install XQuartz and you should be good to go.

Download and install XQuartz:

  1. Go to the XQuartz website and download the latest version (2.7.7 when I wrote this)
  2. Mount the .dmg and install the package as usual
  3. After the installation is complete, XQuartz will give you a message telling you to log out and log back in to make XQuartz your default X11 server.  To do so, just click on the Apple icon on the menubar and select “Log Out {Your Username}…” then log back in

Once you’ve logged back in, SVG graphics should work.  Let’s give that another try:
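
Something along these lines should do it.  This is only a minimal sketch that reuses the x and y from the failing example; the plot() call, its title, and the closing dev.off() are assumptions about what a complete example needs, not necessarily the exact code behind the graphic in this post.

```r
# Same toy data as in the failing example above
x <- 1:10
y <- 1:10 + runif(10)

# Open the SVG device; width and height are given in inches
svg(filename = "plot1.svg", width = 5, height = 5)

# Assumed plot call: a plain scatterplot of y against x
plot(x, y, main = "SVG test plot")

# Close the device so the file is actually written to disk
dev.off()
```

If svg() still warns about the cairo DLL, running capabilities("cairo") in the R console is a quick way to check whether R can see cairo at all.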

The .svg file will be written to your working directory, and you can open it with your favorite browser.

Top NSF Grant Receiving Institutions

The other day I was thinking about how to get some grant money, and I started wondering which institutions get the most money from the National Science Foundation.  The results weren’t entirely what I expected.  I would never have guessed, for instance, that the University of Illinois got a few huge grants to further supercomputing and cyberinfrastructure, placing it at number three for the most money received from the NSF last year.  I had never even heard of the Consortium for Ocean Leadership, which, at number two, got tens of millions to research offshore drilling and a large-scale ocean sensor network, among other things.  The most money went to Lockheed Martin for the United States Antarctic Program.  My institution came in at number sixty-seven, with the greatest proportion of the grants going towards the Graduate Research Fellowship Program.  I guess that’s good news for me.

Code for the chart is on my GitHub page.

Improving Education in America

Elizabeth Green’s new book Building a Better Teacher has been getting a lot of press lately, and since I’ll be a teaching assistant again soon, it got me thinking about education in America.  One of the big takeaways from the book is that effective teaching is not an innate talent, but rather a set of techniques that can be improved with diligent practice.  In particular, Green encourages apprenticeship and critically evaluating previous lessons to improve future instruction.  In that regard it kind of reminds me of Agile software development’s emphasis on continuous improvement.  That’s a great idea, and I hope that Green’s book will encourage its implementation across America’s educational system.

Of course, raw talent never hurts, and the perennial concerns over America’s middling PISA scores got me wondering about the best way to recruit top talent to teaching.  The Varkey GEMS Foundation has indicated that teaching is considered a relatively low prestige job in America as compared to other countries.  I’d argue that, for better or worse, high prestige jobs in America are closely tied to salary.  Investment banking and consulting are both prestigious and selective, and much of that prestige/selectivity is driven by the high pay for those occupations.  Used appropriately, the relationship between prestige and pay in America could actually be a benefit for teaching.  If true, teaching could be made more prestigious just by increasing pay, which is much easier than attempting to use rhetoric to sway cultural perceptions.

With that in mind, I looked for some data about how much teachers in America make and came across this article arguing that American teachers are not, in fact, poorly compensated.  On the contrary, they are the 6th highest paid in the world.  I don’t think that paying teachers more is a panacea for difficulties in education, but I also don’t think the right question is how much teachers make relative to teachers in other countries.  A student considering becoming a teacher is not wondering about pay relative to teachers in Mexico — prospective teachers are weighing their potential income as a teacher against whatever else they could be doing instead.

I gathered some data from the OECD (code is on GitHub) about teacher pay relative to the average annual salary in each country (I would have preferred median annual salary, but I’ll take the data I can get).  Even though the available data was limited, it’s still pretty clear that a very different picture emerges:

By this metric, America is near the bottom of the pack.  The chart is missing many of the countries with top PISA scores, but a cursory spot check on some of the top performers suggested similar results.  China, for example, which regularly scores near the top of the PISA pack, has only about one-eighth the average household income of the United States, but on a purchasing power parity basis pays its teachers around 40% of what a teacher in the US would make.

Still, I don’t think that teacher pay tells the whole story, and I expect there are many exceptions to the teacher prestige/student performance relationship.  A great point raised by The Atlantic is that teaching is often thought of only as the time spent in the classroom:

It is high time to correct a common misimpression: teaching isn’t the relatively leisurely occupation many people imagine, enviously invoking a nine-to-three school day and long summer vacations, which in reality seldom exist. We think of no other white-collar profession in terms of a single dimension of job performance. We don’t, for example, regard lawyers as “working” only during the hours they’re actually presenting a case before a judge; we recognize the amount of preparation and subsequent review that goes into such moments.

Other professions recognize that a polished hour-long presentation takes, at a minimum, several hours to prepare.  Yet an hour or two of lesson planning is expected to be sufficient for an entire day of teaching, on top of grading, parental concerns, and administrative overhead.  The most important educational reform might not be a new technology or pedagogical technique, but simply giving teachers more time outside of the classroom.

Are machine generated articles copyrightable?

In 2011 a macaque stole the camera of wildlife photographer David Slater and took this now widely distributed picture known as the “monkey selfie.”  Here’s the tricky bit: Seemingly in response to this photo, the United States Copyright Office now explicitly stipulates that items created by non-humans cannot be copyrighted.

The “Monkey Selfie”

Like most good controversies, this one started with money.  Under the interpretation that the photograph was generated by the monkey, who cannot hold a copyright, the picture was uploaded to Wikimedia Commons, which only allows free-use images.  Naturally Slater balked at the idea that the photograph was ineligible for copyright and requested that the image be removed, claiming that Wikimedia was costing him thousands in lost revenue.

Wikimedia refused, asserting that the photograph belonged to nobody and was ineligible for copyright.  Fast forward three years, and the Copyright Office has determined in no uncertain terms that the photograph is indeed not copyrightable.  This article by David Post, a law professor at Temple University, describes the whole ordeal in more detail and gives some insight into the potential ramifications of the ruling.  According to Professor Post, this decision may mean that no machine-generated content is copyrightable:

The Report goes on to state that musical works, for instance, like all works of authorship, must be of human origin; thus, a work created by solely by an animal would not be copyrightable, nor would a work, more plausibly, generated entirely by a mechanical or an automated process. And similarly, a choreographic work performed by animals, machines, or other animate or inanimate objects is not copyrightable.

Obviously I don’t have a background in law, and prior to reading Post’s article I would have assumed that content generated by an algorithm would still be considered a product of human creation.  The human created the algorithm, and any derivative works generated by the algorithm would in some sense also be created by a human.  In that way, the copyright process would be more analogous to ownership of physical property.  If someone built a fully automated widget factory, no one would claim the widgets belong to the factory.  Yet the language in the Copyright Office’s compendium appears to suggest that if an algorithm designed the widgets, the copyright belongs to nobody:

The Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.

The Associated Press is already using artificial intelligence to write stories, and evidently for certain categories of news people cannot tell the difference between software-generated content and stories written by humans.  Are these articles copyrightable?

Automated news could be just the tip of the iceberg.  At what point are the reports, insights, and analytics generated by a machine learning algorithm disqualified from copyright?  Can a photo selected from a video stream by an algorithm be copyrighted?  I’m sure some law folks have a better idea than me, but for now, the issue seems open to debate.  The number of companies interested in this sort of work is growing rapidly, and this aspect of intellectual property law could have a big impact on their business models.  Regardless of the outcome, intellectual property rights with respect to artificial intelligence will definitely be an area to watch.

Dear computer, this essay deserves an “A”

This morning I read this thoughtful article by Alan Jacobs about automated essay grading by computers.  I think most criticisms of algorithmic grading are overblown, but the article got me thinking that some of the objections deserve more careful consideration.

Claiming computers can grade as well as humans is a difficult point to defend because, as Professor Jacobs points out, “good” writing is rather ambiguous.

The most obviously evaded question is this: When students are robo-graded, the quality of their writing improves by what measure?

Some measures of good writing can be assessed by algorithms without much difficulty, and often not even very sophisticated algorithms.  Every sentence needs a subject and a verb.  There should generally be a comma before a coordinating conjunction separating two independent clauses.  These well-established, well-defined rules, which might perhaps be better described as the grammar and lexical semantics of a piece, can easily be graded by an algorithm.  If the automated grader fails at these goals, that failure is because the algorithm needs to be improved, not because machine grading itself is fundamentally flawed.

Algorithmic grading enables a valuable feedback loop, and even the most ardent opponents of algorithmic grading would be hard-pressed not to concede that instant evaluation of sentence structure, grammar, and basic semantics would be helpful to novice writers.  The important question centers on more advanced writing:  How do you create an algorithm that, by necessity, evaluates something as abstract as a paper’s “quality” in a quantitative way?  Does following an algorithm’s assessment of good writing stunt a writer’s development because machines lack human sensibility?

I think there is merit to this point, but not because of some lofty notion that following a machine’s opinion results in writing that is formulaic, mathematical, and cold.  Yes, good writing is hard to define, but here’s a thought experiment: Imagine giving a college freshman an essay prompt on a topic covered in The Atlantic.  My guess is that almost everyone will agree that The Atlantic article is better than the student’s paper, even if the reasons given for why it is better are quite varied.  The qualities of good writing may be nebulous, but humans seem to be able to identify good writing even if they can’t give a complete description of what it entails.

Mr. Jacobs’s quote from Les Perelman in the Boston Globe explains the risk of giving papers high marks because they contain correlates of quality papers, rather than actually being quality papers:

Robo-graders do not score by understanding meaning but almost solely by use of gross measures, especially length and the presence of pretentious language. The fallacy underlying this approach is confusing association with causation. A person makes the observation that many smart college professors wear tweed jackets and then believes that if she wears a tweed jacket, she will be a smart college professor.

I don’t fully agree with this claim.  Like humans trying to describe why one article is of higher quality than another, many learning algorithms have good predictive power without incorporating a causal mechanism.  I can predict with some confidence that there will be more car accidents when there are more snowplows on the road — I don’t need to know whether the roads are slick.  An algorithm to predict good writing can likewise do well, even without identifying the causes of good writing, provided the algorithm is robust enough.  That the current generation of computer grading systems can be tricked by verbosity and florid prose is a sign that the technology is still in its infancy, not that the concept is unworkable.

The subtler, more important objection to learning writing from a computer is often overlooked.  Learning how to write from a computer is treacherous because writing is meant to be consumed by humans.  Mr. Jacobs rightly points out this too often forgotten aspect of computer grading:

Perelman’s actual arguments belie Paul’s statement that “critics like Les Perelman of MIT claim that robo-graders can’t be as good as human graders, it’s because robo-graders lack human insight, human nuance, human judgment.” It’s perfectly clear even from these excerpts that Perelman’s point is not that the robo-graders are non-human, but that they reward bad writing and punish good.

I suspect, then, that with this automated grading we’re moving perilously close to a model that redefines good writing as “writing that our algorithms can recognize.”

The writing sections of standardized tests suffer from essentially the same problem.  The formulaic, uninspired writing characteristic of these timed essays is meant to satisfy the algorithm, in this case the grading rubric, not the reader.  This sort of algorithmic evaluation is the primary reason no one in higher education takes the results of writing sections seriously, and the reason the scores aren’t really considered beyond a crude metric for whether the applicant has a basic handle on the English language.  Since the context demands everyone write to an algorithm anyway, I don’t really have any issues with eventually replacing all the graders of standardized tests with robots.  In fact, the robots may be more reliable and impartial, not susceptible to variations caused by extraneous factors like whether the grader missed lunch or got a good night’s sleep.

Outside of standardized testing the consequences of computerized grading are more nuanced.  Instant feedback is desirable and evidently also encourages students to write more.  However, learning to write from a computer may change composition in unexpected ways.  In the past, writing has evolved in lockstep with culture for the enhanced understanding of human readers.  Hopefully using an algorithm to train a generation of writers won’t interrupt that relationship.  More hazardous still, inexperienced writers may unintentionally adapt their writing to the algorithm.  We already do this sort of adaptation every day on the web.  Search engines have only recently begun to handle natural language queries like “Where is the nearest coffee shop?” effectively.  The response to “Starbucks, San Francisco” by a real person would be a puzzled stare.

Where do engineers live?

My roommate told me I ask strange questions.

The answer, it turns out, is mostly the West Coast, the Northeast… and Colorado.  It’s a little bit too small to see on the map, but the area with the greatest concentration by far is Washington, D.C.  The infographic below is based on 2012 census data and gives the number of people in each state with degrees in science, engineering, or related disciplines per 1,000 residents (code is available here, on my GitHub page).  The states with darker shades have more engineers, adjusted for population, than the lighter ones.
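
The population adjustment itself is simple arithmetic.  As a rough sketch only (the function name and the input numbers are made up for illustration; they are not census figures and this is not the code from my GitHub page):

```r
# Hypothetical helper: degree holders per 1,000 residents
degrees_per_1000 <- function(degree_holders, population) {
  1000 * degree_holders / population
}

# Made-up inputs for two imaginary states, NOT real census counts
example <- data.frame(
  state          = c("State A", "State B"),
  degree_holders = c(50000, 120000),
  population     = c(500000, 2000000)
)
example$per_1000 <- degrees_per_1000(example$degree_holders, example$population)

# Rank from most to fewest degree holders per 1,000 residents
example[order(-example$per_1000), ]
```

State B has more degree holders in total, but State A comes out ahead once the counts are scaled by population (100 versus 60 per 1,000 residents), which is the whole point of the adjustment.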

I have a hunch that these rankings track the total number of college graduates in each state pretty closely.  It would be interesting to compare these results to the number of STEM graduates adjusted by the total number of residents with bachelor’s degrees.  Just eyeballing it, the rankings seem to align relatively well with GDP per capita in each state.  Same story with the Gross State Product, although some of the states with low population density (Wyoming, for example) have relatively high GDP per capita but low total economic output.  There are also some striking similarities to the 2012 election results.  States with lots of engineers love Obama, I guess.

For the numerically minded, here is the complete list:

Rank  State                 Degrees per 1000 citizens
1     District of Columbia  185.0796006
2     Massachusetts         127.2846981
3     Maryland              121.2871601
4     Colorado              113.5560991
5     Virginia              111.3983507
6     Connecticut           111.0297746
7     New Jersey            109.4002444
8     Vermont               108.1460046
9     New Hampshire         105.0934617
10    Washington            102.7639441
11    California            95.56312689
12    New York              95.20315491
13    Oregon                94.22922723
14    Minnesota             92.63596061
15    Rhode Island          90.65519065
16    Montana               89.5092659
17    Hawaii                88.65588624
18    Delaware              86.81776098
19    Maine                 86.47737196
20    Illinois              86.14129653
21    Alaska                84.14647638
22    Pennsylvania          81.36954118
23    North Carolina        76.14895217
24    Wisconsin             75.66918193
25    Kansas                75.62430594
26    Florida               75.33509451
27    New Mexico            75.32202329
28    Michigan              73.70589973
29    South Dakota          73.19827606
30    Arizona               72.97864115
31    Georgia               72.15738012
32    Utah                  71.74794221
33    North Dakota          71.34780049
34    Wyoming               71.24064772
35    Nebraska              71.07268464
36    Idaho                 71.05623138
37    Texas                 70.16535677
38    Missouri              68.5097832
39    Ohio                  67.84525839
40    Iowa                  66.10403514
41    South Carolina        66.02437642
42    Tennessee             62.29654104
43    Indiana               61.5813976
44    Nevada                60.00980359
45    Alabama               58.19042329
46    Oklahoma              57.78928051
47    Kentucky              57.38795091
48    Louisiana             56.40486313
49    Puerto Rico           54.62100934
50    West Virginia         51.78202208
51    Arkansas              51.70248657
52    Mississippi           47.18997413