A few years ago, Joelle Pineau, a computer science professor at McGill, was helping her students design a new algorithm when they fell into a rut. Her lab studies reinforcement learning, a type of artificial intelligence that’s used, among other things, to help virtual characters (“half cheetah” and “ant” are popular) teach themselves how to move about in virtual worlds. It’s a prerequisite to building autonomous robots and cars. Pineau’s students hoped to improve on another lab’s system. But first they had to rebuild it, and their design, for reasons unknown, was falling short of its promised results. Until, that is, the students tried some “creative manipulations” that didn’t appear in the other lab’s paper.
Lo and behold, the system began performing as advertised. The lucky break was a symptom of a troubling trend, according to Pineau. Neural networks, the technique that's given us Go-mastering bots and text generators that craft classical Chinese poetry, are often called black boxes because of the mystery of how they work. Getting them to perform well can be like an art, involving subtle tweaks that go unreported in publications. The networks are also growing larger and more complex, with huge data sets and massive computing arrays that make replicating and studying those models expensive, if not impossible, for all but the best-funded labs.
“Is that even research anymore?” asks Anna Rogers, a machine-learning researcher at the University of Massachusetts. “It’s not clear if you’re demonstrating the superiority of your model or your budget.”
Pineau is trying to change the standards. She's the reproducibility chair for NeurIPS, a premier artificial intelligence conference. Under her watch, the conference now asks researchers to submit a "reproducibility checklist" including items often omitted from papers, like the number of models trained before the "best" one was selected, the computing power used, and links to code and datasets. That's a change for a field where prestige rests on leaderboards—rankings that determine whose system is the "state of the art" for a particular task—and where there's great incentive to gloss over the tribulations that led to those spectacular results.