Reproducible research can be defined in a number of ways. My personal, practical definition is the tying together of code and text such that the provenance of the work—from gathering data, doing analysis, and creating graphs and tables to explaining and discussing one’s methodology and findings in a report or presentation—is embedded in the work itself. The code and text are knitted together so that changes in the data, analysis, or discussion are dynamically connected and updated. (The knitting metaphor is from Yihui Xie’s knitr package.)
Three recent books have significantly influenced how I use R in reproducible work: Dynamic Documents with R and knitr (2013) by Yihui Xie, Reproducible Research with R and RStudio (2013) by Christopher Gandrud, and the collection Implementing Reproducible Research (2014), edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. (All three books are in the R Series by CRC Press; I have no affiliation with CRC Press.)
Xi’an’s recent short book review of the latter got me thinking about how these authors have improved my work and reshaped how I think about work flow. It’s a continuous-improvement process—read, practice, implement, reread, gain new insights informed by experience, repeat. Each time I reread a short section or skim several chapters, I come away having learned something new or gained a fresh perspective on my problem.
I recommend all three books to R users at any level. There really is something here for everyone.
Yihui Xie captures the essence of the problems faced by the solo artist using conventional practice when he says,
We import a dataset into a statistical software package, run a procedure to get all results, then copy and paste selected pieces into a typesetting program, add a few descriptions, and finish a report.
The important phrase here is “copy and paste.” Anyone reporting the results of computational work will have experienced returning to the work weeks or months after it was originally performed, redoing some portion, and copying and pasting the results into a revised report. As Gandrud puts it,
Almost no actual research process is completely linear. You almost never gather data, run analyses, and present your results without going backwards to add variables, make changes to your statistical models, create new graphs, alter results tables in light of new findings, and so on. You will probably try to make these changes long after you last worked on the project and long since you remembered the details of how you did it.
My personal shorthand for reproducibility is eliminate all copy and paste. Even if the attempt is not completely successful, you will have dramatically improved your work flow by having made the attempt.
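The idea is easy to see in a tiny R Markdown sketch (the file name, data, and model here are hypothetical, not from any of the books): computed values are embedded inline with knitr, so re-knitting the document after the data change regenerates every number in the report, with nothing copied and nothing pasted.

````markdown
---
title: "Analysis report"
output: html_document
---

```{r setup, include=FALSE}
# Hypothetical data file; any change here flows into the text below.
dat <- read.csv("measurements.csv")
fit <- lm(y ~ x, data = dat)
```

The estimated slope is `r round(coef(fit)[2], 3)`, with an R-squared
of `r round(summary(fit)$r.squared, 3)`. When `measurements.csv` is
updated, re-knitting this file updates both numbers automatically.
````

Revise the data or the model, knit again, and the prose stays in sync with the analysis.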
The obstacles to full implementation of reproducible methods are fairly well stated in An argument for not using reproducible data analysis tools (2012) by Jeromy Anglim. Indeed, authors Hoefling and Rossini in Implementing Reproducible Research (Ch. 8) give great detail about their experiences and recommendations after
applying and adapting the literate statistical analysis [LSA] methodology to the work activities … of a large-scale corporate research and development project. This complex, multi-year, multiclient/stakeholder, multidata analyst (with team members and clients mixed across multiple continents) presented challenges…
Their implementation of reproducibility, in the end, “largely split the code from the final report” with a single “org-document” for code and separate documents for specific reports. They acknowledge the disadvantages, for example, it is “somewhat harder to link a table or figure in a final report back to the code that produced it,” but they clearly state their rationale, trade-offs, and workarounds.
Their work contrasts with Xie's, which focuses more on the solo artist. For example, Yihui mentions the advantages to an instructor when grading reproducible student work. I too am very much in favor of shortening grading time. And he makes the point that teaching students the essential concepts of reproducible work while they are still students will have a significant impact on their future research practice. This is good stuff. But reading Gandrud as well as Hoefling and Rossini provides a larger perspective on which practices one can expect to be effective, especially when one has collaborators (or publishers) who do not share one's enthusiasm for R and insist on using Office software.