Westat Data Scientists Detail Best Coding Practices
Data scientists and statisticians who want to make their results accessible to others face a number of challenges in ensuring their analyses are reproducible. These obstacles, as well as recommendations to address them, are spotlighted in Best Coding Practices to Ensure Reproducibility (PDF), an Issue Brief by Westat’s Data Scientists Gonzalo Rivero, Ph.D., and Kristin Chen.
“As professionals in the collection and analysis of data, we face distinct challenges due to the nature of the artifacts with which we interact, the type of output we produce, and our own technical backgrounds and priorities,” says Dr. Rivero. “Chief among these hurdles are lack of specific training in reproducibility, the competing pressure of deadlines, and the subjective and social nature of the problem itself.”
To address these hurdles, Dr. Rivero stresses that when scientists write code to share results, they must ensure that end users can understand the code well enough to verify it and contribute to it.
Co-author Kristin Chen explains that for code to be reproducible, it must be stable, portable, and easily understood: “Good code leads to a transparent, consistent, readable product so that the analyses and the thinking process can be communicated between users or between the statisticians and data scientists and future users.”
Although reproducibility is a matter of communication, workflow, and process, Dr. Rivero and Ms. Chen offer technical recommendations using examples from the ecosystem of the R language. They emphasize that what separates good code from bad code is largely how the information is organized and conveyed.
Their recommendations include:
- Embracing conventions in naming functions or objects
- Adopting a style guide that relies on idioms
- Avoiding assumptions about the execution environment
- Structuring the code in predictable ways
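The brief's own examples come from the R ecosystem; as a rough illustration of the same principles in Python, the sketch below uses descriptive, conventional names, avoids assumptions about the execution environment (the data directory is a parameter rather than a hardcoded absolute path), and structures the work as small, predictable functions. All function and field names here are hypothetical, not taken from the brief.

```python
import csv
from pathlib import Path

def load_survey_responses(data_dir: Path) -> list[dict]:
    """Read all CSV files under data_dir into a list of row dicts.

    Taking data_dir as an argument, instead of hardcoding a path,
    keeps the code portable across machines and users.
    """
    rows: list[dict] = []
    for csv_path in sorted(data_dir.glob("*.csv")):
        with csv_path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

def summarize_completion_rate(responses: list[dict],
                              status_field: str = "status") -> float:
    """Return the share of responses marked 'complete'.

    Looking rows up by field name, rather than by column position,
    avoids silent breakage if the input layout changes.
    """
    if not responses:
        return 0.0
    completed = sum(1 for r in responses if r.get(status_field) == "complete")
    return completed / len(responses)
```

A reader who opens this code cold can follow what each function does from its name and docstring alone, which is the kind of self-explanatory organization the recommendations point toward.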
The authors also address other challenges, such as code dependencies that change across iterations, including situations in which the statistical environment itself changes in ways that can alter the original intent of the code. “We cannot assume that we will have access to the same computational environment in which data processing and data analysis originally took place,” says Dr. Rivero. “We must ensure that we can replicate in the future the exact network of dependencies we used today to run our analysis.”
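One common way to make that replication possible is to record the exact versions of the packages an analysis used. In the R ecosystem this is typically handled by dedicated tooling; the hypothetical Python sketch below shows the basic idea using the standard library's `importlib.metadata`.

```python
import importlib.metadata

def snapshot_dependencies(package_names: list[str]) -> dict:
    """Return a {package: version} map for the named installed packages.

    Storing this snapshot alongside the analysis lets a future user
    reconstruct the dependency versions the code originally ran against.
    Packages that are not installed are flagged with None rather than
    being silently skipped.
    """
    versions = {}
    for name in package_names:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Saving the resulting mapping to a file checked in with the code (the same role a lockfile plays) is what allows the "exact network of dependencies" to be rebuilt later.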
Dr. Rivero says he understands the challenges that statisticians and data scientists face: “We are squarely in the terrain of software engineers, but we can all learn how to write good, usable code, especially if we put ourselves in the shoes of the end users.”
The bottom line, Dr. Rivero notes, is that writing reproducible code is an evolving, collaborative enterprise among research scientists, and it requires good tools to support good practices and processes. Because of that, Dr. Rivero offers this article as a “starting point for a wider conversation about computational reproducibility within the community of researchers.”
We are squarely in the terrain of software engineers, but we can all learn how to write good, usable code, especially if we put ourselves in the shoes of the end users.
- Gonzalo Rivero, Ph.D., Data Scientist, Statistics & Evaluation Sciences