Reproducible science/research

The tsunami of data that has engulfed astronomers in the last two decades, combined with faster processors and faster internet connections, has made it much easier to obtain a result. However, these factors have also increased the complexity of a scientific analysis, such that it is no longer possible to describe all the steps of an analysis in the published paper. Citing this difficulty, many authors settle for describing only the generalities of their analysis in their papers.

It is impossible to study the validity of a result if you can't reproduce it. The complexity of modern science makes it vitally important to exactly reproduce the final result (given the same raw input data), because even a small deviation can originate in many different parts of an analysis. For example, see this review of a 7-year-old conflict in theoretical condensed matter whose source was only identified after the relevant codes were shared. Also see this report on how a project aiming to reproduce 50 high-impact cancer papers shrank to just 18.

Nature is already a black box which we are trying so hard to comprehend. Not letting other scientists see the exact steps taken to reach a result, or not allowing them to modify it (do experiments on it) is a self-imposed black box, which only exacerbates our ignorance.

To better highlight the importance of reproducibility, consider this analogy: the English style of scientific papers. Why do non-English speaking researchers have to invest a lot of time and energy in mastering English to a sufficiently high level for publishing their exciting results? Using an online translator is enough to convey the absolute final result of their analysis, for example that galaxies grow as a specific function of the age of the universe. Everyone will get the ultimate point they want to make through a poor translation. Why, then, do journals request that the paper be written in very good English?

It is because journals do not publish raw results. Understanding the method by which the result was obtained is more important than the result itself. Good language (English in this case) standardizes the reading and allows the readers to focus on the details of the method and understand them more easily. Otherwise, the readers have to waste a lot of mental energy on what a misspelled word or poorly written sentence (bad grammar/style) may mean in interpreting the result.

Exactly the same logic applies to reproducibility: without sufficient standards, readers cannot focus on the details, and will make different interpretations. Just as non-English speakers are forced to master English, people that are not trained in software MUST invest the time and energy to do so. You hear this statement a lot from many scientists: "software is not my specialty, I am not a software engineer, so the quality of my code/processing doesn't matter. Why should I master good coding style (or release my code), when I am hired to do Astronomy/Biology?". This is akin to a French scientist saying that "English is not my language, I am not Shakespeare. So the quality of my English writing doesn't matter. Why should I master good English style, when I am hired to do Astronomy/Biology?".

Other scientists should be able to reproduce, check and experiment on the results of anything that is to carry the "scientific" label. Any result that is not reproducible (due to incomplete information from the author) is not scientific: the readers have to take on faith the authors' subjective experience in the very important choices of configuration values and order of operations, which is contrary to the definition of science.

This topic has recently been gaining attention in the community. The National Academies of Sciences, Engineering, and Medicine recently published a very complete report on the necessity of "Open science" along with guidelines and recommendations on how to implement it. The editorial board of the journal Nature also recently announced their new policy regarding software and methods in their published papers, along with a Code and software submission checklist. Here is an excerpt from the new policy:

Authors must make available upon request, to editors and reviewers, any previously unreported custom computer code or algorithm used to generate results that are reported in the paper and central to its main claims. Any reason that would preclude the need for code or algorithm sharing will be evaluated by the editors who reserve the right to decline the paper if important code is unavailable.

A similar policy was introduced in the journal Science in 2011. Stodden et al. (2018) have done a very interesting study on the effectiveness of this policy change and how authors share their data and processing. Unfortunately the results aren't too encouraging: only ~10% of the papers they reviewed provided links to their data and scripts (enabling reproducibility without contact). Including those that gave the scripts after private contact, the study only deems 26% of the papers reproducible (those that share sufficient details of the method). The authors also compare with the period before the policy and see a slight (but not significant) improvement, concluding that while policy alone is useful, it is not sufficient. From that study it may be concluded that even when the policy exists, it is not effective without strict enforcement by journal editors. The study mentions that some authors didn't even know about this policy.

For some recent discussions in the astronomical community in particular, please see Shamir et al. (2018), Oishi et al. (2018) and Allen et al. (2018). This is of course not a new problem: it was discussed as far back as Roberts (1969). Gentleman and Temple Lang (2004) introduced the concept of a "research compendium" including the text, code and data of a research project in what they call a "dynamic document", which is based on Literate programming. A nice discussion on the advantages of literate programming is also given by Rob Hicks. More recently, Hinsen (2018) also gives a very thorough and fundamental discussion of this situation and provides an interesting solution, mainly applied to analytic processes (continuous mathematics, not data analysis which often deals with discrete elements).

This critical issue is also being discussed outside of the scientific community, for example in this 2018 essay, or this 2016 analysis in the New York Times. The former is mainly based on a reproducible experiment design, while the latter shows how the lack of interest in reproducing other people's results (and only trying to continue the biased methods) can grow into absurd claims (that are widely cited and used by the scientific community).

Here, a complete and working solution/template for doing research in a fully reproducible manner is proposed. We then discuss how useful this proposed system can be during the research and also after its publication. This system has been implemented and is evolving in my own research. The GNU Astronomy Utilities (a collection of software for astronomical data analysis) was also created (and improved) in parallel with my research to provide the low-level analysis tools that were necessary for maximal modularity/reproducibility. Please see the Science and its tools section of the Gnuastro book for further discussion on the importance of free software and reproducibility.

Some slides are also available with figures to help demonstrate the concept more clearly and supplement this page.

Implementation

All the necessary software (dependencies) to do (and reproduce) the analysis are free software. They are automatically downloaded, built and locally installed at the configuration step to allow very good control of the working environment (highly independent of the host operating system). The installation of dependencies is done locally (not affecting the host operating system and not needing root/administrator access). This also enables working on different projects with (possibly) different versions of the same software/dependency.

All the processing steps in the proposed reproduction pipeline are managed through Makefiles. Makefiles are arguably the simplest way to define dependency between various steps and run independent steps in parallel when necessary (to improve speed and thus be more creative).

When batch processing is necessary (no manual intervention, as in a reproduction pipeline), shell scripts usually come to mind. However, the problem with scripts for a scientific reproduction pipeline is the complexity. A script will start from the top/start every time it is run. So if you have gone through 90% of a research project and want to run the remaining 10% that you have newly added, you have to run the whole script from the start again and wait until you see the effects of the last few steps (to check for possible errors, better solutions, etc.).

The Make paradigm, on the other hand, starts from the end: the final target. It builds a dependency tree to find where it should actually start each time it is run. Therefore, in the scenario above, a researcher who has just added the final 10% of the steps of her research to her Makefile will only have to run those extra steps. This greatly speeds up the processing (enabling creative changes), while keeping all the dependencies clearly documented (as part of the Make language), and most importantly enabling full reproducibility. Since the dependencies are also clearly demarcated, Make can identify independent steps and run them in parallel (further speeding up the process). Make was designed for this purpose and it is how huge projects like all Unix-like operating systems (including GNU/Linux or macOS operating systems) and their core components are built.
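The difference between a script and Make can be illustrated with a small Python model. This is only a sketch of the idea, not how GNU Make is actually implemented: the file names in RULES are hypothetical, and modification times are plain integers instead of real file timestamps.

```python
# Hypothetical pipeline: each target lists the files it depends on.
RULES = {
    "paper.pdf": ["figure.pdf", "numbers.tex"],
    "figure.pdf": ["catalog.txt"],
    "numbers.tex": ["catalog.txt"],
    "catalog.txt": ["raw-image.fits"],
    "raw-image.fits": [],  # raw input data: nothing to build
}


def outdated_targets(goal, rules, mtime):
    """Return the targets to rebuild for `goal`, prerequisites first.

    `mtime` maps every existing file to its modification time; a file
    that is missing from `mtime` has not been built yet.
    """
    order = []    # rebuild order
    decided = {}  # target -> True if it has to be rebuilt

    def visit(target):
        if target in decided:
            return decided[target]
        deps = rules[target]
        if not deps:
            # A raw input has no recipe, so it must already exist.
            if target not in mtime:
                raise FileNotFoundError(target)
            decided[target] = False
            return False
        # Start from the target and walk down to all of its prerequisites.
        dep_rebuilt = [visit(dep) for dep in deps]
        rebuild = (
            target not in mtime                                 # never built
            or any(dep_rebuilt)                                 # a prerequisite changes
            or any(mtime[dep] > mtime[target] for dep in deps)  # older than a prerequisite
        )
        decided[target] = rebuild
        if rebuild:
            order.append(target)
        return rebuild

    visit(goal)
    return order


# The first steps were run earlier; the figure and paper rules are new.
built_so_far = {"raw-image.fits": 1, "catalog.txt": 2, "numbers.tex": 3}
print(outdated_targets("paper.pdf", RULES, built_so_far))
```

In this model only the two newly added targets are listed for building, while the existing catalog and numbers are left alone. A shell script in the same situation would redo every step.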

The output of a research project is either some numbers, a plot, or more formally, a report/paper. Therefore the output (final Makefile target) of the reproduction pipeline described here is a PDF that is created by LaTeX. Each step stores the necessary numbers, tables, or images that need to go into the final report in separate files. In particular, any processed number that must be included within the text of the final PDF is actually a LaTeX macro. Therefore, when any step of the processing is updated/changed, the numbers and plots within the text will also correspondingly change.
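As a minimal sketch of how a processing step could hand its numbers to LaTeX, the Python snippet below writes each value as a \newcommand definition. The macro names, the values and the file name macros.tex are all hypothetical, and the template itself may generate its macros differently.

```python
import tempfile
from pathlib import Path


def macro_lines(values):
    """Turn a {name: value} mapping into LaTeX \\newcommand definitions."""
    lines = []
    for name, value in values.items():
        # LaTeX command names may only contain letters.
        if not (name.isascii() and name.isalpha()):
            raise ValueError(f"invalid LaTeX macro name: {name!r}")
        lines.append(f"\\newcommand{{\\{name}}}{{{value}}}")
    return lines


def write_macros(values, path):
    """Write the macro definitions to `path`, one per line."""
    Path(path).write_text("\n".join(macro_lines(values)) + "\n")


# Hypothetical results of an analysis step.
results = {"numgalaxies": 1523, "medianredshift": "0.87"}

with tempfile.TemporaryDirectory() as tmp:
    macro_file = Path(tmp) / "macros.tex"
    write_macros(results, macro_file)
    print(macro_file.read_text(), end="")
```

The LaTeX source of the paper would then load such a file with \input and use \numgalaxies in the text instead of a hard-coded number, so re-running the analysis also updates the sentence.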

The whole reproduction pipeline (Makefiles, all the configuration files of the various software and also the necessary source files) consists of plain text files, so it doesn't take much space (usually less than 1 megabyte). Therefore the whole pipeline can be packaged with the LaTeX source of the paper and uploaded to arXiv when the paper is published. arXiv is arguably one of the most important repositories for papers in many fields (not just astronomy) with many mirrors around the world. Therefore, the fact that the whole processing is engraved in arXiv along with the paper is arguably one of the best ways to ensure that it is kept for the future.
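The packaging step could look something like the Python sketch below, which bundles only the plain-text files of a project into a compressed tarball. The directory layout, the file names and the list of suffixes are assumptions for illustration, and arXiv's own submission requirements are not modeled here.

```python
import tarfile
import tempfile
from pathlib import Path

# Assumed set of plain-text file types; adjust to the project at hand.
TEXT_SUFFIXES = {".tex", ".mk", ".conf", ".sh", ".md"}


def package_pipeline(project_dir, archive_path):
    """Pack the plain-text files under `project_dir` into a .tar.gz file."""
    project_dir = Path(project_dir)
    packed = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in sorted(project_dir.rglob("*")):
            is_text = path.suffix in TEXT_SUFFIXES or path.name == "Makefile"
            if path.is_file() and is_text:
                # Store paths relative to the project's top directory.
                arcname = path.relative_to(project_dir).as_posix()
                tar.add(str(path), arcname=arcname)
                packed.append(arcname)
    return packed


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny hypothetical project to package.
    project = Path(tmp) / "project"
    (project / "config").mkdir(parents=True)
    (project / "Makefile").write_text("paper.pdf: paper.tex\n")
    (project / "paper.tex").write_text("\\documentclass{article}\n")
    (project / "config" / "params.conf").write_text("threshold = 3\n")
    (project / "image.fits").write_bytes(b"\0" * 2880)  # data file: left out
    archive = Path(tmp) / "pipeline.tar.gz"
    print(package_pipeline(project, archive))
    print(archive.stat().st_size, "bytes")
```

Raw and intermediate data are deliberately left out: they are either public inputs or can be regenerated by running the pipeline.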

A fully working template based on the method proposed above has been defined, which can be easily configured for any research. The README.md file in that template gives full information on how to configure and adapt the pipeline to your research needs. Please try it out and share your thoughts to make it work more robustly and integrate better in the future.

The implementation introduced here was independently designed, but has many similarities with the "Research compendium" concept of Gentleman and Temple Lang (2004). The major difference is that here, we don't use Literate programming. The basic reason is modularity and abstraction. At the first/highest level, readers just want a general introduction to the whole research in an easy-to-read and less technical manner. Only when they are very interested in the result will they look into the details of the analysis (figures in a paper), and only if that further raises their curiosity will they actually invest the time to look into the raw processing steps. A similar process happens during the definition phases of a research project by its authors. Following this logic, this pipeline is designed on the principle of maintaining the research project as two separate/modular components: 1) its coding/scripts/software and 2) plain text descriptions used to generate the final paper/report.

This modularity allows the use of optimized tools for each step separately, without any intermediate tool (extra dependency) to weave or tangle the separate components. For example, with literate programming you can only find, test and fix a bug in the code on the tangled output. After fixing it there, if you forget to re-implement the fix in the literate source, the next tangle will overwrite all the changes. Human-readable text within code is indeed very important, but by definition, such low-level, or technical, description is only interesting/useful after the reader has obtained the generic view. Comments in the code serve this purpose (low-level documentation) wonderfully, thus easily complementing the paper's high-level descriptions in the LaTeX source without the need for any intermediate tool.

Practical benefits

Reproducibility is the main reason this system was designed. However, a reproduction pipeline like this also has lots of practical benefits, which are listed below. Of course, for all of these to be maximally effective, the scripts have to be nicely/thoroughly commented for easy human (not just computer) readability.

While doing the research

  • Other team members in a research project can easily run/check/understand all the steps written by other members and find possibly better ways of reaching the result or implement their part of the research in a better fashion.
  • During the research project, the authors may decide to change one of the parameters, or a new version of some of the software used may be released. With this system, updating all the numbers and plots in the paper is as simple as running a make command, and the authors don't have to worry about one part of the paper using the old configuration and another part the new one. Manually trying to change everything in the text would be prone to errors.
  • If the referee asks for another set of parameters, they can be immediately replaced and all the plots and numbers in the paper will be correspondingly updated.
  • Enabling version control (for example with Git) on the contents of this reproduction pipeline will make it very simple to revert everything back to a previous state. This will enable researchers to experiment more with alternative methods and new ideas, even in the middle of an ongoing research project. GitLab offers free private repositories, which is very useful for collaborations to privately share their work prior to its publication.
  • The authors can allow themselves to forget the details and keep their mind open to new possibilities. In any situation they can simply refer back to these scripts and see exactly what they did. This will enable researchers to be more open to learning/trying new methods without worrying about losing/forgetting the details of their previous work.

After publication

  • Other scientists can modify the parameters or the steps in order to check the effect of those changes on the plots and reported numbers and possibly find enhancements/problems in the result.
  • It serves as an excellent repository for students, or scientists with different specialties to master the art of data processing and analysis in this particular sub-field. By removing this barrier, it will enable the mixture of the experiences of the different fields, potentially leading to new insights and thus discoveries.
  • By changing the basic input parameters, the readers can try the exact same steps on other data-sets and check the result on the same text that they have read and have become familiar with.

Similar attempts

Fortunately other scientists have also made similar attempts at reproduction. A nice technical guideline and discussion is also available in Nature (August 20, 2018). Also see Gentleman and Temple Lang (2004, with some examples in research-compendium.science), and Hinsen (2018). A rich collection of resources is also available in reproduciblescience.org. Two MOOCs (massive open online courses) are also available: one in opensciencemooc.eu, and the other by INRIA.

A list of papers that I have seen so far is available below (mostly in my own field of astronomy). Hopefully this list will greatly expand in the near future. If you know of any other attempts, please let me know so I can update the list.

  • Essential skills for reproducible research computing: Contents of a workshop on reproducible research, some slides are also provided, along with a nice literature review and a Manifesto. In general Lorena Barba (author of the links here) and her team are making some great progress in this regard.
  • Kulkarni et al. (2018, arXiv:1807.09774). (as of July 2018, submitted to) Monthly Notices of the Royal Astronomical Society. The data and processing to derive the results are available on GitHub.
  • Paxton et al. (2018, arXiv:1710.08424). Astrophysical Journal Supplement Series, 234:34. Their reproduction system is described in Appendix D6, as part of a larger system for all research using the MESA software.
  • Parviainen et al. (2016, arXiv:1510.04988) Astronomy & Astrophysics, 585, A114. The reproduction scripts are available on GitHub.
  • Moravveji et al. (2016, arXiv:1509.08652). Monthly Notices of the Royal Astronomical Society, 455, L67. The reproduction information is available on Bitbucket.
  • VanderPlas and Ivezic (2015, arXiv:1502.01344) Astrophysical Journal, 812, 1. The reproduction pipeline is available on GitHub. Their implementation is described in these nice slides. Just as a comment on the slides: GitLab allows unlimited private repositories for free, along with the capability to host Git repos on your own servers. Therefore, the paper's Git repository can remain private until the paper is published.
  • Robitaille et al. (2012, arXiv:1208.4606) Astronomy & Astrophysics, 545, A39. The reproduction scripts are available on GitHub.

Since there is no adopted/suggested standard yet, each follows a different method which is not exactly like this paper's reproduction pipeline. This is still a new concept and thus such different approaches are great to make the concept more robust. Besides the suggested style here, please have a look at these methods too and adopt your own style (what you find the best in each) and share it.

Non-reproducible papers

Unfortunately not all researchers have the view described above on scientific methodology. In their view of science, only results are important. Therefore, a vague description (in the text of the paper) is enough and the exact method can be kept as a trade secret. To this class of researchers, doing science is similar to doing magic tricks (where the magician's methods are his/her trade secrets, and the audience only wants results/entertainment). Here is a list of such papers that I have come across so far:

  • Zhang et al. 2018 (Nature, June 4th, 2018, arXiv:1806.01280). In the "Code availability" section close to the end (which they had to add due to Nature's new policy on Availability of computer code and algorithm), they blatantly write: "We opt not to make the code used for the chemical evolution modeling publicly available because it is an important asset of the researchers’ toolkits".

Data acquisition

The discussions above were mainly concerned with the reproducible analysis of data after it is acquired (saved as raw data on a computer, for example tables or images). Fortunately in astronomy, raw data from major telescopes become publicly available a year or two after their acquisition and all the meta-data (telescope settings or observation strategies/protocols) are also stored in the datasets (usually as FITS format header keywords). So in astronomy, full information on data acquisition isn't usually a problem.

In other sciences, especially when such major data acquisition facilities don't exist and experiments are done in local labs with different instruments, it is very important to keep the exact data acquisition process documented and versioned. One interesting solution to this can be places such as protocol.io.

Of course, higher-level analysis, which starts from the moment the researcher(s) work on files in a computer rather than hardware, is still done on raw data to achieve a result. Therefore as described above, to have a reproducible/scientific result, the raw acquired data and the detailed processing steps done on it must be published along with the result/paper, independently of how the data was acquired. Hence reproducible data acquisition protocols are indeed a necessary, but not sufficient, step for a fully reproducible result.

Other related pages

There are many links to various resources throughout the text of the article above. Below are some links to other similar work on this subject.

  • Arizona State University library, open data/science page. This page also contains links to useful resources for doing science reproducibly.

Acknowledgments

Mohammad-reza Khellat, Alan Lefor and Alejandro Serrano Borlaff kindly provided very useful comments or found bugs in the suggested reproduction pipeline. Alice Allen, Mosè Giordano, Gérard Massacrier, Ehsan Moravveji, Peter Mitchell, Vicky Steeves, David Valls-Gabaud and Paul Wilson (in alphabetical order) kindly informed me of some of the links mentioned here.