Preparing your code for computational verification

Published:

The steps in this document are being used in a pilot project.

This document describes how to prepare your code for verification, taking into account some of the most frequent issues that the Data Editor and his team have encountered in submitted replication packages.

Overview

We describe a few checks and edits you should make to your code to maximize reproducibility. We then describe how to test for reproducibility before submitting to the Data Editor. All steps have been tested by undergraduate replicators, and should be easy to implement. How long they take depends on your specific code, but somebody with good knowledge of the code base can generally make these adjustments quickly.

Much more extensive guidance on the issues addressed here is available at https://larsvilhuber.github.io/self-checking-reproducibility/. We reference specific chapters there at each of the steps.

Checklist

Print off (as PDF or on paper) the following checklist, and tick off each item as you complete it. Provide the completed checklist as part of the replication package.

  • Main file: A single main file is provided that runs all code.
  • Path names: All paths in code use / (forward slashes) relative to a single top-level project directory ($rootdir, $basedir, etc.)
  • Dependencies: All packages/libraries/dependencies are installed via code once.
    • For Stata, these packages are installed into a subdirectory in the project ($rootdir/ado, $basedir/adofiles, etc.), and used by the code.
    • For R, renv is used (exceptions made for other package management systems if such a system is explained).
    • For Python, environments are used (native venv or conda), and the necessary top-level requirements specified (no OS-specific dependencies are included).
  • Displays: All figures and tables are written out to clearly identified external files, and the authors’ versions, as used in the manuscript, are provided.
  • Testing in containers: After all changes were made, the code was run using an appropriate authorized container, and the generated log files are provided.
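As an illustration of the Dependencies item for Stata, the following is a minimal sketch, not a prescribed implementation. It assumes that $rootdir has already been set by the main file (it falls back to the current directory otherwise), that the project-local directory is called ado, and that estout is one of the required packages; the package name is a hypothetical example.

```stata
// Sketch only. Assumption: $rootdir is normally set by the main file.
if "$rootdir" == "" {
    global rootdir : pwd
}

// Redirect Stata's package directories into the project (directory name "ado" is an assumption)
capture mkdir "$rootdir/ado"
sysdir set PLUS     "$rootdir/ado/plus"
sysdir set PERSONAL "$rootdir/ado/personal"

// Install each required package once, only if Stata cannot already find it.
// "estout" is a hypothetical example; list your own packages here.
capture which estout
if _rc == 111 {
    ssc install estout
}
```

Because the packages land inside the project directory, they travel with the replication package, and the code does not depend on whatever happens to be installed on the replicator's computer.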

Detailed instructions

Preliminary: Directory structure of a replication package

A generic replication package, housed at /my/computer/users/me/project, might have the following structure:

README.pdf
data/
   raw/
      cps0001.dat
   analysis/
      combined_data.dta
      combined_data.csv
      combined_data_codebook.pdf
code/
  01_readcps.do
  02_readfred.do
  03_table1-5.do
  04_figures1-4.do
results/
  table1.tex
  table2.xlsx
  ...
  figure1.png
  figure2.pdf

where

  • data/raw has the externally acquired raw data files (not modified by the authors)
  • data/analysis has the processed data files, generated by the code in this repository.
  • code has the code files.
  • results has all the results files.

For illustration purposes, we have used Stata .do files, and outputs in a variety of formats, but the same principles apply to other software, and to any output formats.
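Since data/analysis and results hold only files generated by the code, one option is to let the code create these directories, so that the package also runs from a copy that contains only the raw data and the code. The lines below are a sketch under that assumption; they expect $rootdir to point at the top-level project directory (falling back to the current directory if it is not set).

```stata
// Sketch only. Assumption: $rootdir points at the top-level project directory.
if "$rootdir" == "" {
    global rootdir : pwd
}

// Create the directories that hold generated files.
// "capture" ignores the error Stata raises when a directory already exists.
foreach dir in data data/analysis results {
    capture mkdir "$rootdir/`dir'"
}
```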

Note that we did not specify where the main.do file will be!

Step 1: Main file

You may or may not have a main file. The following should be adapted to your circumstances. You do not need to create a file that is called main.do if you already have one, but you may need to update your existing main file.

Reference: https://larsvilhuber.github.io/self-checking-reproducibility/02-hands_off_running.html

Creating a single main file is straightforward. However, you will want to make some minor edits depending on where, in the above template setup, the file is located:

Scenario A: main is in the code directory

The scenario we see most frequently amongst economists (which we call Scenario A) is that the main file is in the code directory:

README.pdf
data/
...
code/
  main.do
  01_readcps.do
  02_readfred.do
...

In this case, the following generic main file will work (see note 1 at the end of this section):

// Assumes Stata's working directory is the directory that contains this main file
local scenario "A"                  // Scenario A: main is in code directory
local pwd : pwd                     // Capture the current working directory

if "`scenario'" == "A" {            // In Scenario A, move up one level to reach the project root
    cd ..
}
global rootdir : pwd                // Now capture the directory to use as rootdir
display in red "Rootdir has been set to: $rootdir"
cd "`pwd'"                          // Return to where we were before and never again use cd

// Now run the rest of the code
do "$rootdir/code/01_readcps.do"
do "$rootdir/code/02_readfred.do"
do "$rootdir/code/03_table1-5.do"
do "$rootdir/code/04_figures1-4.do"

Scenario B: main is in the top-level directory

More common in other computational sciences, but also present amongst economists, is that the main file is in the top-level directory:

README.pdf
main.do
data/
...
code/
  01_readcps.do
  02_readfred.do
...

In this case, the following generic main file will work:

// Assumes Stata's working directory is the directory that contains this main file
local scenario "B"                  // Scenario B: main is in project top-level directory
local pwd : pwd                     // Capture the current working directory

if "`scenario'" == "A" {            // Skipped in Scenario B: we are already in the project root
    cd ..
}
global rootdir : pwd                // Now capture the directory to use as rootdir
display in red "Rootdir has been set to: $rootdir"
cd "`pwd'"                          // Return to where we were before and never again use cd

// Now run the rest of the code
do "$rootdir/code/01_readcps.do"
do "$rootdir/code/02_readfred.do"
do "$rootdir/code/03_table1-5.do"
do "$rootdir/code/04_figures1-4.do"
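Both main files rely on Stata having been started in the directory that contains the main file. A small guard can make a wrong starting directory fail early with a readable message. The following is a sketch, not part of the template itself; it would sit after $rootdir has been set and before the first do line, and the file name it checks is simply the first file of the template above.

```stata
// Sketch only: place after $rootdir is set and before the first "do" line.
// Checks that the first code file of the template can be found under $rootdir.
capture confirm file "$rootdir/code/01_readcps.do"
if _rc != 0 {
    display as error "Cannot find code/01_readcps.do under: $rootdir"
    display as error "Start Stata in the directory that contains the main file, then rerun."
    exit 601
}
```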

Important

In neither scenario did we hard-code the path to our project directory /my/computer/users/me/project. This is deliberate: because no absolute path appears in the code, it can be run on any computer without modification.
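The same principle carries over to the files called by the main file. The lines below are a hypothetical excerpt from one of the numbered do-files, assuming the main file has already set $rootdir; the file names come from the template directory structure, and the commands are illustrative rather than the authors' actual code.

```stata
// Hypothetical excerpt from a do-file in code/ (sketch only).
// Assumption: the main file has already set $rootdir.

// Every path starts from $rootdir and uses forward slashes: no cd, no absolute path
use "$rootdir/data/analysis/combined_data.dta", clear

// Write a CSV copy of the analysis data next to the .dta file
export delimited using "$rootdir/data/analysis/combined_data.csv", replace
```

Forward slashes work in Stata on Windows, macOS, and Linux, so the same lines run unchanged on all three.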

  1. See the LDI Replication Lab’s setup.
