Preparing your code for computational verification

Published:

The steps in this document are being used in a pilot project.

This document describes how to prepare your code for verification, taking into account some of the most frequent issues that the Data Editor and his team have encountered in submitted replication packages.

⚠️❗ IMPORTANT: At this point, you should only be seeing this page if you were asked by the Data Editor team to do so, and if your replication package relies on a single software package. Admissible containers are listed under "Authorized containers" in Step 5. We are not currently attempting to generalize this to multi-software replication packages, though it is possible to do so.

Overview

We will describe a few checks and edits you should make to your code, in order to ensure maximum reproducibility. We then describe how to test for reproducibility before submitting to the Data Editor. All steps have been tested by undergraduate replicators, and should be easy to implement. Whether they take a lot of time depends on your specific code, but generally, these adjustments can be made by somebody with good knowledge of the code base very quickly.

Much more extensive guidance on the issues addressed here is available at https://larsvilhuber.github.io/self-checking-reproducibility/. We reference specific chapters there at each of the steps.

⚠️❗ IMPORTANT: All but the last step can be done by anybody; no special system setup is required. However, if your institution does not allow you to install container software (Docker, OrbStack, etc.), and does not have such technology installed on a Linux cluster you have access to, then please do all the other steps, and the AEA Data Editor team will take care of the last step.

Checklist

Print off (as PDF or on paper) the following checklist, and tick off each item as you complete it. Provide the completed checklist as part of the replication package.

  • Main file: A single main file is provided that runs all code.
  • Path names: All paths in code use / (forward slashes) relative to a single top-level project directory ($rootdir, $basedir, etc.)
  • Dependencies: All packages/libraries/dependencies are installed via code once.
    • For Stata, these packages are installed into a subdirectory in the project ($rootdir/ado, $basedir/adofiles, etc.), and used by the code.
    • For R, renv is used (exceptions made for other package management systems if such a system is explained).
    • For Python, environments are used (native venv or conda), and the necessary top-level requirements specified (no OS-specific dependencies are included).
  • Displays: All figures and tables are written out to clearly identified external files, and the authors’ versions, as used in the manuscript, are provided.
  • Testing in containers: After all changes were made, the code was run using an appropriate authorized container, and the generated log files are provided.

Detailed instructions

Preliminary: Directory structure of a replication package

A generic replication package, housed at /my/computer/users/me/project, might have the following structure:

README.pdf
data/
   raw/
      cps0001.dat
   analysis/
      combined_data.dta
      combined_data.csv
      combined_data_codebook.pdf
code/
  01_readcps.do
  02_readfred.do
  03_table1-5.do
  04_figures1-4.do
results/
  table1.tex
  table2.xlsx
  ...
  figure1.png
  figure2.pdf

where

  • data/raw has the externally acquired raw data files (not modified by the authors)
  • data/analysis has the processed data files, generated by the code in this repository.
  • code has the code files.
  • results has all the results files.

For illustration purposes, we have used Stata .do files, and outputs in a variety of formats, but the same principles apply to other software, and to any output formats.

Note that we did not specify where the main.do file will be!

Step 1: Main file

You may or may not have a main file. The following should be adapted to your circumstances. You do not need to create a file that is called main.do if you already have one, but you may need to update your existing main file.

Reference: https://larsvilhuber.github.io/self-checking-reproducibility/02-hands_off_running.html

Creating a single main file is straightforward. However, you will want to make some minor edits depending on where, in the above template setup, the file is located:

Scenario A: main is in the code directory

The most frequent scenario we see (which we call Scenario A) amongst economists is that the main file is in the code directory:

README.pdf
data/
...
code/
  main.do
  01_readcps.do
  02_readfred.do
...

In this case, the following generic main file will work (see note 1 at the end of this document):

local scenario "A"          // Scenario A: main is in code directory
local pwd : pwd                     // This always captures the current directory

if "`scenario'" == "A" {             // If in Scenario A, we need to change directory first
    cd ..
}
global rootdir : pwd                // Now capture the directory to use as rootdir
display in red "Rootdir has been set to: $rootdir"
cd "`pwd'"                            // Return to where we were before and never again use cd

// Now run the rest of the code
do "$rootdir/code/01_readcps.do"
do "$rootdir/code/02_readfred.do"
do "$rootdir/code/03_table1-5.do"
do "$rootdir/code/04_figures1-4.do"

Scenario B: main is in the top-level directory

More common in other computational sciences, but also present amongst economists, is that the main file is in the top-level directory:

README.pdf
main.do
data/
...
code/
  01_readcps.do
  02_readfred.do
...

In this case, the following generic main file will work:

local scenario "B"          // Scenario B: main is in project top-level directory
local pwd : pwd                     // This always captures the current directory

if "`scenario'" == "A" {             // If in Scenario A, we need to change directory first
    cd ..
}
global rootdir : pwd                // Now capture the directory to use as rootdir
display in red "Rootdir has been set to: $rootdir"
cd "`pwd'"                            // Return to where we were before and never again use cd

// Now run the rest of the code
do "$rootdir/code/01_readcps.do"
do "$rootdir/code/02_readfred.do"
do "$rootdir/code/03_table1-5.do"
do "$rootdir/code/04_figures1-4.do"
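If you are unsure which scenario applies to your package, a quick check from a terminal can tell you. The following is a minimal sketch, assuming the main file is called main.do and the code directory is called code, as in the template above; the function name detect_scenario is hypothetical.

```shell
# Report which layout a project uses:
#   "B"    if main.do sits in the top-level project directory,
#   "A"    if main.do sits in the code/ subdirectory,
#   "none" if neither location has a main.do.
# The body runs in a subshell ( ... ) so its variables do not leak.
detect_scenario() (
    project_dir="${1:-.}"
    if [ -f "$project_dir/main.do" ]; then
        echo "B"
    elif [ -f "$project_dir/code/main.do" ]; then
        echo "A"
    else
        echo "none"
    fi
)

# Usage: run from the top-level project directory
detect_scenario .
```

Adapt the file and directory names if your main file is named differently.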

Important

In neither scenario did we hard-code the path to our project directory /my/computer/users/me/project. This is not an omission, and it is important, because it allows the code to be run on any computer, without modification.
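To look for leftover hard-coded paths in your own code, something like the following sketch may help. The search patterns (Windows drive letters, /Users/, /home/) are illustrative assumptions and will not catch every case; the function name find_hardcoded_paths is hypothetical.

```shell
# List lines in code files that look like hard-coded absolute paths.
# Patterns are guesses at common cases, not an exhaustive list:
#   - a drive letter such as C:\ or C:/ (not preceded by another letter,
#     so that URLs like http:// are not flagged)
#   - /Users/ (typical on macOS) and /home/ (typical on Linux)
find_hardcoded_paths() (
    code_dir="${1:-code}"
    grep -rnE '(^|[^A-Za-z])[A-Za-z]:[\\/]|/Users/|/home/' "$code_dir" || true
)

# Usage: find_hardcoded_paths code
```

Each reported line should be reviewed by hand and rewritten relative to $rootdir.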

Step 2: Path names

Windows computers use \ (backslashes) in path names, while Mac and Linux computers use / (forward slashes).

Two facts:

  • Every statistical programming language can use generic path names using / (forward slashes). This ensures wide reproducibility.
  • The use of \ (backslashes) in path names breaks code on Mac and Linux computers.

About 40% of replication packages in economics appear to be submitted by researchers using computers running macOS or Linux. With a bit of simplified math, if that share is representative of future replicators, then 40% of users will not be able to run the other 60% of replication packages (those written on Windows) without some potentially widespread edits, because of those backslashes.

You should thus replace all path names in your code to use / (forward slashes), or appropriate functions. This is straightforward:

Stata

// Instead of
use "data\analysis\combined_data.dta", clear
// Use
use "data/analysis/combined_data.dta", clear
// or better
use "$rootdir/data/analysis/combined_data.dta", clear

R

# Instead of
data <- read.csv("data\\analysis\\combined_data.csv")
# Use
data <- read.csv("data/analysis/combined_data.csv")
# or better
data <- read.csv(file.path(rootdir, "data", "analysis", "combined_data.csv"))

and similarly for other languages.

Implementing

In many cases, you can just globally replace all \ with / in your code files. Caution is warranted, however, if your code explicitly writes out LaTeX code, which also (legitimately) uses \. In that case, you will need to be more careful.

Expert tip

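If using a (Bash or Zsh) terminal, you likely have the sed command available. You can use it to replace all backslashes with forward slashes in all .do files in the code directory as follows. The command is written for GNU sed, as found on Linux and WSL; the BSD sed shipped with macOS requires an explicit backup suffix after -i, which may be empty (sed -i '' ...).

sed -i 's+\\+/+g' code/*.do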
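Before running a global replacement, it can be useful to preview which lines would be affected, in particular to spot lines that write out LaTeX. A minimal sketch, assuming the code lives in .do files in a single directory; the function name list_backslash_lines is hypothetical.

```shell
# Print every line (prefixed by file name and line number) in the .do files
# of a directory that contains at least one backslash.
# /dev/null is added to the file list so grep always prints file names,
# even when only one .do file exists.
list_backslash_lines() (
    code_dir="${1:-code}"
    grep -n '\\' "$code_dir"/*.do /dev/null 2>/dev/null || true
)

# Usage: list_backslash_lines code
```

If the output contains only path names, the global replacement is likely safe; if it contains LaTeX commands, edit those files by hand.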

Step 3: Dependencies

Stata packages

Stata users frequently use user-written packages, which are made available to the Stata community via the Stata Journal, SSC, or Github. They are typically installed using a small number of variants of the net install command (including ssc install).

Replicators need to have the same versions of these packages installed. Stata does not (currently) provide a way to install older versions of packages, and a regular occurrence of reproducibility failure is due to changes in packages over time. We have some simple solutions to this problem.

First, use an environment to permanently install project-specific packages once and for all.

Define the environment in your main file, after setting $rootdir:

Reference: https://larsvilhuber.github.io/self-checking-reproducibility/12-environments-in-stata.html and https://github.com/AEADataEditor/replication-template/blob/master/template-config.do#L129

/* install any packages locally */
di "=== Redirecting where Stata searches for ado files ==="
capture mkdir "$rootdir/ado"
adopath - PERSONAL
adopath - OLDPLACE
adopath - SITE
sysdir set PLUS     "$rootdir/ado/plus"
sysdir set PERSONAL "$rootdir/ado"       // may be needed for some packages
sysdir

From this point on, all installed packages will be installed into $rootdir/ado, and Stata will look there first when loading packages.

Install packages once if not present, but don’t reinstall if already present.

Reference: https://github.com/AEADataEditor/replication-template/blob/master/template-config.do#L174

*** Add required packages from SSC to this list ***
local ssc_packages ""
    // Example:
    // local ssc_packages "estout boottest"
    // 
    display in red "============ Installing packages/commands from SSC ============="
    display in red "== Packages: `ssc_packages'"
    if !missing("`ssc_packages'") {
        foreach pkg in `ssc_packages' {
            capture which `pkg'
            if _rc == 111 {                 // 111: command not found
                dis "Installing `pkg'"
                ssc install `pkg'
            }
            which `pkg'
        }
    }

Some special cases (usually not necessary)

For some packages, the package name is not the same thing as the command name. Example: moremata. For these packages, the above code does not work. Use this code (see note 2 at the end of this document):

Reference: https://github.com/AEADataEditor/replication-template/blob/master/template-config.do#L187

    // If you have packages that need to be unconditionally installed (the name of the package differs from the included commands), then list them here.
    // examples are moremata, egennmore, blindschemes, etc.
local ssc_unconditional ""
/* add unconditionally installed packages */
    display in red "=============== Unconditionally installed packages from SSC ==============="
    display in red "== Packages: `ssc_unconditional'"
    if !missing("`ssc_unconditional'") {
        foreach pkg in `ssc_unconditional' {
            dis "Installing `pkg'"
            cap ssc install `pkg'
        }
    }

Packages that are not on SSC may need to be net installed from other sources, including Github and personal websites. Again, this does not neatly work with a specific command check, and thus you may need to unconditionally install them. Use this code:

    // If you have packages that need to be unconditionally installed from other sources (not SSC), then list them here.
    // Example: grc1leg
  net install grc1leg, from("http://www.stata.com/users/vwiggins/")
    // Example when net install is not an option 
  cap mkdir "$rootdir/ado/plus/e"
  cap copy http://www.sacarny.com/wp-content/uploads/2015/08/ebayes.ado "$rootdir/ado/plus/e/ebayes.ado"

Adding to replication package

The following files should be included in your replication package:

ado/*
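To confirm that the project-local ado directory actually contains installed packages before you assemble the replication package, a simple count can serve as a sanity check. This is a sketch only; the function name count_ado_files is hypothetical, and you should pass whichever directory your main file configures via sysdir.

```shell
# Count the .ado files found anywhere below a project-local ado directory.
# Prints 0 if the directory does not exist.
count_ado_files() (
    ado_dir="${1:-ado}"
    if [ -d "$ado_dir" ]; then
        # tr strips the padding that some wc implementations add
        find "$ado_dir" -type f -name '*.ado' | wc -l | tr -d ' '
    else
        echo 0
    fi
)

# Usage: count_ado_files ado
```

A count of 0 after running the main file suggests that packages are still being loaded from a system-wide location.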

R packages

For R packages, we suggest that users use renv, and do not set a specific CRAN mirror. We refer users to the renv documentation for details, but in a nutshell, for an existing R project that is not using renv, the following commands should be run in the R console:

install.packages("renv")  # only once
renv::init()               # only once per project
renv::snapshot()           # only once per project, after all packages are installed; when prompted, choose to install all detected packages and then snapshot
renv::status()             # to check status

This will create a file renv.lock in the top-level directory of your project.

Adding to replication package

The following files should be included in your replication package:

.Rprofile
renv.lock
renv/activate.R
renv/settings.json

Do not include the entire renv directory, in particular not the renv/library subdirectory, as it is platform-specific (of no use to other platforms), and can be very large.
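The file list above can be checked from a terminal before packaging. The following is a minimal sketch under the assumption that it is run against the top-level project directory; the function name check_renv_files is hypothetical.

```shell
# Check that the renv files listed above are present, and that the
# platform-specific renv/library directory is not.
# Prints one line per problem; exit status is 0 only if nothing is wrong.
# The body runs in a subshell, so "exit" only leaves the function.
check_renv_files() (
    project_dir="${1:-.}"
    status=0
    for f in .Rprofile renv.lock renv/activate.R renv/settings.json; do
        if [ ! -f "$project_dir/$f" ]; then
            echo "MISSING: $f"
            status=1
        fi
    done
    if [ -d "$project_dir/renv/library" ]; then
        echo "EXCLUDE from package: renv/library"
        status=1
    fi
    exit "$status"
)

# Usage: check_renv_files .
```

Note that renv/library will normally exist in your working copy; the check is meant for the copy of the project that you are about to deposit.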

Step 4: Displays

Displays (figures and tables) should be written out to external files, and the authors’ versions, as used in the manuscript, should be provided. In the prototypical replication package structure above, these files would be in the results directory.

Reference: https://larsvilhuber.github.io/self-checking-reproducibility/03-automatically_saving_figures.html and https://github.com/labordynamicsinstitute/replicability-training/wiki/How-to-output-tables-and-figures

Figures

  • All figures can be written out to files. Journals prefer PDF and EPS files, but PNG files are convenient. You can output multiple formats.
  • Whenever you display a figure, also export it to a file. It’s a simple command.

Stata

// Example for PNG
graph export "$rootdir/results/figure1.png", replace width(1200) height(800) 
// Example for PDF
graph export "$rootdir/results/figure1.pdf", replace

R

# Example for PNG if using standard R
png(filename = file.path(rootdir, "results", "figure1.png"), width = 1200, height = 800)
plot(x, y)  # your plotting code here
dev.off() 
# Example if using ggplot2
ggsave(filename = file.path(rootdir, "results", "figure1.png"), plot = myplot, width = 12, height = 8, units = "in", dpi = 100)

More complex figures

For more complex figures, it may be easier to simply write out the data underlying the figure to an Excel sheet, and create the figure there. See https://github.com/labordynamicsinstitute/replicability-training/wiki/How-to-output-tables-and-figures#arbitrary-data-to-excel on how to write out the underlying data. You would then include the Excel file that maps the data into a figure with your replication package.

Tables

Tables may be more complex. Simple tables can be written out using various tools:

Stata

esttab or outreg2, also putexcel. For fancier stuff, treat tables as data, use regsave or export excel to manipulate.

R

xtable, stargazer, others.

More complex tables

For more complex tables, it may be easier to simply write out entire matrices, or individual numbers, to an Excel sheet, and compose the table there. See https://github.com/labordynamicsinstitute/replicability-training/wiki/How-to-output-tables-and-figures#examples for an example, especially if you have already been compiling your tables in Excel. You would then include the Excel file that maps the data into your preferred table layout with your replication package.
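For Step 4 as a whole, it can help to verify after a full run that every display you expect was actually written out. A minimal sketch, assuming all outputs land in a single results directory as in the example layout; the function name check_outputs is hypothetical, and the file names in the usage line are the ones from the example structure above.

```shell
# Check that each expected display file exists and is non-empty.
# First argument: results directory; remaining arguments: file names.
# Prints one line per missing or empty file; exit status is 0 only if
# all files are present.
check_outputs() (
    results_dir="$1"
    shift
    status=0
    for f in "$@"; do
        if [ ! -s "$results_dir/$f" ]; then
            echo "MISSING or empty: $f"
            status=1
        fi
    done
    exit "$status"
)

# Usage:
# check_outputs results table1.tex table2.xlsx figure1.png figure2.pdf
```

Running the check on an emptied results directory before the run, and again afterwards, shows which files the code itself produces.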

Step 5: Testing in containers

After you have made all the above changes, you should test your code in an appropriate authorized container.

⚠️❗ IMPORTANT: If you do not have Docker installed on your computer, do not have the rights to install Docker on your computer, or do not have access otherwise to Docker, please do not attempt this, and skip this step. The AEA Data Editor team will take care of the last step.

Reference: https://larsvilhuber.github.io/self-checking-reproducibility/80-docker.html
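To find out quickly whether Docker is available in your terminal, you can check whether the docker command is on your PATH. This is only a sketch: the helper name have_command is hypothetical, and finding the command does not guarantee that the Docker service is running or that you have permission to use it.

```shell
# Return success if the named command can be found by the shell.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command docker; then
    echo "docker found: Step 5 can be attempted"
else
    echo "docker not found: skip Step 5"
fi
```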

Authorized containers

The following containers are authorized for testing, as they are reliably available and achieve the desired transparency.

  • Stata containers for versions from StataNow 19 back to 11, provided by the Social Science Data Editors at https://hub.docker.com/u/dataeditors, such as dataeditors/stata18_5-mp:2025-02-26. (requires a license)
  • R containers provided by the Rocker Project, such as rocker/r-ver:4.3.1 or rocker/tidyverse:4.3.1 (which includes the tidyverse packages).
  • MATLAB + Dynare containers provided by the Dynare Project at https://hub.docker.com/r/dynare/dynare, such as dynare/dynare:5.3-2024-05-21. See the project page for the mapping between containers and MATLAB versions. (requires a MATLAB license)

If you know of a different container that we should add to this list, please let us know. The AEA Data Editor’s Github profile has a few other containers that have worked, but may be too advanced for the typical user.

⚠️❗ IMPORTANT: Do not simply provide us with a custom container not on the list above. Transparency requires that the container be built, using a Dockerfile or apptainer.def file, from publicly available sources. While we will happily use your container, it must be built from one of the above sources, or well-known “standard” sources, such as “Docker Official Images” in the Dockerhub library space (e.g., https://hub.docker.com/_/python).

Steps

  • Install the software necessary for running containers.
  • All example commands below are from a Bash or Zsh terminal, which are standard on Mac and Linux, as well as on Windows if using WSL. If you do not have WSL on Windows and are using PowerShell, the same principles apply, but the syntax may differ.

When code has been adjusted as in Steps 1-4, no complex adjustment of containers is necessary.

  • Run the container, mounting your project directory into the container. For example, if your project is in /my/computer/users/me/project, you would use a command such as this (example for Stata):

Preliminaries

(may need some adjustment, depending on your license)

VERSION=18_5
TAG=2025-02-26
TYPE=mp
MYHUBID=dataeditors
MYIMG=stata${VERSION}
# TYPE must be defined before CONTAINER, which uses it
CONTAINER=$MYHUBID/${MYIMG}-${TYPE}:${TAG}
STATALIC=/path/to/your/stata/stata.lic

Explanations:

  • VERSION: This is the Stata version. StataNow is referenced with a _5 suffix, otherwise, this corresponds to your (major) Stata version number.
  • TAG: This is the date the container was built, in YYYY-MM-DD format. Recent Stata containers do not (on purpose) have a latest tag, but older ones (that are no longer maintained) do; for those, you can replace the date with latest.
  • TYPE: This is the Stata edition covered by your license, such as mp or se.
  • CONTAINER: is the fully qualified name of the container to be used. It is built from various components. For Stata images, these are maintained by dataeditors on Dockerhub. All available Stata containers and tags can be viewed on https://hub.docker.com/u/dataeditors. The precise way to call the container may depend on the version. For instance, for versions prior to 18, the -${TYPE} suffix is not used.
  • STATALIC: Is the path (in the notation used by the terminal you are using) to your Stata license file stata.lic. You need to have a valid Stata license file for the version of Stata you are using.

If you have only an older license, or a non-MP license, you may need to replace VERSION, TAG, and TYPE accordingly. For instance, if you have a Stata 16 SE license, you would set VERSION=16, TAG=2023-06-13, and TYPE=se, and remove -${TYPE} from the CONTAINER definition.
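The naming rule described above (a -${TYPE} suffix from version 18 onward, none before) can be captured in a small helper, so that the container name does not need to be edited by hand. This is a sketch based only on the rule and examples stated in this document; the function name stata_container_name is hypothetical, and you should confirm the resulting name against the list of available images.

```shell
# Build the fully qualified Stata container name from its components.
#   $1: version, e.g. 18_5 (StataNow) or 16
#   $2: edition, e.g. mp or se
#   $3: tag,     e.g. 2025-02-26
# Versions prior to 18 do not carry the edition suffix.
stata_container_name() (
    version="$1"
    edition="$2"
    tag="$3"
    major="${version%%_*}"    # drop a StataNow suffix such as _5
    if [ "$major" -ge 18 ]; then
        echo "dataeditors/stata${version}-${edition}:${tag}"
    else
        echo "dataeditors/stata${version}:${tag}"
    fi
)

# Usage:
# CONTAINER=$(stata_container_name 18_5 mp 2025-02-26)
```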

Test the container

docker run -it --rm \
  --volume ${STATALIC}:/usr/local/stata/stata.lic \
  --entrypoint stata-${TYPE} \
  ${CONTAINER}

You should see the usual Stata prompt. Type exit to leave Stata.

Run the container

docker run -it --rm \
  --volume ${STATALIC}:/usr/local/stata/stata.lic \
  --volume $(pwd):/project \
  --workdir /project \
  --entrypoint stata-${TYPE} \
  ${CONTAINER} -b main.do

The command above applies to a Scenario B setup. If using a Scenario A setup, use:

docker run -it --rm \
  --volume ${STATALIC}:/usr/local/stata/stata.lic \
  --volume $(pwd):/project \
  --workdir /project/code \
  --entrypoint stata-${TYPE} \
  ${CONTAINER} -b main.do
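The two commands differ only in the --workdir argument. If you script your runs, that choice can be isolated in one place, as in the following sketch. The function name container_workdir is hypothetical; the paths assume the project is mounted at /project and the code directory is called code, as in the commands above.

```shell
# Map a scenario letter to the working directory inside the container.
#   A: main.do is in the code directory
#   B: main.do is in the top-level project directory
container_workdir() (
    case "$1" in
        A) echo "/project/code" ;;
        B) echo "/project" ;;
        *) echo "unknown scenario: $1" >&2; exit 1 ;;
    esac
)

# Usage: pass the result to docker run, for example
#   --workdir "$(container_workdir A)"
```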

Success

If your code runs without error, and produces all expected output files, you are done! You can now submit your replication package to the Data Editor, along with the completed checklist from above, and the generated main.log (which should be in the same directory as main.do) as evidence.

If your code does run into problems, the generated main.log should have clues as to what went wrong. You should be able to fix these issues, and re-run the code in the container, until it runs without error.
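A quick way to locate problems in a long log is to search for Stata return codes, which Stata prints as r(NNN); at the start of a line when a command fails. A minimal sketch; the function name log_error_lines is hypothetical, and errors suppressed with capture will not show up this way.

```shell
# Print the line number and text of every Stata return-code line,
# such as r(601);, found in a log file. Prints nothing for a clean log.
log_error_lines() (
    log_file="${1:-main.log}"
    grep -nE '^r\([0-9]+\);' "$log_file" || true
)

# Usage: log_error_lines main.log
```

The lines just before each reported line number in the log usually name the command that failed.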

Problems?

If you run into problems in Step 5, no worries, simply submit all the files as modified in Steps 1-4, along with the completed checklist, and we will handle the remaining issues.

  1. See the LDI Replication Lab’s setup here

  2. A more customized setup might check for a package-specific file in the ado directory, such as the <package>.pkg, but this is more complex and may not always work. 
