Preparing your code for computational verification
The steps in this document are being used in a pilot project.
This document describes how to prepare your code for verification, taking into account some of the most frequent issues that the Data Editor and his team have encountered in submitted replication packages.
Overview
We will describe a few checks and edits you should make to your code, in order to ensure maximum reproducibility. We then describe how to test for reproducibility before submitting to the Data Editor. All steps have been tested by undergraduate replicators, and should be easy to implement. How long they take depends on your specific code, but somebody with good knowledge of the code base can generally make these adjustments very quickly.
Much more extensive guidance on the issues addressed here is available at https://larsvilhuber.github.io/self-checking-reproducibility/. We reference specific chapters there at each of the steps.
Checklist
Print off (as PDF or on paper) the following checklist, and tick off each item as you complete it. Provide the completed checklist as part of the replication package.
- Main file: A single main file is provided that runs all code.
- Path names: All paths in code use / (forward slashes) relative to a single top-level project directory ($rootdir, $basedir, etc.).
- Dependencies: All packages/libraries/dependencies are installed via code once.
  - For Stata, these packages are installed into a subdirectory in the project ($rootdir/ado, $basedir/adofiles, etc.), and used by the code.
  - For R, renv is used (exceptions made for other package management systems if such a system is explained).
  - For Python, environments are used (native venv or conda), and the necessary top-level requirements specified (no OS-specific dependencies are included).
- Displays: All figures and tables are written out to clearly identified external files, and the authors’ versions, as used in the manuscript, are provided.
- Testing in containers: After all changes were made, the code was run using an appropriate authorized container, and the generated log files are provided.
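The Stata dependency item in the checklist can be sketched as follows. This is a minimal illustration, not a required setup: it assumes the working directory is the project root, uses `ado/plus` and `ado/personal` as example subdirectory names, and uses `estout` purely as an example of an SSC package your code might need.

```stata
* Assumption: Stata was started in the project's top-level directory
global rootdir : pwd

* Create a project-local directory for user-written packages (example name)
capture mkdir "$rootdir/ado"

* Point Stata's package locations into the project instead of the user profile
sysdir set PLUS     "$rootdir/ado/plus"
sysdir set PERSONAL "$rootdir/ado/personal"

* Install each dependency once: only if Stata cannot already find it
* ("estout" is an example package name, not a requirement)
capture which estout
if _rc == 111 {
    ssc install estout
}
```

Because the packages land inside the project directory, they travel with the replication package and do not depend on what is installed on the replicator's machine.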
Detailed instructions
Preliminary: Directory structure of a replication package
A generic replication package, housed at /my/computer/users/me/project, might have the following structure:
README.pdf
data/
    raw/
        cps0001.dat
    analysis/
        combined_data.dta
        combined_data.csv
        combined_data_codebook.pdf
code/
    01_readcps.do
    02_readfred.do
    03_table1-5.do
    04_figures1-4.do
results/
    table1.tex
    table2.xlsx
    ...
    figure1.png
    figure2.pdf
where
- data/raw has the externally acquired raw data files (not modified by the authors).
- data/analysis has the processed data files, generated by the code in this repository.
- code has the code files.
- results has all the results files.
For illustration purposes, we have used Stata .do files, and outputs in a variety of formats, but the same principles apply to other software, and to any output formats.
Note that we did not specify where the main.do file will be!
Step 1: Main file
You may or may not have a main file. The following should be adapted to your circumstances. You do not need to create a file that is called main.do if you already have one, but you may need to update your existing main file.
Reference: https://larsvilhuber.github.io/self-checking-reproducibility/02-hands_off_running.html
Creating a single main file is straightforward. However, you will want to make some minor edits depending on where, in the above template setup, the file is located:
Scenario A: main is in the code directory
The most frequent scenario we see (which we call Scenario A) amongst economists is that the main file is in the code directory:
README.pdf
data/
...
code/
main.do
01_readcps.do
02_readfred.do
...
In this case, the following generic main file will work:
local scenario "A" // Scenario A: main is in code directory
local pwd : pwd // This always captures the current directory
if "`scenario'" == "A" { // If in Scenario A, we need to change directory first
    cd ..
}
global rootdir : pwd // Now capture the directory to use as rootdir
display in red "Rootdir has been set to: $rootdir"
cd "`pwd'" // Return to where we were before and never again use cd
// Now run the rest of the code
do "$rootdir/code/01_readcps.do"
do "$rootdir/code/02_readfred.do"
do "$rootdir/code/03_table1-5.do"
do "$rootdir/code/04_figures1-4.do"
Scenario B: main is in the top-level directory
More common in other computational sciences, but also present amongst economists, is that the main file is in the top-level directory:
README.pdf
main.do
data/
...
code/
01_readcps.do
02_readfred.do
...
In this case, the following generic main file will work:
local scenario "B" // Scenario B: main is in project top-level directory
local pwd : pwd // This always captures the current directory
if "`scenario'" == "A" { // If in Scenario A, we need to change directory first
    cd ..
}
global rootdir : pwd // Now capture the directory to use as rootdir
display in red "Rootdir has been set to: $rootdir"
cd "`pwd'" // Return to where we were before and never again use cd
// Now run the rest of the code
do "$rootdir/code/01_readcps.do"
do "$rootdir/code/02_readfred.do"
do "$rootdir/code/03_table1-5.do"
do "$rootdir/code/04_figures1-4.do"
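Since the checklist asks for generated log files, a main file in either scenario could also open a log once the root directory is known. The following is only a sketch under assumptions: the log file name `main_example.log` and the log name `main` are hypothetical, and the fallback at the top exists only so that the fragment runs on its own.

```stata
* Assumption: $rootdir was set as in the main files above;
* this fallback only lets the sketch run stand-alone
if "$rootdir" == "" {
    global rootdir : pwd
}

* Close any log left open by an earlier, interrupted run
capture log close _all

* Open a plain-text log in the project directory (file name is hypothetical)
log using "$rootdir/main_example.log", replace text name(main)

display "Running from: $rootdir"
* ... the do "$rootdir/code/..." lines would go here ...

log close main
```

Writing the log relative to `$rootdir` keeps it inside the replication package, wherever the package is unpacked.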
Important
In neither scenario did we hard-code the path to our project directory /my/computer/users/me/project. This is not an omission, and it is important, because it allows the code to be run on any computer, without modification.
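The same principle carries over to the individual do-files: every input and output path can be built from `$rootdir` with forward slashes. The sketch below is illustrative only. It uses Stata's built-in `auto` dataset as a stand-in for the project data, and the output file name `table_example.csv` is hypothetical.

```stata
* Assumption: main.do has already set $rootdir;
* this fallback only lets the sketch run stand-alone
if "$rootdir" == "" {
    global rootdir : pwd
}
capture mkdir "$rootdir/results"

* Stand-in for: use "$rootdir/data/analysis/combined_data.dta", clear
sysuse auto, clear

* Build a small table of means and write it to an external file,
* using a path relative to $rootdir (file name is hypothetical)
collapse (mean) price mpg, by(foreign)
export delimited using "$rootdir/results/table_example.csv", replace
```

Figures can be handled the same way, for example with `graph export` pointing at a file under `$rootdir/results`.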