Preparing your files for verification - details - Guidance by the AEA Data Editor

Describing the contents of your replication package with a README

Every replication package requires a document outlining where the data comes from, what data is provided, what is required to run the code in the replication package, how to run the code, what results to expect, and where to find the results. This is conventionally called the “README”.

The AEA requires that the README contain a number of information elements. A convenient way to ensure that these elements are present is to use the template README for social science replication packages v1.1; however, you are free to provide this information in a format of your choice as well.

The following information is required in the README (unless a modifier indicates otherwise):

- Data Availability and Provenance Statements
  - Statement about Rights
  - License for Data (optional, but recommended)
  - Details on each Data Source
- Dataset list
- Computational requirements
  - Software Requirements
  - Controlled Randomness (as necessary)
  - Memory, Runtime, and Storage Requirements
- Description of programs/code
  - License for Code (optional, but recommended)
- Instructions to Replicators
  - Details (as necessary)
- List of tables and programs
- References (optional, but recommended)

Some more information is provided below; a full discussion is available in the template README for social science replication packages.

The README should be in a format that is easily readable online, such as a PDF or a TXT file. You can provide a Word or LaTeX source for the README as well, but that is not required.

Provide the README as part of your replication package, ideally in the root directory.

Data Citations

All manuscripts will be checked for data citations. If you have not done so, now is the time to add them to your manuscript.

Code

Your replication package should contain all code used to generate the results in your manuscript.

Data structure of a replication package

The AEA uses the openICPSR platform for replication packages. The platform allows users to download complete “deposits”, or only subdirectories thereof. However, deposits of replication packages at other trusted repositories are also accepted, as long as they satisfy the requirements described here.

Users must not upload ZIP packages as files - rather, ZIP files can be used to structure code and data, but should be unzipped on the platform (“import from ZIP”). The exception is when there are more than 1,000 files in the repository (see below).

The code and data should run as downloaded from the repository, without further manual modifications (creating empty subdirectories programmatically is acceptable). Because code tends to be small, but data can be large, we strongly advise against commingling data and code. Interested researchers can then download the code directory by itself if they wish, without also downloading a potentially very large data directory.

README.pdf
data/
   raw/
      cps0001.dat
   analysis/
      combined_data.dta
      combined_data.csv
      combined_data_codebook.pdf
code/
   01_create/
      01_readcps.R
      02_readfred.R
   02_analysis/
      01_table1-5.R
      02_figures1-4.R
results/
   table1.tex
   table2.tex
   ...
   figure1.pdf
   figure2.pdf
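
Since creating empty subdirectories programmatically is acceptable, the code itself can make sure that output directories exist before anything is written to them. The following is a minimal Python sketch, not a required approach; the function name and the list of directories are illustrative assumptions and should be adapted to your own package and language.

```python
from pathlib import Path

# Hypothetical list of directories that the code writes to; adjust as needed.
OUTPUT_DIRS = ["data/analysis", "results"]


def ensure_directories(root, relative_dirs=OUTPUT_DIRS):
    """Create each directory under `root` if it does not exist yet."""
    created = []
    for rel in relative_dirs:
        path = Path(root) / rel
        # parents=True builds intermediate directories; exist_ok=True makes
        # the call safe to repeat on every run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_directories(".")` at the top of a main script would create `data/analysis/` and `results/` relative to the package root, so that replicators do not need to create them by hand.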

ICPSR cannot accept deposits with more than 1,000 files. Therefore, we relax the rule that all data and code must be unzipped, though we still insist on the “smallest possible configuration”.
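
To check whether a package is affected by this limit, and which directory contributes most of the files, one can count files per directory before uploading. The sketch below is one possible way to do so in Python; the function names are illustrative assumptions, and the limit of 1,000 is taken from the text above.

```python
import os
from collections import Counter

FILE_LIMIT = 1000  # limit stated in the guidance above


def count_files(root):
    """Count files at or below every directory, keyed by path relative to root.

    The key "." holds the total number of files in the whole package.
    """
    counts = Counter()
    root = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        # Credit the files to this directory and to every ancestor up to root.
        for depth in range(len(parts) + 1):
            key = "/".join(parts[:depth]) or "."
            counts[key] += len(filenames)
    return counts


def exceeds_limit(root, limit=FILE_LIMIT):
    """Return True if the package contains more files than the limit."""
    return count_files(root)["."] > limit
```

`count_files(".").most_common(5)` lists the five directories with the most files beneath them, which usually points to the directory that should be zipped.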

In most cases, it is a particular directory that is the primary culprit. Say you have

Structure-pre:

/code: 20 files
/data/
   src1/: 25 files
   src2/: 101 files
   src3/: 3,000 files

then the ideal structure, taking into account the 1,000 file limit, would be:

Structure-post:

/code: 20 files
/data/
   src1/: 25 files
   src2/: 101 files
   src3.zip: 1 file, containing 3,000 files

(src3/ and its 3,000 files have been removed!)

Your README should tell replicators how to recover the fully unzipped structure (there are cross-platform differences in unzipping, so be precise about the final structure, rather than the method of getting there).

Alternatively, the code can handle the unzipping - optional, but more robust.
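
As an illustration of letting the code handle the unzipping, the following Python sketch extracts an archive into its target directory only if that directory is missing or empty. It is a sketch under assumptions: the function name is hypothetical, and it assumes that the entries inside the ZIP file are stored relative to the target directory (that is, src3.zip does not itself contain a top-level src3/ folder).

```python
import zipfile
from pathlib import Path


def ensure_unzipped(archive, target_dir):
    """Extract `archive` into `target_dir` unless it already has content.

    Returns True if files were extracted, False if nothing was done.
    Assumes archive entries are relative to `target_dir` (no wrapper folder).
    """
    archive = Path(archive)
    target_dir = Path(target_dir)
    if target_dir.is_dir() and any(target_dir.iterdir()):
        return False  # already unpacked; nothing to do
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return True
```

With the structure shown above, a main script could call `ensure_unzipped("data/src3.zip", "data/src3")` before reading any file from data/src3/. The README should still describe the expected final structure.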

Once you’ve adjusted that, zip up the whole structure (that is, a ZIP file that contains another ZIP file, plus the /code, /data/src1/, and /data/src2/ directories), and use “Import from ZIP” when uploading to ICPSR.

See also a similar entry in our FAQ.

When the replication package relies on confidential data that cannot be shared, or is shared under different conditions, authors will have to

- prepare a confidential (partial) replication package
  - this would contain the contents of data/confidential and possibly data/conf_analysis from the example below.
- preserve (archive) the confidential replication package
  - If the data cannot be removed from a secure enclave, they should nevertheless be archived wherever the confidential data are kept (see this FAQ)
  - If the data can be shared, but are subject to access restrictions, follow this guide on creating a separate data deposit and, when creating the restricted deposit at ICPSR, follow these instructions on how to do so.
- prepare a non-confidential replication package that contains all code, and any data that is not subject to publication controls
  - this would contain the contents of data/raw, data/analysis, code/, and for reference, results/ from the example below.
- ensure that replicators have detailed instructions on how to combine the two packages
- specify which (if any) of the results in their paper can be reproduced without the confidential data.

Clearly separate the restricted data from the open-access data, for both the raw and the processed data:

data/
   raw/
      cps0001.dat
   confidential/
      ssa.csv
   conf_analysis/
      confidential_combined.dta

Keep in mind that you may be able to provide a subset of your replication package privately to the AEA Data Editor (see the Sharing restricted-access data with the AEA Data Editor page), or to create a separate data deposit with a more limited license.

Authors may also want to consider providing “fake” or “simulated” data that allow replicators to run the code as a functionality test, even if the results obtained are not meaningful.
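
One way to produce such a file is to generate random values that follow the layout of the confidential file. The Python sketch below is purely illustrative: the column names, value ranges, and function name are hypothetical assumptions, not the layout of any real data source, and would need to be replaced by the variable names and types that your code expects. A fixed seed keeps the simulated file identical across runs.

```python
import csv
import random

# Hypothetical schema; replace with the variables your code actually reads.
COLUMNS = ["person_id", "year", "earnings"]


def write_simulated_data(path, n_rows=100, seed=12345):
    """Write a CSV of simulated records with a fixed seed for repeatability."""
    rng = random.Random(seed)  # local generator, independent of global state
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for person_id in range(1, n_rows + 1):
            year = rng.randint(2000, 2020)
            earnings = round(rng.lognormvariate(10, 1), 2)
            writer.writerow([person_id, year, earnings])
```

The README should state clearly that such a file is simulated and that results obtained from it will not match those in the manuscript.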

See also:
- Sharing restricted-access data with the AEA Data Editor
- Licensing guidance
- The Social Science Editors’ FAQ, which describes a related issue

Choosing a license

By default, the openICPSR deposit attributes a Creative Commons Attribution 4.0 International Public License to your deposit, but you can choose a different license. If you do, you must add a file called “LICENSE.txt” (by convention capitalized) to the deposit.

See our Licensing guidance for more details.

Preparing to upload

Once you are done preparing your replication package, you should upload it:

- If you have received a conditional acceptance, your replication package must be in a trusted repository. The default trusted repository is the AEA Data and Code Repository. Other trusted repositories may be acceptable (see list), but replication packages should meet the display guidelines.
- If you have confidential data that you want to transmit to the AEA Data Editor but do not want published, communicate with the AEA Data Editor directly (see this FAQ and Sharing restricted-access data with the AEA Data Editor).
- If you have received instructions during the revise-and-resubmit process to have a reproducibility check conducted, you may use the AEA Data and Code Repository, but other methods are also acceptable. Do not forget, however, that once the paper is accepted, it must be made available on a trusted repository; other methods are then no longer acceptable.

Considering the replicator

The replicator of your package is likely to be less qualified than you are. After all, you are publishing something novel.

For less frequently used software, provide a URL where the software can be obtained.

- Essentially, if the software is not listed in the figure above, provide information on how to obtain it.
- If using commercial compilers, we also suggest compiling your code using open-source or free compilers (including any free performance packages, such as Intel MKL), even if the resulting code is not the most efficient.
- As of 2021, the AEA Data Editor has access to the software on this list, and any open-source (free) software that can be installed on Windows, Linux, and macOS.

Re-run your replication package

Ideally, once you have prepared your replication package, you should re-run the code in a clean environment, possibly on a fresh computer, to ensure that (a) the package is, in fact, reproducible with minimal interaction, and (b) the results are numerically identical.

Wherever possible, we strongly encourage running in batch (non-interactive) mode.
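
To compare the outputs of the original run with those of a clean re-run, one option is to compare checksums of the files in the two results directories. The following Python sketch illustrates the idea; the function names are hypothetical. Note that it tests for byte-identical files, which is stricter than numerical identity: figure files in particular may differ between runs (for example, if the software embeds a creation date) even when the underlying numbers agree, so any reported difference should be inspected rather than taken as a failure.

```python
import hashlib
from pathlib import Path


def checksum_directory(directory):
    """Map each file's relative path to the SHA-256 digest of its contents."""
    directory = Path(directory)
    digests = {}
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            name = path.relative_to(directory).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_runs(original_dir, rerun_dir):
    """Return relative paths that are missing from one run or differ in content."""
    first = checksum_directory(original_dir)
    second = checksum_directory(rerun_dir)
    all_names = first.keys() | second.keys()
    return sorted(n for n in all_names if first.get(n) != second.get(n))
```

An empty list from `compare_runs("results", "results_rerun")` (directory names are illustrative) indicates that every file was reproduced byte for byte.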

Final checklist

Before proceeding, do check:

- you have prepared a README that provides all the relevant information, as per the README template
- your manuscript includes data citations
- your data and code deposit contains all code, including code to read in raw data, even when the data cannot be provided
- you have chosen a license (if relevant)
- your replication package has been re-executed, and reproduces the tables and figures in your manuscript faithfully

Next step

If you are ready, you can proceed to upload to the AEA Data and Code Repository (or your chosen alternate trusted repository).

References

Vilhuber, L., Connolly, M., Koren, M., Llull, J., & Morrow, P. (2022). A template README for social science replication packages. 10.5281/ZENODO.7293838
Daniels, B., Das, J., Do, Q.-T., & Siwakoti, S. (2023). Template instructions which can be included in README files that describe re-using DHS data. 10.5281/ZENODO.10983009