What to do when PII is detected in a replication package?
I had previously described a very similar scenario in a blog post about the PSID, but I thought it would be useful to write a more general post about what to do when personally identifiable information (PII) is detected in a replication package. This reprises a lot of the same tasks as the earlier post.
The AEA Data and Code Availability Policy asks that authors provide access to all data used in a published paper. But it also acknowledges that there are cases where data cannot be shared. This may be due to privacy concerns, proprietary rights to the data that are not the author’s, or terms of use that prohibit the sharing, redistribution, or publication of data.
In principle, Personally Identifiable Information (PII) should never be included in replication packages. There are multiple checks for this. Since we started creating reports, we have run PII checks, manually or semi-automatically (based on J-PAL’s PII_stata_scan.do), and the replicator adds notes on the results. More recent reports (since 2024) always include text alerting authors to check their data, regardless of the outcome of the check. When depositing data, authors are required to assert that individuals cannot be identified.
But… it happens. Authors believe that their team has removed all PII, and yet some leaks in. Sometimes, it’s because they forgot to remove the verbatim answer to “who takes care of your child when you are at work”. Sometimes, it’s the IP address of the respondent.
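A first-pass screen of this kind can be sketched in a few lines. The following is a minimal Python sketch, not J-PAL’s PII_stata_scan.do and not the procedure used for AEA reports; the keyword list, the function names, and the sample rows are all made up for illustration. It flags columns whose names hint at PII, or whose values parse as IP or email addresses.

```python
import ipaddress
import re

# Hypothetical keyword list; a real scan would use a longer, project-specific one.
NAME_HINTS = ("name", "address", "phone", "email", "birth", "ssn", "gps", "ip")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_ip(value):
    """Return True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False


def flag_columns(rows):
    """Return {column: set of reasons} for columns that deserve a manual look."""
    flags = {}
    for row in rows:
        for column, value in row.items():
            reasons = flags.setdefault(column, set())
            # Substring match on the variable name; deliberately over-inclusive.
            if any(hint in column.lower() for hint in NAME_HINTS):
                reasons.add("suspicious name")
            text = str(value)
            if looks_like_ip(text):
                reasons.add("IP-like value")
            if EMAIL_RE.match(text.strip()):
                reasons.add("email-like value")
    # Keep only the columns that triggered at least one rule.
    return {column: reasons for column, reasons in flags.items() if reasons}


# Made-up rows, e.g. as produced by csv.DictReader on an exported survey file.
sample = [
    {"hh_id": "101", "respondent_ip": "192.0.2.17", "childcare_who": "my sister Ana"},
    {"hh_id": "102", "respondent_ip": "198.51.100.4", "childcare_who": "daycare"},
]
flags = flag_columns(sample)
print(flags)
```

In this sample only respondent_ip is flagged. The free-text childcare_who column passes unnoticed even though it contains a name, which illustrates how verbatim answers slip through: an automated screen complements, but does not replace, a manual review.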
So what happens when an author inadvertently does share such data, and now needs to remove it? As a reminder, the author is committed (by signing an agreement) to not withdrawing the replication package. How to proceed? Read on!
In a nutshell
The gist of the solution is the following, for AEA publications.
- The author requests that the repository (in the AEA case, ICPSR) unpublish the original replication package (call it V1), i.e., remove the ability to download the package. This is also called “de-accessioning”. The journal (data editor, me) should be notified at the same time.
- Remove the infringing data from V1 of the replication package.
- [Ideally] Create a new, separate deposit that is compliant with IRB restrictions on PII. I expand on this below.
- Document the new location of the now absent data in the README of the replication package, and how somebody can access it.
- Revise your V1 replication package to take into account the changed location or coding of variables, as per our Policy on Revisions of Data and Code Deposits in the AEA Data and Code Repository.
- Submit the revised replication package to the journal.
- The journal publishes V2 of the replication package, which is now compliant.
Important notes:
- At no time should any code be removed from the package. If any changes to the code are needed, they should be minimal.
- It is never acceptable to simply remove the old package, without handling replacement.
Preliminary steps
The first steps involve getting access to your own replication package. The steps differ depending on when the replication package was published.[1]
- For authors with papers published before 2019 (approximately):
  - All authors will need to get an openICPSR login, which will serve them both for the changes to the AEA deposit and for creating the new deposit for the PII-laden data.
  - They will need to request access to the deposit. The openICPSR helpdesk may be able to help with that. The authors should mention the URL or DOI, and the openICPSR login they want the deposit associated with.
- For authors with more recent papers:
  - They will need to re-create an ICPSR account with the same email as previously used, since ICPSR changed their login system in Feb 2024. The openICPSR helpdesk may help with that.
  - They then need to create a new version.
Rules that apply to revisions of replication packages
Here are a few simple rules:
- The modified package should differ only as necessary from the original package. Less work for the author, greater clarity for replicators, everybody wins.
- Journal-specific policies may apply. For instance, the AEA has a Policy on Revisions of Data and Code Deposits in the AEA Data and Code Repository. Among other things, it requires that all changes be clearly identified.
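One way to check that a revised package differs only as necessary is to compare file checksums between the two versions before resubmitting. The sketch below is only an illustration: it assumes both versions are unpacked in local folders, and the folder names, file names, and file contents in the demonstration are hypothetical stand-ins.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def summarize_changes(v1_dir, v2_dir):
    """List files removed from, added to, or modified between two package versions."""
    old, new = file_digests(v1_dir), file_digests(v2_dir)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "modified": sorted(name for name in set(old) & set(new) if old[name] != new[name]),
    }


# Demonstration with throwaway folders standing in for the V1 and V2 packages.
with tempfile.TemporaryDirectory() as tmp:
    v1, v2 = Path(tmp, "v1"), Path(tmp, "v2")
    for folder in (v1, v2):
        (folder / "code").mkdir(parents=True)
        (folder / "code" / "main.do").write_text("* unchanged analysis code\n")
    (v1 / "my_survey_1.dta").write_text("stand-in for the PII-laden file")
    (v1 / "README.pdf").write_text("stand-in for the original README")
    (v2 / "README.pdf").write_text("stand-in for the README pointing to the restricted deposit")
    (v2 / "CHANGES.txt").write_text("V1: Original deposit\nV2: Removed PII-laden data.\n")
    report = summarize_changes(v1, v2)

print(report)
```

A report like this can also serve as a starting point for the list of changes that the revision policy asks authors to identify.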
Fixing the problem: where to put the data?
Once the preliminaries are done, here’s what authors need to do to remain compliant with AEA policy (similar rules may apply at other journals):
- Consider how to preserve the PII-laden data.
Ideally, you already have a copy of the data, to be compliant with AEA policy (5 years post-publication). There are options for preserving the data at your institution, or archives like ICPSR, but they vary over time. Talk to your data librarian, or to data custodians at ICPSR. Talk to the Data Editor.
- The AEA strongly prefers that such deposits NOT require the authors’ consent to access.
Rather, access should be granted based on clear terms of use (“provide IRB approval”), vetted by individuals not connected to the author, and likely to be present long after the author has retired. Institutions, like your own university or ICPSR, are ideal.
Fixing the problem: what data to include?
If the variables that contain PII are not essential to the replication of the paper, then they can simply be removed from the replication package. This must still be documented!
If, however, they are important to the analysis (left-hand side or right-hand side), you will need to be more creative. You may document that certain analyses are no longer feasible without the PII-laden data, but you should ensure that the code continues to run for all other analyses (for instance, because the PII-laden variables only affect one out of a gazillion robustness checks).
You may also be able to recode the PII, replacing the values with innocuous but still unique identifiers. Finally, you may need to replace variables with fake or synthetic data, which will ensure the code runs, but may not produce the right results. Again, it is very important that this is documented in the README!
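As an illustration of the recoding option, here is a minimal Python sketch that replaces each distinct PII value with an arbitrary but unique label. The function names, the “R0001”-style label format, and the sample rows are assumptions made for this example, not a prescribed method. Whether recoding is sufficient depends on the data and on your IRB’s requirements.

```python
import random


def build_crosswalk(values, seed=None):
    """Map each distinct PII value to an arbitrary but unique label such as 'R0003'."""
    distinct = sorted(set(values))
    # Shuffle so that the labels do not reveal the alphabetical order of the originals.
    random.Random(seed).shuffle(distinct)
    width = max(4, len(str(len(distinct))))
    return {value: f"R{index:0{width}d}" for index, value in enumerate(distinct, start=1)}


def pseudonymize(rows, column, crosswalk):
    """Return copies of the rows with the PII column replaced by its label."""
    return [{**row, column: crosswalk[row[column]]} for row in rows]


# Made-up rows with a name column that must not appear in the public package.
rows = [
    {"respondent": "Ana Silva", "wage": 12},
    {"respondent": "Bo Chen", "wage": 15},
    {"respondent": "Ana Silva", "wage": 13},
]
crosswalk = build_crosswalk(row["respondent"] for row in rows)
public_rows = pseudonymize(rows, "respondent", crosswalk)
print(public_rows)
```

The same respondent receives the same label in every row, so merges and fixed effects that rely on the identifier should still work. The crosswalk itself is PII: it belongs with the restricted data, never in the public package.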
Back in the AEA deposit
- Log on to the V1 deposit. You should be in “Modifying” mode.
- Remove (delete) the PII data (my_survey_1.dta).
- Update the README with “PII-laden data for this package can be obtained from ICPSR upon provision of an IRB approval.” in an appropriate location. The ideal README follows the Template README published by the Social Science Data Editors, but make only minimal changes to the README in this deposit.
- Delete the old README (completely).
- Upload the new README (in PDF format) to the same location the original one was.
- Create a CHANGES.txt. Identify changes made. E.g. “V2: Removed PII-laden data, adjusted code.”. It should look something like this:

      V1: Original deposit
      V2: Removed PII-laden data, adjusted code.

- Upload the CHANGES.txt.
- (If possible) Link to the DOI of the PII-laden deposit in the “Related publications” section of the deposit (use the “Import from DOI” functionality, choose “is supplemented by” for the “Relationship” field).
- You may have to add the “manuscript number” to the metadata, see this link. Use the last part of the DOI of the paper, e.g. if the paper DOI is https://doi.org/10.1257/aer.p20161120 then the “manuscript number” should be aer.p20161120.
- Re-Submit to the AEA.
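The “manuscript number” rule in the list above (the last part of the paper’s DOI) can be expressed as a one-line helper. This is only a convenience sketch; the function name is made up, and the DOI is the example from the text.

```python
def manuscript_number(paper_doi):
    """Return the last path segment of a DOI or DOI URL."""
    return paper_doi.strip().rstrip("/").rsplit("/", 1)[-1]


number = manuscript_number("https://doi.org/10.1257/aer.p20161120")
print(number)  # aer.p20161120
```

The helper accepts either the full URL or the bare DOI (10.1257/aer.p20161120) and returns the same value for both.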

The end state
After all the fixes,
- V1 is still visible, but can no longer be downloaded.

- V2 is published, contains no PII data, (possibly) links to the deposit of the restricted data, and remains otherwise available to replicators. You are still in compliance with the AEA publication agreements and your obligations as an AEA author!
Final comments
Authors should feel free to reach out to their journal’s data and reproducibility editor with questions about the available options.
[1] The dividing line is approximately Oct 1, 2019. In 2019, we migrated approximately 2,500 older replication packages into openICPSR. Authors were not involved in this process, and so do not have operational access to the replication package. They do, however, have ownership of the deposit, and can request access to it.