NDAs, DUAs in the verification process

10 minute read

Published:

Toni Whited (EIC of Journal of Financial Economics) had the following question on the ex-bird site:

Econ and finance journals are getting their acts together wrt reproducibility. A bigger issue now is data acquired via NDAs. I wish it were normal for NDAs to have a clause letting third parties replicate results on an air-gapped machine.

Don’t know how to make it happen.

A few thoughts.

NDAs with clauses

I have started to see researchers incorporate such clauses into NDAs. Very rare, but possible. More generally, certain NDAs allow to add “third parties” to an existing contract - no special clause necessary, but researchers must be aware of the possibility.

NDAs without clauses

In some cases, no additional clause is needed. Read the NDA carefully: can you assign a new member of the research team (e.g., a new RA at your co-author’s university) to the contract, with minor hassles (simple notification)? Yes? Then you can probably also assign a journal verifier to the contract! Verifiers can be the “RA” on the contract, and can obtain the data (compliant with any other access restrictions). In many cases, authors have then been able to provide us with a non-publishable extract of the same data, for the duration of the verification process. We can sign a NDA with the authors, if needed, but most are fine with an email promise to delete the data. Our process is outlined here.

Journal verifiers with access

Another thing is that in some cases, as long as both the authors and the verifiers are (separately) subscribers, data may be able to be shared. But caution: not every “subscription” is the same. For instance, when researchers use WRDS data, they cannot directly share the data. In principle, I should be able to access the same data sources - as Cornell is also a subscriber. But WRDS subscriptions have a huge menu of sub-options. The researcher may be able to access WRDS via the Python API, but as it turns out, my subscription does not extend to that!1

Journal verifiers acquiring access

In some cases, there is a generalized access mechanism. Thus, while the researcher cannot share the data with us, because the DUA or NDA does not allow for it, it is relatively straightforward for the journal verifier to obtain access. For instance,

all do not allow for redistribution of the data, even of their “publicly accessible” data.

Hint: if there is a login, a checkbox, or something else you need to click on in order to download the data, it is very likely that you cannot simply re-publish (distribute, post) the data (but that is neither a sufficient nor a necessary condition).

But in all three cases, the journal verifier can obtain access to the data, and verify the replication package.

Some of these mechanisms have easy ways for researchers to make re-acquiring the data easy:

While a bit of a hassle, this does allow journals to obtain the same data. Note that this is not always easy or possible: for instance, if the original author used the SOEP data in Europe, they will have had a different access than my US-based team will have, because non-European users only get access to a 95% sample.

Also, Cornell has so far not been able to agree to the terms of use required to even sign a data use agreement, something about GDPR etc… So what do we do then?

Using third party verifiers with access

When NDAs or DUAs cannot be quickly adjusted, or no parallel data access exists, one can reach out to other verifiers. The AEA has a policy on that, and we regularly reach out to folks who we know have access. That would be the case for the SOEP above: reach out to a replicator in Europe (for instance, a fellow Europe-based data editor).

More generally, this will be true for data that have complicated and long-winding access procedures, and where simply it is completely out-of-scope (because illegal) for researchers to provide journals with access to the data. This was the case for the French customs data that we used in the Piveteau (2021) replication. The data are available through CASD, the French data access center, and the cascad replicators have access to CASD, quickly, on a case-by-case basis. We can reach out to them for access to these data. Sometimes, agency or data custodian staff themselves can help out - we’ve had pilot projects with the German IAB, with the Bureau of Labor Statistics, and with the World Bank. This will not always work - volunteering time may be fine in academia, but is not always feasible when working with government or supra-national agencies.

But back to Toni’s point: in principle, if we applied that principle to the journal-centric process, one solution is to have external referees, selected because they have access to the data, conduct the verification, in accordance with some basic principles.

Taking the computations to the data, instead of the data to “my” computer

Toni also wanted to “replicate results on an air-gapped machine.” While that specific security protocol almost certainly won’t satisfy many data providers these days (they want encrypted hard drives, secure login processes, etc., and want the whole thing documented), this does also suggest another process: rather than bring the data to “my” computer, take the computation to “their” computer (or some other authorized system). In a few cases, we have been granted (legitimate, documented) computer access to servers that are authorized to access the restricted data, always remaining in compliance with the authors’ data access protocol. We have so far been granted legitimate and documented access to computers at UC San Diego, the NBER, Harvard Business School, and University of Uppsala, to name a few (thank you!). This not only solves the data access problem, in some cases it also solves the computational resource problem - sufficient parallel cores, licensed software, etc. This is different from the third-party verification process, above, because in this case, we (the journal verifiers) access the data, and run the code that the authors provided elsewhere.

The big caveat

Of course, none of the above will work if the original NDA and data access is idiosyncratic. In all of the above processes, a secondary outcome is that we can improve the authors’ description of how somebody (not just the editorial folks) can get access to the data, and restrictions limiting this. In some cases, the only possible documentation is “this will not work for anybody else.” Getting access to firm-specific data is often heavily dependent on interpersonal relationships, and while that might work to simply extending an NDA, it only works if the original relationship continues to exist. So none of these methods will work in all cases. The fallback is then something that Toni’s journal, and a few other econ journals, but not the AEA (!), require: provide some synthetic or simulated data, that at least allows to assess that the code would in principle run. But at least the fact that access is idiosyncratic should be made clear.

Some data from us

So how does this work at the AEA in practice? I pulled some stats today from our tracking system. Over the past 12 months, authors have identified 112 times that they might be able to provide us privately access to the data, through one of the mechanisms above. In 86 cases, after review of conditions, cost, and availability of resources, we pursued those offers in some fashion. That includes figuring out NDAs or privately shared data. It does not cover when we requested data ourselves from third parties, which happened in 22 cases (some of those cases overlap). In 84 cases, one of those methods culminated with us coming to terms with whoever controlled the data, and obtaining access to the data.

Conclusion

Accessing confidential data as part of the verification process is not easy, but it is possible. Ensuring that others can re-use the data is sometimes easier, especially for researchers in the same field, in others harder, when a special clause in the NDA allowed the journal access, but cannot be extended to arbitrary future researchers. And the process must be allowed to fail, because interesting and illuminating research can be conducted with data that cannot be shared. In fact, an update to one of those NDAs negotiations just hit my inbox as I am writing this, and is likely to fail, due to the data provider not being willing to grant an exception or an extension in line with Toni’s original hope and request. That’s life, and is fine - it will not hold up the rest of the verification process that we are conducting, and will still ultimately lead to the publication of the paper.

I see our goal as (data) editors to make these kind of constraints as clear as possible, for all future readers, for a very long time. After all, somebody might still want to correct the record in 100 years, or review what we knew now. It’s not like those articles are going anywhere, even if the data just might disappear! See one of the first “data corrections” I’m aware of, by Irving Fisher, 1911.2

Additional readings

Footnotes!

  1. We ultimately used a third-party replicator with access to the Python API to verify the article. The replication package was found to be perfectly reproducible. See Conlon, Chris, and Eric So. 2021. “Common Ownership and Corporate Control.” American Economic Review, 111 (3): 1012-48. for the article, and https://doi.org/10.3886/E120083V1 for the replication package. 

  2. Fisher, Irving. “‘The Equation of Exchange,’ 1896-1910.” The American Economic Review, vol. 1, no. 2, 1911, pp. 296–305. JSTOR, http://www.jstor.org/stable/1804304. Accessed 1 Nov. 2023.