What data to provide
What do we mean by “data”?
What data should be provided?
One of my data editor colleagues at Social Science Data Editors relayed a question to me recently (“asking for a friend…”): “do extract files need to be included in AEA data replication submissions?”
They noted “The closest I could find was https://www.aeaweb.org/journals/data/data-code-policy#content which has a blanket request for “data” but it might help to distinguish between “extracts” and “analysis” files.
The short answer
- extract (raw) data can be reliably re-acquired today?
- extract (raw) data can be reliably re-acquired in the future?
- analysis data can be reproduced with reasonable resources?
The longer answer
For the raw input data (which may be an extract - that’s actually not the key criterion), we require that the acquisition of the data be reproducible.
You don’t have to provide any data, if
- the results can be reproduced by downloading precisely specified raw data (reasonableness constraint)
- can reproduce analysis files reliably - subject to a reasonableness constraint.
But you might also have to provide all the data.
Reproducibility, now and in the future, is key, not whether the data is part of the package or not.
But: this is not always straightforward. Take IPUMS as a (frequent) example.
- IPUMS curates its entire database (assign a DOI), so the “reliably in the future” part is OK ✅
- But IPUMS has a purely (as of today) manual data extract system. There is no computer code that can reliably re-acquire an extract today or in the (near) future. ❌
- And even manual extraction is fraught with the potential for error (and can be quite tedious) ❌
We thus prefer to have a copy of IPUMS extracts - and the read-in code, and the data dictionary that they provide with every extract (if you didn’t grab yours, go back into the IPUMS system and download it now!) The AEA’s Data and Code Repository can curate as well as IPUMS, but make the data more easily available, lowering the cost of reproducibility for all future researchers.
If IPUMS were to allow for sharing or archiving of machine-actionable extract specs (not an easy problem), we would reconsider and NOT request extracts!
We ask for extracts from various online query systems (WDI, OECD, etc.) because the precise query cannot be saved, or is complicated, etc.
One obvious solution is for all data query systems to have some system to save or export query code.
A nice (simple, but not perfect) example for such a stored-query system is the PSID:
- They do curate their database, though no DOIs are assigned, and you need to know to ask (meh… but OK: ✅)
- While they do not assign DOIs to historical queries, you can (as a user) share a query, see https://simba.isr.umich.edu/DC/c.aspx. ✅
- Which is good, since you are not allowed to share the data itself! ❌ (Actually, you can, via https://www.openicpsr.org/openicpsr/psid)
The obvious corollary is:
- If a data provider curates the data (say, at a trusted repository, or with a credible preservation policy) (for which a DOI is a signal, but not a condition) ✅
- and the data can be downloaded, or queried through an API1 or something similar ✅
we will REJECT your data deposit - please simply leave it where it is!
But cite it, and provide precise instructions on how to get it (if a simple click is not sufficient).
And those analysis files?
So when do we want analysis (intermediate) files?
- when we cannot reliably get the raw data, and there are no issues with redistributing the analysis files
- when the raw data can be redistributed, but are “too large” to acquire (at the AEA, that boundary is around 30GB)
- when the raw data can be redistributed, but the analysis data are already the result of a very resource (computer count or time) intensive processing
In all of those cases, we want the analysis data.
When do we NOT want the analysis data? Well, if the raw data can be trivially acquired (download, API1), and can be reasonably quickly processed, then we (and future researchers) do not need your copy of the analysis data. But it also doesn’t hurt…
BTW, “reasonably” means we are not using more than maybe 50 compute cores. “Quickly” means that it runs no longer than 2-3 weeks - to produce the analysis data.
And just contact us
In all cases, if you have questions, concerns, or doubts, do contact me - as many others have already done.