The template README required by various econ journals asks for a statement about the rights to RE-distribute data. Many economists are confused by this: “But the data is publicly available.” Let me try to disentangle that somewhat.
The statement about rights
The statement says:
I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package. Appropriate permission are documented in the LICENSE.txt file.
Two questions arise: what does this mean, and what is an author supposed to do with the LICENSE.txt file? And then of course, how to cite it all.
Questions and some answers
Here are a few questions that arise:
Do you require licensing information for all the data that we are using, or just the specific data that is provided in the replication package?
Many econ papers use data from multiple sources, including data that are clearly confidential. Some data sources are not confidential, but are subject to copyright or licensing agreements. The replication package should ONLY contain data that can be redistributed, and the statement ONLY pertains to such data. It does not pertain to data not included within the replication package.
Of course, a good chunk of the README is spent describing ALL the data sources, including the confidential ones, the non-confidential ones that are copyrighted, the non-confidential ones that are proprietary, or otherwise subject to redistribution restrictions. ALL data provenance must be described, sufficient for a replicator to start from scratch, ignoring even the data provided within the replication package. That description should contain the licensing information for every dataset as well, at least in summary. It should list pertinent conditions and restrictions, for instance: application process, required residency or citizenship, etc. The LICENSE.txt is more likely to contain a long text in legalese, the formal permission to use the data that is in fact included in the package.
Why can’t I redistribute this data, since it is publicly available?
I’ll take the IPUMS CPS and the S&P 500 as examples.
“Publicly available” does not mean that you have the rights to distribute. For instance, you can use the S&P 500 data distributed by FRED, but you are not allowed to re-distribute that data, as noted on the page there. You would describe that in the README, but the license would not be in the LICENSE.txt, because you are not, in fact, including the data!
Turning to the CPS: When CPS data is pulled directly from the US Census Bureau website, it is in the “public domain,” i.e., not subject to copyright. (Note that this is not clear from the Census Bureau’s website, which simply assumes that you know that! Check the USA.gov website for an explanation.)
Note that other licenses are “sticky”: if you obtain data that is under a CC-BY (Creative Commons) license, then you can do what you want with the data, but you must include the CC-BY license and the original source attribution. In essence, that part of your data is also under a CC-BY license.
What goes into the LICENSE.txt?
This is straightforward if you have a single source. It should state the permissions and conditions that are attached to that file. If the data you obtained has a LICENSE.txt (some do), then simply include that with your deposit. Licenses can be long - they are written in legalese, because they are, in fact, legal permission to do something with the data.
If you created or collected the primary data yourself, you define the LICENSE. By default, deposits on openICPSR have a CC-BY license. We have a small discussion of how to choose licenses at the AEA Data Editor’s website and the Social Science Data Editors’ website.
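When a package draws on several sources, the same logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataSource` class, the `build_license_txt` helper, the section layout, and the three example entries are all hypothetical, and no journal prescribes this format. The sketch shows two rules from the discussion above: data that is not shipped is left out of LICENSE.txt (it is described in the README only), and data that is shipped must come with documented redistribution permission.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One data source used in the paper (hypothetical structure)."""
    name: str
    included: bool          # is the data file shipped in the replication package?
    may_redistribute: bool  # is there documented permission to redistribute it?
    license_text: str = ""  # license or permission text, if any


def build_license_txt(sources):
    """Return LICENSE.txt content covering only data shipped in the package."""
    sections = []
    for src in sources:
        if not src.included:
            # Not shipped: describe access conditions in the README instead.
            continue
        if not src.may_redistribute:
            raise ValueError(
                f"{src.name} is included but lacks redistribution permission"
            )
        sections.append(f"== {src.name} ==\n{src.license_text.strip()}\n")
    return "\n".join(sections)


# Illustrative entries only; check the actual terms of your own sources.
sources = [
    DataSource("CPS (downloaded from the US Census Bureau)", True, True,
               "Public domain (US government work)."),
    DataSource("S&P 500 (via FRED)", False, False),
    DataSource("Author-collected survey", True, True, "CC-BY"),
]

license_txt = build_license_txt(sources)
print(license_txt)
```

If a data provider ships its own LICENSE.txt, its full text would go into `license_text` unchanged rather than a one-line summary.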
OK, so how do I cite the data?
Thank you for that question, since most researchers in economics do NOT cite the data. See this extensive discussion over on the Social Science Data Editors’ website, consult the AEA Style Reference, and remember that not all data distributors provide suggested citations that satisfy the Data Citation Principles. If they ask you to cite a working paper, do so, but also cite the data correctly.
Most of this data was collected in 2016, but the IPUMS AHTUS data citation page only lists the 2018 version. I am unable to find a historic example of this specific citation.
IPUMS doesn’t make it easy, I know. There is a not-so-obvious list of versions of the data at this URL: https://www.ipums.org/projects/ipums-time-use. The IPUMS Time Use Revision history page does NOT list the DOIs for the various versions. So as far as I can deduce from that, you would have used V1.0 of the AHTUS data, which has DOI 10.18128/D061.V1.0. You can use https://citation.crosscite.org/, or simply adapt their suggested citation accordingly.
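If you prefer to script this, a formatted citation can usually be requested directly from the DOI resolver through DOI content negotiation. The sketch below is a minimal example, assuming that the registration agency behind the DOI supports the `text/x-bibliography` content type; the `citation_request` helper is a hypothetical name, and `style` is expected to be a citation style name such as `apa`. It only builds the request, so it runs without network access. The commented lines show how you might send it.

```python
import urllib.request


def citation_request(doi, style="apa"):
    """Build (but do not send) a request for a formatted citation of a DOI."""
    url = f"https://doi.org/{doi}"
    # Ask the resolver for a formatted bibliography entry instead of a web page.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


req = citation_request("10.18128/D061.V1.0")
print(req.full_url)
print(req.get_header("Accept"))

# To fetch the citation (requires network access):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Whatever comes back should still be checked against the AEA Style Reference and adjusted by hand where needed.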
Data providers have not all adapted to a world where data citations are ubiquitous, or should be. Sometimes it takes a little extra work.
Note that data citations should appear both in the manuscript (upon first mention of the data) and in the README (properly formatted as in the manuscript).