I'm reproducing and updating here the most popular post I've ever had on my personal blog. Although I originally wrote the piece in 2014, nearly three years later we find ourselves in much the same situation. We can do better.
There's a serious problem with the current state of shared data - it is almost completely unusable! Here are some ideas for sharing more effectively.
I often have a question I'd like to answer for which I know data are available. Most recently I wanted to look up the incidence (number of new cases) of various infectious diseases over the last decade. This should be easy - CDC publishes the Morbidity and Mortality Weekly Report of just that. Well, the data are indeed available - but only in PDF. [Update: This is no longer true. CDC now has an entire data portal (hallelujah). Project Tycho has taken on the task of digitizing all of the historical MMWRs. Despite this incredible progress, most public health data are still published in PDF, so the problem endures.]
Why even bother with computers? They might as well mail around a printout. If I wanted to actually analyze it, I would first need to enter a decade's worth of data by hand. Ain't nobody got time for that.
I don't mean to pick on CDC. County Health Rankings is an awesome website that aggregates and releases for download public health data from a variety of sources. I'm grateful for that, but the Excel files they release each have multiple sheets, nested headers, merged cells, and extra columns with confidence intervals. It's pretty much impossible to analyze that data in a program other than Excel. To do so, I first have to manually select and reformat the data I want, rename the variables, and then copy/paste it into a new file - which rather defeats the purpose.
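To make that reformatting chore concrete, here is a minimal sketch in Python (standard library only) of collapsing a two-row nested header into a single row of plain column names. It assumes one sheet has already been exported to CSV, with merged cells showing up as blanks. The county names, measure names, and numbers are made up for illustration and are not taken from any County Health Rankings file, and `flatten_header` and `tidy` are hypothetical helper names.

```python
import csv
import io

# Hypothetical export of one sheet: a merged top-level header row followed by
# a sub-header row. After export, merged cells are blank except the first.
raw = """\
,Measure A,,,Measure B,,
County,Value,CI Low,CI High,Value,CI Low,CI High
Adams,10.5,9.0,12.0,3.2,2.8,3.6
Baker,8.1,7.0,9.2,4.0,3.5,4.5
"""

def flatten_header(top, sub):
    """Combine two header rows into one row of machine-friendly names."""
    names = []
    current = ""
    for group, label in zip(top, sub):
        if group.strip():
            current = group.strip()  # carry a merged group label rightward
        parts = [p for p in (current, label.strip()) if p]
        names.append("_".join(parts).lower().replace(" ", "_"))
    return names

def tidy(text):
    """Return one dict per data row, keyed by the flattened column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header = flatten_header(rows[0], rows[1])
    return [dict(zip(header, row)) for row in rows[2:]]

records = tidy(raw)
```

Each record then has unambiguous keys such as `measure_a_ci_low`, which any analysis tool can read. Real files tend to be messier than this (footnotes, blank rows, several sheets), which is exactly why it's kinder to publish the flat version in the first place.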
There are about eight million other examples that I had to restrain myself from enumerating. The point is that sub-optimal sharing practices make it difficult for researchers (of both the professional and citizen variety) to actually use shared data. The research either a) won't get done because it's too much of a hassle, b) will have errors from manual data entry, or c) will take way longer than it should. Possibly all of the above. With that in mind, I came up with some tips to level-up your data sharing.
Learn how to step up your sharing game:
Bookmark these guidelines. Next time you reach for the 'export to PDF' button, or begin to use the change-cell-border feature on Excel, pull this out and remind yourself, 'this is not machine-readable. Nobody will use my data if I release it like this.' Then rejoice that you are awesome for sharing your data, and for doing so in a way that is actually useful. And for that, I thank you.