Data, open data, and the stuff at the back of your filing cabinet
Every year research projects finish, and doctoral and master’s students graduate. In the vast majority of cases, the data generated during the research then becomes inaccessible, either because the student most closely affiliated with the data loses that affiliation (and sometimes even access to the data) or because the professor who supervised the project has moved on. Even when that is not the data’s fate, only in the rarest of cases is a dataset properly archived in a manner that someone other than the researchers directly involved in producing it can actually use. This raises the question: so what? In the case of data reuse, the “so what?” is actually a pretty big one.
First, science budgets for research are generally shrinking. Second, the number of researchers competing for those funds has greatly expanded, as even many of the most modest teaching colleges are being pressured to conduct research. Consequently, the available funds per researcher are becoming ever scarcer. Third, the cost of doing research only increases each year (e.g. overheads, tuition charges, recharge rates on equipment). Fourth, most science follows some form of continuum. Thus, good datasets are needed to validate new techniques, calibrate computer models, and show comparative progress. Nowhere is this easier than when there is either a classic, closed-form solution for a problem or a set of data/tests against which the community benchmarks itself. So given the shrinking pie of research funds, archiving and distributing datasets becomes critical to moving the field forward in a timely manner, especially for new and younger researchers who may not yet have large research grants.
So while it may be obvious why a scholarly community wants to preserve, archive, and make accessible its data, the benefits for individual researchers may not be as well articulated. Most researchers who are currently making their data accessible are doing so because it is (1) a grant requirement, (2) an institutional policy, or (3) seen as good public relations, with most researchers responding with resigned compliance in the spirit of “you should eat your vegetables, because they are good for you”. While all of this may be true, I believe it is the wrong attitude with which to undertake this work.
Publishing your data has an enormous professional upside. The act shows your academic peers the complexity, diversity, and/or progression of your work and validates ownership in a way that publishing a paper cannot necessarily do. Making your data available to others is simply a good business model, as it attracts collaborators and may provide new insights as others bring their own training and skill sets to the analysis of your data. Publishing your data has other benefits as well. Foremost, it creates a permanent record of the data that you as the researcher are not necessarily responsible for maintaining. Next, sharing data with those around you fosters collegiality and builds resources through the synergy of combined/layered datasets; in many ways the field of epidemiology has always worked this way. Data sharing also forces the timely usage of our own data – we are all guilty of having at least a small treasure trove of data that has never been analysed, to say nothing of published. Pushing our data out into the public realm simply forces us to up our game a bit and to extract promptly all of the benefits of the work we have already done, instead of chasing the next bit of money and starting something new. Finally, the newest (and possibly most personally important) reason to publish data is that datasets can now be archived and cited with their own digital object identifiers (the same as those assigned to a paper). This is done through organizations such as DataCite. Thus, when other researchers use your data, you should be able to start collecting citations on the published datasets.
Regardless of the motivation, the mechanisms for publishing are fairly limited. In addition to DataCite, the basic storage options are as follows: (1) selectively linked data (e.g. an open access journal); (2) the investigator’s home page; (3) the investigator’s institutional repository; and (4) a professional organization repository. Another less long-term but rather high-impact means of dissemination is to offer your datasets as the basis for academic contests (student and otherwise) that are regularly held at many conferences. This is usually done by approaching a conference chair or steering committee and proposing a track of papers that would be the output of the contest results. In that way, you are perceived as the expert and you entice others to use your data in ways that you think are most appropriate. This is done by both setting the contest tasks and predefining the metrics by which they will be judged.
For over a decade, scholars at top institutions worldwide have been discussing the need to permanently archive (not just store) research datasets and the challenges associated with such archiving (e.g. space, multi-mode indexing, mixed data granularity, widely varying file types, the simultaneous presence of two-dimensional, three-dimensional, and four-dimensional data, plus version control and trustworthiness). I believe, however, that with the help of the computer science community most of those issues will eventually sort themselves out. What is still needed is a major paradigm shift toward thinking about datasets as highly valuable outcomes in and of themselves. Perhaps the next Google foray will be “Google data”. Happy archiving.