Impressions from the 2nd Plenary of the Research Data Alliance

The Research Data Alliance (RDA) held its second plenary meeting on Sep. 16-18, 2013, at the National Academy of Sciences in Washington, DC. The RDA came into being a little over a year ago, following a realization that the challenges surrounding research data need to be addressed on a global scale, and it is currently supported by the European Commission, the U.S. Government, and the Australian Government. These are my personal impressions (and mine alone) from the event, without claim to completeness, complete accuracy, or representation of the speakers’ opinions.

Tom Kalil, Deputy Director for Technology and Innovation in the White House Office of Science and Technology Policy, stated that one reason for providing open access to research data is that replicability of results is currently a real problem in science. Several federal agencies are promoting “Big Data” development and investing in research data initiatives. Presidential innovation funds will be launched to bring in entrepreneurs to work on applications that make federal data more accessible and usable. The RDA, as one of its roles, should help to generate empirical findings that Open Data is useful, taking us beyond the assumption that it is. A longer-term goal should be the establishment of an international research data commons.

John Wilbanks, Chief Commons Officer at Sage Bionetworks, suggested that as the amount of research data that could potentially be shared rapidly increases, prioritizing which data is actually shared will become important, and that for a shared dataset to be and remain useful to future researchers, a certain commitment to it by its creators will be needed. John showed an example of incomplete data publishing: a Google Earth-based visualization of wind turbines and their functioning under different environmental conditions, where the underlying dataset was not downloadable and documentation on how the data came to be was difficult to find. Speaking about the value of applying an explicit license to a dataset, he advised that, in the absence of such a license, a potential (re)user has to assume that the copyright statement of the web site through which the dataset is retrievable applies to it. That may not be the intent of the dataset provider, who may have meant the dataset to be more directly reusable than the other content of the organization’s web site. Lastly, John proposed that the sole determinant of whether a dataset remains available and usable should not be its present value but rather its potential future value; otherwise, many scientifically useful datasets may be discarded.

One important difference between many professional conferences and the RDA Plenaries is that the second day of the latter, and much time in between plenaries, is dedicated to the activities of working groups (WGs), which aim to produce usable output for the research data community, and interest groups, which may form working groups. RDA WGs try to solve a problem within a timespan of 12-18 months, in a work mode inspired by the Internet Engineering Task Force (IETF).

One that caught my attention is the Metadata Standards Directory (MASDIR) Working Group, whose “overriding goal is to develop a collaborative, open directory of metadata standards applicable to scientific data.” This is an important activity, given that research data management plan requirements now include statements about the metadata standards used, while confusion abounds in the research community regarding which standards even exist or may be applicable to researchers’ own data. Another one is the “long tail of research data” interest group, which is targeted at universities: a good deal of funding and data curation attention and resources go to large projects with large datasets, while many smaller projects with smaller datasets go under-served in taking care of the data they produce. Lastly, the (so currently named) Data Citation WG is actually focusing more specifically on the problems of citing data sources whose underlying data stream is frequently or constantly updated, so that the data source is not a static, finite object, which poses challenges to some established ideas of citing sources in research. In this proposed WG’s meeting, Micah Altman presented an overview of the work of the FORCE11 Data Citation Synthesis Group.

With interest in research data growing so rapidly across communities including governments, funding agencies, libraries, research centers, and others, I believe one challenge the RDA as an organizational entity will face is successfully scaling up its activities in line with its own increased exposure and growth, especially given its global operation, with plenaries held (so far, and as planned for 2014) in Europe and the USA. The targeted model of fee-based institutional RDA membership (membership remains free for individuals) may help sustain this, going beyond what is provided by the original supporting agencies, if it is structured appropriately. Furthermore, some of the activities of the relatively new RDA were taken up earlier in other professional communities, if not at the same scale, so coordination and communication will remain important for the RDA in order not to “reinvent wheels.”

-Stefan Kramer