Elements of a Distributed Research Data Network
As preparation for a panel discussion at Health Datapalooza 2018 on distributed data models, I wanted to jot down some thoughts on what I think you need to have a successful distributed research network. (Possibly these things are beside the point of that panel, but I've been wanting to write this for a little while. So here it is.)
What I mean by a 'research network' is a group of data-having organizations who intend to offer data access to remote users via some interface, in order to answer research questions. For me the paradigmatic example is of course the HCSRN. That's the organization I'm most familiar with, having been in on its VDW effort from pretty much the beginning in 2002.
So what do you need for a distributed research data network?
A Data Model
A data model is generally a collection of specifications for the tables that are available in the interface you're offering to remote users. It communicates to your users:
- What data they will find in your instance (e.g., lab results, insurance periods, pharmacy fills).
- How to refer to the data—e.g., table and field names.
- How the data is structured—what relationships the tables bear to one another.
At least as important, the data model should communicate to your implementers just what data has a place in the data model, what it needs to look like, and where in the scheme it should go.
To that end, good specifications are more than just those table and field names. They should also feature some verbiage that sets out what data does and does not belong in each table (e.g., the Race fields should hold the patient's self-reported race values only—do not use non-self-report sources for these fields). Optimally, they will also specify some natural key for the table—some combination of substantive fields that should uniquely identify a record (e.g., there should be one record per (MRN, RXDATE, NDC, and RXPROVIDER)). At some point this meta-description verges into full-on implementation guidelines, discussed below, but there should certainly be some general description to help users and implementers alike know what should be in the tables.
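To make the natural-key idea concrete, here is a minimal sketch of how an implementer might look for records that violate it. It assumes a hypothetical `rx_fills` table held in an in-memory SQLite database; the table name, the helper `find_duplicate_keys`, and the sample rows are all invented for illustration, and only the four key fields come from the example above.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_fields):
    """Return one row per natural-key value that appears more than once.

    Each row holds the key values followed by the number of records sharing them.
    Table and field names are interpolated directly, so pass trusted names only.
    """
    cols = ", ".join(key_fields)
    sql = (
        f"SELECT {cols}, COUNT(*) FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


# Hypothetical pharmacy-fill table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rx_fills (mrn TEXT, rxdate TEXT, ndc TEXT, rxprovider TEXT)"
)
conn.executemany(
    "INSERT INTO rx_fills VALUES (?, ?, ?, ?)",
    [
        ("A1", "2018-01-05", "11111111111", "P9"),
        ("A1", "2018-01-05", "11111111111", "P9"),  # violates the natural key
        ("A1", "2018-02-05", "11111111111", "P9"),
        ("B2", "2018-01-05", "22222222222", "P3"),
    ],
)

# Any rows returned here are natural-key violations worth investigating.
duplicates = find_duplicate_keys(
    conn, "rx_fills", ["mrn", "rxdate", "ndc", "rxprovider"]
)
```

An empty result means the table honors the stated key; anything else is a list of key values to chase down.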
Table specifications are the bare minimum you need for a successful distributed research data network (DRDN). In addition to these, it is optimal to have:
- Objective QA tests
- Implementation guidelines
- An authority to construe the specifications and answer user and implementer questions.
That last element is potentially very expensive, of course: it may involve paying knowledgeable humans to answer questions and make recommendations. If your network is very successful and can gain many implementers, then it's possible that an active community will spring up around it sufficient to keep it going without money changing hands (in much the same way that happens with some open-source software). But it would be unwise to count on that happening very early; it's much better to seed the ground by starting out with a paid group.
As the VDW effort matured I came to really appreciate the existence of objective QA tests as a supplemental source of data specification. Just the exercise of agreeing on what checks you should have (apart from 'all fields should exist & be the proper type') is incredibly valuable for surfacing the frequently diverse understandings and assumptions of your implementers.
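As a rough illustration of what "objective" can mean here, the sketch below expresses each check as a query that counts offending rows, so a check passes only when its count is zero. The `enrollment` table, its `enr_start` and `enr_end` fields, the two checks, and the sample rows are hypothetical stand-ins rather than actual VDW QA rules.

```python
import sqlite3

# Each check is a name plus a query that counts the rows violating it.
# These two rules are illustrative assumptions, not real network checks.
CHECKS = {
    "enr_start is never missing":
        "SELECT COUNT(*) FROM enrollment WHERE enr_start IS NULL",
    "enr_end is not before enr_start":
        "SELECT COUNT(*) FROM enrollment WHERE enr_end < enr_start",
}


def run_checks(conn, checks):
    """Run every check and return {check name: number of violating rows}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


# Made-up insurance periods; ISO date strings sort in calendar order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (mrn TEXT, enr_start TEXT, enr_end TEXT)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        ("A1", "2015-01-01", "2016-12-31"),
        ("B2", "2017-06-01", "2017-01-31"),  # period ends before it starts
    ],
)

results = run_checks(conn, CHECKS)
failed = [name for name, violations in results.items() if violations > 0]
```

Writing checks down this way forces the group to agree on exactly what counts as a violation, which is where the differing assumptions tend to surface.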
In addition to the above, it would be lovely to also include a sample implementation from widely available reference data (e.g., Optum, Medicare Claims). The OMOP folks, for example, have put together an implementation of an ETL process that creates OMOP data from CMS's synthetic public use file.
Likewise, since the healthcare and health research fields are both ever-changing, it's good to have a process by which the data model can be amended.
Finally, in addition to a human-readable specification of your data model, it's always nice to include one or more flavors of SQL DDL scripts that create an empty set of the tables that constitute your data model.
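For a sense of what such a script might contain, here is a small sketch that runs a DDL script against an in-memory SQLite database and lists the empty tables it creates. The two tables, their columns, and the SQLite flavor are assumptions chosen for illustration; a real data model would ship its own definitions, probably in several SQL dialects.

```python
import sqlite3

# Hypothetical DDL for two tables; names and types are illustrative only.
DDL = """
CREATE TABLE demographics (
    mrn        TEXT PRIMARY KEY,
    birth_date TEXT NOT NULL
);
CREATE TABLE rx_fills (
    mrn        TEXT NOT NULL REFERENCES demographics (mrn),
    rxdate     TEXT NOT NULL,
    ndc        TEXT NOT NULL,
    rxprovider TEXT NOT NULL,
    PRIMARY KEY (mrn, rxdate, ndc, rxprovider)
);
"""


def create_empty_model(conn):
    """Run the DDL script and return the names of the tables it created."""
    conn.executescript(DDL)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
tables = create_empty_model(conn)
```

Declaring the natural key as the primary key, as in the second table, lets the database itself enforce part of the specification.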
A Query Processing Engine
This is what your users will use to manipulate the data from afar. Because they are doing so "from afar" it is crucial to be as specific as possible as to what that environment is like. Examples include:
- Python 3.6 or later, with pyodbc 4+ and pandas v02+
- R 3.4 or later
- Oracle PL/SQL version 9i or later
- SAS 9.3 or later
- ::centrally created, possibly open-sourced, web app query tool with corresponding execution nodes at implementing sites::
For actual programming environments, where queries consist of human-generated, small-batch, free-range, artisanal code, you want to specify that environment in a lot of detail. Generally, you want to enable your users to create a similar environment locally (either with their local implementation of your data model, or from synthetic data) so they can formulate and test their queries before inflicting them on the network.
Opting for a custom web-based GUI application for specifying queries takes out a lot of the guesswork in terms of "will it run at the sites?" but of course that comes at the cost of having to create said GUI application, and of limiting the sophistication of what users will be able to do.
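One way to make an environment specification checkable is to publish the minimum versions in machine-readable form and let users compare their local setup against it. The sketch below is one possible approach under stated assumptions: the helper names, the `REQUIRED` table, and the sample installed versions are invented for illustration, and the version parsing handles only simple dotted numbers.

```python
import re

# Hypothetical minimum versions a network might publish.
REQUIRED = {"python": "3.6", "pyodbc": "4"}


def parse_version(text):
    """Turn the leading dotted number of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, required):
    """True when the installed version is at least the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '3.6' and '3.6.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def unmet_requirements(installed, required):
    """Return the sorted names of components that are missing or too old."""
    return sorted(
        name
        for name, minimum in required.items()
        if name not in installed or not meets_minimum(installed[name], minimum)
    )


# Made-up local environment: Python is new enough, pyodbc is too old.
local = {"python": "3.6.1", "pyodbc": "3.0.10"}
problems = unmet_requirements(local, REQUIRED)
```

In practice the installed versions would come from the running interpreter and its packages rather than a hand-written dictionary.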
A Registry of Implementations
Having some central clearinghouse where implementers can register that they have implemented your data model, announce that they are open for business, and say what queries they're willing to entertain is crucial. The popularity of a data model comes from how many organizations have implemented it. This is also a great place for implementers to set expectations in terms of:
- The extents-in-time of their data holdings.
- The date they last ran the QA tests, and their results.
- What QA issues are currently known, and whether they are fixable/in process of being fixed.
- What specific versions of the query engine software they have available.
- What hoops there are to jump through in order to get a response—e.g., data use agreements, fees, approvals, applications, etc.
On the HCSRN we have a portal website that we use to document these things.
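To show how the items in that list could be captured in a structured way, here is a hedged sketch of a registry entry and one query a prospective user might run against it. The `RegistryEntry` class, its field names, the `sites_covering` helper, and the two sample sites are all hypothetical; this is not a description of how the HCSRN portal stores anything.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class RegistryEntry:
    """What one implementing site publishes about its instance (illustrative)."""
    site: str
    data_start: date                 # extents-in-time of the data holdings
    data_end: date
    last_qa_run: date                # when the QA tests were last run
    known_qa_issues: List[str] = field(default_factory=list)
    engine_versions: List[str] = field(default_factory=list)
    access_requirements: List[str] = field(default_factory=list)


def sites_covering(registry, start, end):
    """Return the names of sites whose holdings span the whole study period."""
    return [
        entry.site
        for entry in registry
        if entry.data_start <= start and entry.data_end >= end
    ]


# Two made-up sites for illustration.
registry = [
    RegistryEntry(
        site="Site A",
        data_start=date(2000, 1, 1),
        data_end=date(2017, 12, 31),
        last_qa_run=date(2018, 3, 1),
        engine_versions=["SAS 9.4"],
        access_requirements=["data use agreement"],
    ),
    RegistryEntry(
        site="Site B",
        data_start=date(2012, 1, 1),
        data_end=date(2017, 12, 31),
        last_qa_run=date(2017, 11, 15),
        known_qa_issues=["lab results incomplete before 2014"],
        engine_versions=["SAS 9.3", "R 3.4"],
    ),
]

eligible = sites_covering(registry, date(2010, 1, 1), date(2016, 12, 31))
```

Keeping entries structured like this lets users shortlist sites before they start on the paperwork.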
An Implementing Community
If nobody implements your data model, you don't have a network. And as soon as more than one org implements, you have the problem of ensuring that different implementers are interpreting those specifications in something like the same way. In a lot of ways, specifications are like legislated laws: they usually sound logical, coherent, and reasonable when they are formulated, in the abstract, with particular use-cases in mind. But just as we need judges to construe those laws in order to apply them to specific situations (often disturbingly different from those anticipated when the law/specification was first written), we need, at a minimum, a way to arrive at consensus regarding how specific decisions should be made.
Arriving at consensus means having a conversation, be it live at meetings, on conference calls, or asynchronously in e-mail or on forum threads. The means for having those conversations are what I mean by Community.
No need to belabor this one. Without people interested in using your data model, again, you don't have a network.
What have I forgotten? Tell me in the comments.