Implementing the Virtual Data Warehouse: March 2011

In which I plan how to answer questions put to me by one of the moderators of the upcoming CRN Scholars concurrent session at this year's HMO Research Network conference...

Q: The VDW appears to offer the opportunity for new investigators to launch a multi-site study with relatively modest resources. Is this true?

I would say it allows multi-site studies to spend much less on programming. How much less is very dependent on the needs of the study. If you can stay entirely within the domain of data that are available in the VDW, then you should not need very heavy programming work done at the non-lead sites. So if in a pre-VDW world you might budget a biggish project w/30% FTE programmer at all of 6 sites, post-VDW you can probably get away with more like 40% at the lead site, and 10% or less at the other sites. But that's a big 'if' of course--you can do a lot of studies just with VDW data, but you can't do everything. And of course programming is just one of many items in a study budget--you've also got to have Investigators, Project Managers, and so forth.
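To make the arithmetic behind that budget comparison concrete, here is a minimal sketch. It uses only the illustrative figures from the paragraph above (6 sites, 30% FTE everywhere pre-VDW, 40% lead and 10% elsewhere post-VDW). The function name is made up for the illustration, and since the text says "10% or less" at the non-lead sites, the post-VDW total is an upper bound.

```python
def total_programmer_fte(lead_fte, other_fte, n_sites):
    """Sum programmer FTE over one lead site and (n_sites - 1) other sites."""
    return lead_fte + other_fte * (n_sites - 1)


N_SITES = 6  # the example project above

# Pre-VDW: every site, lead included, carries a 30% FTE programmer.
pre_vdw = total_programmer_fte(lead_fte=0.30, other_fte=0.30, n_sites=N_SITES)

# Post-VDW: heavier at the lead site, much lighter everywhere else.
post_vdw = total_programmer_fte(lead_fte=0.40, other_fte=0.10, n_sites=N_SITES)

print(f"pre-VDW programmer FTE:  {pre_vdw:.2f}")
print(f"post-VDW programmer FTE: {post_vdw:.2f}")
print(f"reduction: {(pre_vdw - post_vdw) / pre_vdw:.0%}")
```

On those numbers the programming line goes from 180% of an FTE to 90%, roughly half, and the gap widens as you add sites because each extra site costs 10% rather than 30%.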

Q: How much effort is required to obtain VDW data across multiple sites?

From a technical standpoint, not a ton, frankly. You've got to write a VDW program of course, and writing those is generally more difficult than writing single-site extraction programs, just because you have to be extra defensive in your programming, and because there's usually an extra round of programs to write to collate data you get back from your sites. But that work is amortized over your sites & so in the aggregate it's a win, effort-wise.

But of course that doesn't really touch much of the effort in doing multi-site research. You've got to engage with scientific staff at the sites to get collaborators. You've got to get a project funded, and teams at the sites staffed. You've got to get someone to run that program you've written, and send you back the results. You've got to have conference calls and agree on exactly what bits of data should move from one site to another (just aggregates? person-level? event-level?) and by what means. VDW doesn't solve any of that of course. It's very very useful. But not magic.

Q: What are the specific tasks that the VDW Implementation Group, content area Work Groups, and Site Data Managers have done “up-front” to make it easier for researchers to obtain their research data, and to combine similar data across sites?

Of those three groups, the efforts of the Site Data Managers--the people actually implementing the VDW files at the sites--dwarf those of the other two groups. That's just inescapable--the rest of us can plan, scheme, pontificate even, but the only people that can actually touch/poke/prod data, and have any clue what's going on on the ground at site X are the people at site X. That just comes with having different sites.

Probably the biggest thing SDMs have done (at most sites anyway) is simply collating the data from X different disparate local systems into the very few files that comprise a VDW implementation. So for instance at Group Health, in order to get utilization, px & dx data from 1993 up to the present, from inpatient & outpatient settings, we have to hit something like 13 different data systems (and we are not the most complex site, not by a long shot). Some are claims-based, some are EMR-based, and some are pre-EMR legacy system data. Having data from all of those systems in a single set of 3 files is enormously convenient, even just for single-site studies. (Which is why most VDW files are sources of first resort for all of our programmers--not just the ones on multi-site studies.)

So--SDMs good. SDMs wonderful. But the other groups also contribute of course. The Work Groups steward our specifications, which are our precious islands of shared understanding, in a sea of ambiguity. The specs are our touchpoints with one another--where we figure out just how alike and different our sites are from one another. In addition to developing the specifications, they also construe them for implementers--by which I mean respond to questions, both abstract and concrete regarding implementations.

I've been doing multisite research programming long enough to remember the pre-VDW days, in which we would distribute English descriptions of the data we needed, have many many conference calls hashing out details of each site's implementations, and go back and forth over anomalies & misunderstandings. And lots of that effort would be just plain duplicative--writing code to implement a particular definition of "continuously enrolled" for example.

And then at the end of the study you would throw it all away! It was insane.

WGs have also historically presided over our shared Quality Assurance efforts (and several have been going strong with QA work even in these intense days of V3 implementation). That's an area that we hope to take up anew, now that we have the many many V3 changes (mostly) behind us.

As for the VDW Implementation Group & VDW Operations Committees, I think of the former as the community of SDMs & the latter as those Scientist collaborators who have graciously chosen to help give us direction so the VDW stays relevant to the needs of HMORN researchers. Strategizing for VDW is a very difficult thing, because the Scientist/users know the "bang" part of the equation, but the SDMs know the "buck" part. So no single person can really balance everything.

Q: How consistent/uniform are the data across sites?

This is a horrible question to try and answer in the abstract. The answer, of course, is that "it depends". Depends on which sites, over what time periods, what sorts of data, for what purpose, and what you mean by "consistent".

I will say that the domain of VDW applications for which you can use the data uncritically, without testing the waters, investigating whether the particular data you expect--those left-handed-Dentists-without-tonsils you love to study--are in fact there as you expect them to be, is close to empty. Probably you can do quick counts for feasibility without investigating consistency/quality. But short of that, users are well advised not to assume very much in terms of consistency.

The specs that we all code to mean the files should exist, the variables should exist, be of the type specified, and contain only allowable values. We guarantee that the best humans we could find have populated those files at the sites, using their best judgment, the best data they have available, with the best guidance we could give them. But our data are so thoroughly touched by fallible fallible human hands and minds (starting at the care setting mind you) that there are bound to be points of inconsistency--not all of which reflect e.g., expected practice variation. VDW is not a magic bullet, alas.
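To show how narrow that guarantee is, here is a minimal sketch of the kind of conformance check the specs make possible: variables exist, have the specified type, and hold only allowable values. Everything named here is hypothetical. The spec fragment, the variable names and the allowed values are invented for the illustration and are not taken from any actual VDW specification.

```python
# Hypothetical spec fragment: variable -> (expected type, allowed values or None).
# These names and values are made up; they are NOT the real VDW specs.
SPEC = {
    "mrn": (str, None),
    "gender": (str, {"F", "M", "O", "U"}),
    "birth_year": (int, None),
}


def check_rows(rows, spec):
    """Return a list of human-readable problems found in rows (a list of dicts)."""
    problems = []
    for i, row in enumerate(rows):
        for var, (expected_type, allowed) in spec.items():
            if var not in row:
                problems.append(f"row {i}: missing variable {var}")
                continue
            value = row[var]
            if not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {var} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
            elif allowed is not None and value not in allowed:
                problems.append(f"row {i}: {var} has disallowed value {value!r}")
    return problems


# Toy rows, invented for the example.
rows = [
    {"mrn": "A1", "gender": "F", "birth_year": 1960},
    {"mrn": "A2", "gender": "X", "birth_year": "1970"},
    {"mrn": "A3", "gender": "M"},
]
problems = check_rows(rows, SPEC)
for problem in problems:
    print(problem)
```

A file can pass a check like this at every site and still differ across sites in what the values actually mean, which is the kind of inconsistency no spec can rule out.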

This was revealed pretty starkly on the CRN's Pharmacovigilance project when we looked at chemotherapy data. Chemotherapy can be coded in both pharmacy and procedures data, with varying levels of precision in each. Across the eight sites on that study, we found that at some you'd find evidence of chemo almost exclusively in procedures data, and in others you'd find it only in pharmacy data. This was due to differences in the way the different sites' health plans delivered and/or paid for chemo treatment. Sites that didn't do any chemo in-house had everything coming off of rx claims data.
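One way to "test the waters" for this sort of thing is to tally, per site, where the evidence for an exposure actually turns up. The sketch below is only an illustration under stated assumptions: the site labels, the record layout (site, source) and the counts are all invented, and it is not the method the Pharmacovigilance project used.

```python
from collections import Counter, defaultdict


def chemo_source_shares(records):
    """Given (site, source) pairs, return {site: {source: share of site's records}}."""
    counts = defaultdict(Counter)
    for site, source in records:
        counts[site][source] += 1
    shares = {}
    for site, by_source in counts.items():
        total = sum(by_source.values())
        shares[site] = {src: n / total for src, n in by_source.items()}
    return shares


# Invented toy data: one site codes chemo mostly as procedures,
# the other only ever shows it in pharmacy data.
records = (
    [("site_a", "procedure")] * 9
    + [("site_a", "pharmacy")] * 1
    + [("site_b", "pharmacy")] * 4
)

shares = chemo_source_shares(records)
for site, by_source in sorted(shares.items()):
    print(site, {src: round(share, 2) for src, share in by_source.items()})
```

If a site's shares are lopsided, a program that looks in only one of the two sources will badly undercount there, so the study program needs to look in both.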

Q: The VDW tables are described as “Quality Assured”: What does this QA include? What does it not include?

Frankly, I don't think I would describe them that way--not generally anyway. I don't mean any disrespect to my colleagues on the Workgroups or at the sites. I just don't think we have done enough with QA to really make a general representation that the data are "quality assured" (and what that means to different people will be different, so it's probably perilous to claim even under the best circumstances). I know we at GHRI have certainly not done enough. We aspire to it, and are striving toward better QA, but we are not even close to where we should be, alas.

So--that sounds really bad right? Holy cow, this guy doesn't believe in his own project! But here's what I will claim about VDW file quality: it is just about guaranteed to be better than what you would get if you put together ad-hoc extracts from indigenous sources for your study. There are 2 reasons I'm comfortable claiming this:

1. We have better programmers at the sites than you're likely to get for a project, and
2. You're not the only one using the VDW--it has been heavily scrutinized by many prior uses. That scrutiny discovers issues, which get fixed to benefit subsequent studies.

The first point owes to the generosity and foresightedness of the Site Directors, who agreed to get us the people we have. The second is one of my favorite things about the VDW--when used properly, it is a vehicle for cumulating the value of programming efforts across many studies.

I could go on, but who will even read this far? ;-)

Friday, March 4, 2011

VDW Q & A: CRN Scholars' Questions
