Friday, September 3, 2010

Fundamentals

In which we ponder some very basic questions.

What are the goals of the VDW? Why does it exist?

The primary reason the VDW exists is to ease the burden of multi-site health-data-based research. By putting health data in a common format and specifying standard access methods, the VDW allows programming work done at one site to be used by the staff at most any other VDW-implementing site. The result is significant savings in cost and time spent on programming work.

In addition to that primary goal, the VDW has also had a couple of very nice side effects:
  • At many sites, because the VDW offers local programmers a comparatively simple interface to its data, those programmers can be more productive even on local-only work.
  • It has enabled program code to be reused across sites as well as over time. This creates an economy of scale that should give us better code--because it has been used, and its results scrutinized, by many people on many different projects.

That first side effect has a nice side effect of its own: the additional scrutiny that comes with local-only use results in additional vetting and defect detection. Sites whose local users work with the VDW files almost certainly have better-quality files, because bugs are discovered (and, we presume, fixed) more quickly.

What sorts of things can be "in the VDW"?

In general, the VDW as a system consists of datasets and program code that manipulates those datasets. Datasets fall into two categories: those that hold substantive data (e.g., enrollment, utilization) and lookup or reference datasets (e.g., the datasets of chemotherapy-signifying codes that the CRN has put together).

Code is generally SAS--either macros that get incorporated into the standard macro library, or freestanding .sas programs that perform a specific task.

What does it mean for a dataset to be "in the VDW"? What should it mean?

First and foremost, putting a dataset in the VDW means that we expect the dataset to support ongoing programs of research--not just individual projects. The VDW is infrastructure--it sits at a level above individual projects.

While we hope and expect that nearly all projects can benefit from the VDW to some extent, and that some portion can even be accomplished exclusively with VDW data, we do not expect that every project will be accomplishable using just VDW data. We do not consider being the be-all and end-all of research data an attainable goal. Implementing sites should not discard their indigenous data (or programmers ;-).

More practically, having a dataset in the VDW means that some group of people (technical and scientific) want the dataset to exist enough to form a workgroup that will:
  • articulate the sorts of uses they imagine for the data,
  • hammer out a spec (and optimally full-on implementation guidelines),
  • answer questions from implementers (e.g., construe the spec), and
  • write QA program(s) and verify implementations.
It also means that sites are able and willing to implement, update, document, and support the files.

If we knew that a given dataset (or variable) was not getting used, would it make sense to drop it from the VDW?

That is certainly worth thinking about. I'm looking at you, Census.

What makes a good VDW spec?

A good VDW dataset specification is implementable. The best indication of this is that it has in fact been implemented--preferably at more than one site. You never really know what issues and questions you will run into until you actually go to implement. Many directions that seem clear and complete from your armchair raise knotty questions once you dig into your indigenous data and see exactly what you have.

A good VDW spec is specific enough to give clear guidance to implementers as to how to wrangle their indigenous data into the spec, and to users as to what they should expect to find.

On the other hand, a good VDW spec is not unnecessarily specific--does not add requirements that do not serve the intended use of the data. There are many details of a dataset that do not really bear on its use. Variable labels and formats are easily in this category. I would argue that in most cases variable lengths are similarly of no import to end-users. In some cases (specifically, MRN, which is designed never to leave an implementing site) even variable type may not matter to users.

A good VDW spec preserves the detail available in indigenous data without making it impossible for sites with less available detail to implement. So, for example, the old-timers among us will recall the time when the enrollment spec called for one record per member per month, bearing only the year and month. If a record existed, it meant that Joe Blow was enrolled for some portion of that month, and that was it.

Compare the new spec, which allows for pinpoint dates of enrollment & disenrollment. Now, sites that have that level of detail can include it, and those that only have month-level granularity can document that and use the first/last day of the month in the enr_start and enr_end variables.
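A minimal sketch of that month-to-day mapping, assuming a hypothetical month-level source dataset (my_enroll, with variables mrn, enr_year, and enr_month--only enr_start and enr_end come from the actual spec):

```sas
* Hypothetical sketch: turning month-granularity enrollment data into
  the day-level enr_start/enr_end the spec calls for. Dataset and
  variable names other than enr_start/enr_end are assumptions. ;
data vdw_enroll;
  set my_enroll;                                    * one record per member per month ;
  enr_start = mdy(enr_month, 1, enr_year);          * first day of the month ;
  enr_end   = intnx('month', enr_start, 0, 'end');  * last day of the same month ;
  format enr_start enr_end mmddyy10.;
run;
```

A site with true day-level detail would simply assign its real start and end dates instead, and document its granularity either way.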

This can be a neat trick, of course--there's a real tension between accommodating full detail and still presenting a uniform interface across sites with different levels of detail.

A good VDW spec uses relevant industry standards as much as possible--for example, CPT, LOINC, and NDC codes. Coming up with our own code sets for these sorts of things is a burden we should avoid wherever possible.

A good VDW spec follows a coherent set of conventions for things like variable names (CamelCase or under_scores?) and missing value handling. This is an area where we could use some work, frankly--we have not paid enough attention to it in the past.
What are your thoughts on these questions? Please comment!

1 comment:

  1. Another thing having a dataset "in the vdw" should mean--we coin an official stdvars macro variable to use to refer to the dataset.
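    A hypothetical sketch of that convention--a site's stdvars file defines one macro variable per VDW dataset, so shared programs never hard-code site-specific library or dataset names (the libname path and names below are assumptions):

    ```sas
    * In the site-maintained stdvars file: ;
    libname vdw '\\some\site\specific\path';  * each site points this wherever ;
    %let _vdw_enroll = vdw.enrollment;        * the official handle for the dataset ;

    * Shared, portable code then refers only to the macro variable: ;
    proc sql;
      select count(*) as n_recs
      from &_vdw_enroll;
    quit;
    ```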
