Thursday, May 20, 2010

Going to an incremental update strategy on Utilization

We've been having trouble w/our V2 ute job for a while now--crazy long runtimes, running out of scratch space & such. With the work we're doing for V3, we started brainstorming on how we could go from the current complete-rip-and-replace job to something more surgical & efficient. We arrived at the strategy below, and I thought others might be interested and/or have valuable feedback.

Current V2 Strategy

In general, our V2 strategy has been to reshape each indigenous source of data into a denormalized version of encounter/dx/px--something like:

  mrn | adate | provider | dept | dx1 | dx2 | ... | dxN | px1 | px2 | ... | pxN

Then we mash each resulting file together, and process it as a single file (called 'source'):
  • extract 'encounters' out of it according to the V2 primary key (mrn|adate|enctype|provider)
  • reconcile logical conflicts & do encounter-level cleanup
    • sweep non-inpatient encounters up into embracing inpatient stays;
    • merge contiguous-in-time professional claims in inpatient settings into single inpatient encounters (so e.g., Joe Patient no longer looks like he had 30 individual one-day stays in June).
  • clean up weirdo dx/px codes, etc.
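That contiguous-claim merge is basically interval coalescing. Here's a minimal sketch in Python (the real job runs in SAS, and the field layout here is hypothetical--just admit/discharge date pairs for one patient):

```python
from datetime import date, timedelta

def merge_contiguous_stays(stays):
    """Collapse inpatient claims that abut or overlap in time into single stays.

    `stays` is a list of (admit_date, discharge_date) tuples for one patient;
    returns the merged list, sorted by admit date.
    """
    merged = []
    for admit, discharge in sorted(stays):
        # "Contiguous" means the next claim starts on or before the day
        # after the prior discharge.
        if merged and admit <= merged[-1][1] + timedelta(days=1):
            merged[-1] = (merged[-1][0], max(merged[-1][1], discharge))
        else:
            merged.append((admit, discharge))
    return merged

# Joe Patient's 30 one-day claims in June collapse into a single stay:
june = [(date(2010, 6, d), date(2010, 6, d)) for d in range(1, 31)]
assert merge_contiguous_stays(june) == [(date(2010, 6, 1), date(2010, 6, 30))]
```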
Now, because we had that PK & encounter-level cleanup to do, the easiest way for us to save prior processing was to store off the reshaped chunks of indigenous data. For inactive sources (e.g., from systems no longer producing new data) we did this once & stored a static chunk. For active sources (claims, until recently a pre-EPIC legacy feed, now a new direct-from-clarity feed) we would create static chunks for stuff back far enough in time that we expect it to essentially be static, and then process more recent data anew. That saved a fair bit of work & shaved time off the job, but we still have issues w/it barfing if the server is particularly loaded at the wrong time. So we're eager to try and do something smarter.

Version 3 + MD5() = Incremental Updates

So now in version 3 of utilization, we don't have to shoehorn the data into a uniform definition of an encounter. We still need to do our best to ensure that the records in the main encounter file are indeed encounters, and I believe we still need to do the encounter-level cleanup steps, but we're no longer tied to a particular definition of encounter. So if one of our sources has, say, the time of the encounter in addition to the date, we're free to take that into account in figuring out which records are dupes or otherwise mergeable and which are indeed distinct encounters. Another nice change is that we will have a single field (enc_id) to use as a unique key--much nicer/more convenient to use than the V2 "gang of four".

So here's what we're thinking of trying for V3.

The Two Horsemen of the Utilization Apocalypse

In general, the boogeyman around doing an incremental update is: "what if the already-processed data gets deleted or corrected in some way?" If we just take for granted that the first time we see a record & stash it in the vdw ute files it's good forevermore, we may miss out on subsequent corrections. (I'm not actually sure how much of a problem this is at GH--it may be negligible, but we are too anal-retentive to just take it on faith.) In addition to that boogeyman, you have the basic problem of figuring out which records are indeed already processed. Particularly for claims sources, where claims may bounce back and forth between the originating org & GH a couple of times to work out disputes, it's not uncommon to have new claims show up 3 months post-service. Others sail right through in a single month. So claims data is almost always going to be a melange of new-and-old. We can't be adding the same data over & over again, but running through every var to see if anything has changed seems too grim to contemplate.

MD5 To The Rescue (?)

One of the newer functions available in SAS (as of 9.1 maybe?) is md5(). MD5 is a cryptographic hashing algorithm that's wonderful for creating "digests" of documents--that is, a gibberish-looking hexadecimal number that acts as a fingerprint for the document. There are several cool things about cryptographic hashes.

First, tiny changes in the input result in pretty massive changes in the output. Check out the MD5s for the words 'cat' and 'cot' below:

  'document' | MD5
  -----------+---------------------------------
  cat        | d077f244def8a70e5ea758bd8352fcd8
  cot        | 97223fab7b0d4c64c07e6e004c602302
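For the curious, those same fingerprints fall out of any MD5 implementation--e.g., Python's hashlib (SAS's md5() returns the raw 16 bytes, so you'd format it with something like $hex32. to get a string like these):

```python
import hashlib

# Fingerprint each 'document'; one changed letter scrambles the whole digest.
for doc in ("cat", "cot"):
    print(doc, hashlib.md5(doc.encode("utf-8")).hexdigest())
# cat d077f244def8a70e5ea758bd8352fcd8
# cot 97223fab7b0d4c64c07e6e004c602302
```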

Second, the likelihood of two different inputs resulting in the same output is infinitesimally small. So it really is something like a fingerprint.

Our plan is to compute and store (in an off-spec variable) an md5 based on all the substantive data fields on their way into the main ute file, and use that to figure out which already-processed data has changed, and which data is not already in the vdw.
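Computing that hash amounts to smushing the substantive fields together with a delimiter and fingerprinting the result. A sketch in Python--every name here is illustrative (in SAS it'd be md5() over a catx('|', ...) of the actual variables):

```python
import hashlib

def source_data_hash(rec, fields, dxs, pxs):
    """Fingerprint the substantive content of one source record.

    `fields` is the per-source list of encounter-defining field names
    (the list can differ across sources); dx/px lists are sorted so
    that code order alone never changes the hash.
    """
    parts = [str(rec.get(f, "")) for f in fields]
    parts += sorted(dxs) + sorted(pxs)
    # Delimit so ('ab','c') and ('a','bc') can't produce the same input.
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

rec = {"mrn": "12345", "adate": "2010-05-01", "provider": "P9", "dept": "IM"}
h = source_data_hash(rec, ["mrn", "adate", "provider", "dept"],
                     dxs=["250.00", "401.9"], pxs=["99213"])
```

Any edit to a substantive field, or any added/dropped dx or px, yields a different hash; reordering the dx/px lists does not.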

In short:
  1. We build a set of files from scratch (status quo), plus we have the enc_id's (defined per-source, in such a way as they are unique across sources)
  2. Pull the last X months of data from each still-productive source (claims, clarity, DRG, OIC, short stay, lab results, & micro results)
  3. For each record read from source data, we compute an md5 hash for the smushing together of let's say all substantive (encounter-defining) fields (and the list can differ across sources if we want them to), plus all dx's and px's (at least for VIST-shaped stuff--it may not make sense to denormalize clarity data--or it might I suppose). Call this variable source_data_hash.
  4. Combine each new chunk of source data into a single dset (right?)
  5. Remove any record from source where the combination of enc_id and source_data_hash is found in vdw.utilization (e.g., source left join ute on both fields, keeping rows where ute.enc_id is null)
  6. That done, remove any record from vdw.utilization (and px/dx) whose enc_id is found in source.
  7. Denormalize/clean source--these are the new recs for ute/px/dx
  8. Append & update dset labels to indicate the new extent-in-time of the data therein.
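Steps 5 and 6 boil down to a keyed anti-join. A toy sketch with plain dicts (in production this would be a couple of PROC SQL / data steps against the real tables; names are hypothetical):

```python
def incremental_update(ute, source):
    """Apply steps 5-6 of the plan above.

    `ute` maps enc_id -> source_data_hash for what's already in the VDW;
    `source` is a list of (enc_id, source_data_hash, payload) tuples.
    Returns (records still needing processing, enc_ids to purge from the VDW).
    """
    # Step 5: a record whose (enc_id, hash) pair already exists in ute is
    # unchanged--nothing to do for it.
    fresh = [r for r in source if ute.get(r[0]) != r[1]]
    # Step 6: any surviving enc_id is new or changed; the changed ones get
    # their old version (and dx/px rows) purged before the reprocessed
    # records are appended.
    to_delete = {r[0] for r in fresh if r[0] in ute}
    return fresh, to_delete

ute = {"e1": "aaa", "e2": "bbb"}                    # already in the VDW
source = [("e1", "aaa", "unchanged"),               # dropped by step 5
          ("e2", "ccc", "corrected"),               # changed: purge & redo
          ("e3", "ddd", "brand new")]               # new: just process
fresh, to_delete = incremental_update(ute, source)
```

Here `fresh` carries only e2 and e3 forward to the denormalize/clean step, and `to_delete` holds just e2.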
That should get us pretty far. I think the only things it *won't* get us are the encounter-level cleanups, for cases where we've already processed some of the data that needs to be involved. To wit: sweeping non-inpatient data that occurs w/in the extent of an inpatient stay into the stay (when the stay in question was already processed), and collapsing contiguous-in-time inpatient claims into a single stay record (when some of those stays were already processed). In both cases we'll have to look at the substantive fields on the already-processed inpatient stay records to evaluate the need to sweep/merge. But God willing it'll be a small enough amount of data that it won't be too intensive to do.

What do you think--does this hold water?
