Current V2 Strategy
In general, our V2 strategy has been to reshape each indigenous source of data into a denormalized version of encounter/dx/px--something like:

| mrn | adate | provider | dept | dx1 | dx2 | ... | dxN | px1 | px2 | ... | pxN |
|---|---|---|---|---|---|---|---|---|---|---|---|
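Just to make that concrete, the reshape is basically a sort-and-transpose per source. Here's a minimal sketch, assuming a long diagnosis file (dx_long, one row per mrn/adate/provider/dept/dx_code)--the dataset and variable names are placeholders, and the px codes would get the same treatment:

```
* Pivot one-row-per-diagnosis into dx1-dxN per encounter key.        ;
* dx_long and dx_code are hypothetical names, for illustration only. ;
proc sort data=dx_long;
  by mrn adate provider dept;
run;

proc transpose data=dx_long out=dx_wide(drop=_name_) prefix=dx;
  by mrn adate provider dept;
  var dx_code;
run;
```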
Then we mash each resulting file together, and process it as a single file (called 'source'):
- extract 'encounters' out of it according to the V2 primary key (mrn|adate|enctype|provider)
- reconcile logical conflicts & do encounter-level cleanup
- sweep non-inpatient encounters up into embracing inpatient stays;
- merge contiguous-in-time professional claims in inpatient settings into single inpatient encounters, so that, e.g., Joe Patient no longer looks like he had 30 individual one-day stays in June (see the sketch after this list).
- clean up weirdo dx/px codes, etc.
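Here's a rough sketch of that contiguous-claims merge, just to make the idea concrete. It assumes an ip_claims dataset with per-claim start/end dates (adate/ddate) and collapses runs of claims separated by no more than a day into single stays; the names and the one-day gap rule are assumptions, not the production logic:

```
proc sort data=ip_claims;
  by mrn adate;
run;

data ip_stays(keep=mrn stay_start stay_end);
  set ip_claims;
  by mrn;
  retain stay_start stay_end;
  format stay_start stay_end mmddyy10.;
  * Start a new stay on a new person, or when this claim begins more ;
  * than one day after the stay we have been accumulating ends.      ;
  if first.mrn or adate > stay_end + 1 then do;
    if not first.mrn then output;    * emit the stay we just closed out ;
    stay_start = adate;
    stay_end   = ddate;
  end;
  else stay_end = max(stay_end, ddate);  * extend the current stay ;
  if last.mrn then output;
run;
```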
Version 3 + MD5() = Incremental Updates
So now in version 3 of utilization, we don't have to shoehorn the data into a uniform definition of an encounter. We still need to do our best to ensure that the records in the main encounter file are indeed encounters, and I believe we still need to do the encounter-level cleanup steps, but we're no longer tied to a particular definition of encounter. So if one of our sources has, say, the time of the encounter in addition to the date, we're free to take that into account in figuring out which records are dupes or otherwise mergeable and which are indeed distinct encounters. Another nice change is that we will have a single field (enc_id) to use as a unique key--much nicer/more convenient to use than the V2 "gang of four".

So here's what we're thinking of trying for V3.
The Two Horsemen of the Utilization Apocalypse
In general, the boogeyman around doing an incremental update is, "what if the already-processed data gets deleted or corrected in some way?" If we just take for granted that the first time we see a record & stash it in the vdw ute files it's good forevermore, we may miss out on subsequent corrections. (I'm not actually sure how much of a problem this is at GH--it may be negligible, but we are too anal-retentive to just take it on faith.) In addition to that boogeyman, you have the basic problem of figuring out which records are indeed already processed. Particularly for claims sources, where claims may bounce back and forth between the originating org & GH a couple of times to work out disputes, it's not uncommon to have new claims show up three months post-service. Others sail right through in a single month. So claims data is almost always going to be a melange of new and old. We can't be adding the same data over & over again, but running through every var to see if anything has changed seems too grim to contemplate.

MD5 To The Rescue (?)
One of the newer functions available in SAS (as of 9.1 maybe?) is md5(). MD5 is a cryptographic hashing algorithm that's wonderful for creating "digests" of documents--that is, a gibberish-looking hexadecimal number that acts as a fingerprint for the document. There are several cool things about cryptographic hashes.

First, tiny changes in the input result in pretty massive changes in the output. Check out the MD5s for the words 'cat' and 'cot' below:
| 'document' | MD5 |
|---|---|
| cat | d077f244def8a70e5ea758bd8352fcd8 |
| cot | 97223fab7b0d4c64c07e6e004c602302 |
Second, the likelihood of two different inputs resulting in the same output is infinitesimally small. So it really is something like a fingerprint.
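For the curious, here's how you'd reproduce those digests in SAS. The md5() function hands back the raw 16-byte digest, so you push it through the $hex32. format (and lowcase it) to get the familiar-looking string:

```
* Print the MD5 digests of 'cat' and 'cot' as lowercase hex strings. ;
data _null_;
  length word $ 3 digest $ 32;
  do word = 'cat', 'cot';
    digest = lowcase(put(md5(strip(word)), $hex32.));
    put word= digest=;
  end;
run;
```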
Our plan is to compute and store (in an off-spec variable) an md5 based on all the substantive data fields on their way into the main ute file, and use that to figure out what already-processed data has changed, and what data is not already in the vdw.
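In SAS terms, the idea is something like the sketch below. The field list, the 25-slot dx/px arrays, and the enc_id recipe are all placeholder assumptions--each source would plug in whatever its substantive fields and native key actually are:

```
* Fingerprint each record's substantive content on its way into ute.     ;
* All variable names below are hypothetical stand-ins for a real source. ;
data source_w_hash;
  set source;
  length enc_id $ 50 source_data_hash $ 32;
  * Per-source id, prefixed so it cannot collide with other sources. ;
  enc_id = catx(':', 'CLAIMS', claim_no);
  * Hash the encounter-defining fields plus the dx and px lists. ;
  source_data_hash = lowcase(put(md5(catx('|', mrn, adate, enctype,
    provider, dept, discharge_date, of dx1-dx25, of px1-px25)), $hex32.));
run;
```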
In short:
- We build a set of files from scratch (status quo), plus we have the enc_id's (defined per-source, in such a way that they are unique across sources)
- Pull the last X months of data from each still-productive source (claims, clarity, DRG, OIC, short stay, lab results, & micro results)
- For each record read from source data, we compute an md5 hash of the smushing-together of, let's say, all substantive (encounter-defining) fields (and the list can differ across sources if we want it to), plus all dx's and px's (at least for VIST-shaped stuff--it may not make sense to denormalize clarity data--or it might, I suppose). Call this variable source_data_hash.
- Combine each new chunk of source data into a single dset (right?)
- Remove any record from source where the combination of enc_id and source_data_hash is already found in vdw.utilization (e.g., source left join ute, where ute.enc_id is null--see the SQL sketch after this list)
- That done, remove any record from vdw.utilization (and px/dx) whose enc_id is found in source.
- Denormalize/clean source--these are the new recs for ute/px/dx
- Append & update dset labels to indicate the new extent-in-time of the data therein.
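And here's a rough PROC SQL rendering of the remove-already-processed and remove-superseded steps above, just to check the logic; the table and variable names are assumptions:

```
proc sql;
  /* Keep only source rows whose (enc_id, source_data_hash) pair is not */
  /* already in the warehouse--i.e., brand-new or changed records.      */
  create table source_new as
  select s.*
  from source as s
    left join vdw.utilization as u
      on s.enc_id = u.enc_id
     and s.source_data_hash = u.source_data_hash
  where u.enc_id is null;

  /* Then clear out the warehouse rows those records supersede */
  /* (the dx/px files would get the same treatment).           */
  delete from vdw.utilization
  where enc_id in (select enc_id from source_new);
quit;
```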
What do you think--does this hold water?