Implementing the Virtual Data Warehouse: 2010

Wednesday, September 8, 2010

1970 called--they want their programming language back

I've had a couple of conversations over the past few months in which people have questioned whether SAS is really still our best bet for the VDW. That is a question worth asking--SAS is not cheap, and has some significant shortcomings. The amount of disk space (and I/O work) required for long text fields can be close to heartbreaking, for example. The SQL optimizer can be maddeningly stupid (and it's pretty easy to write programs that have no hope of taking advantage of indexes, even if you have a decent indexing strategy). There's no persistant data-serving process to cache data across requests--unless you have some smarts & memory devoted to disk access (e.g., SAN/NAS type stuff) each SAS job does its own I/O tabula rasa.

I'm writing to defend SAS as a data-access technology for the VDW. I believe it is the one bit of software that we should require sites to have licensed in order to play in our VDW reindeer games. The only bit.

Some Distinctions

To start out I need to make a couple of distinctions. The first distinction is user location--is the user for a given VDW query local (from my site) or remote (from a sister site)? The second distinction is between the VDW's user interface--the parts of the system that an end-user actually touches/interacts with (by, e.g., pointing/clicking around a graphical interface, or writing commands, or speaking into a futuristic microphone of some kind ("diabetics please")), and the parts of the system that actually store and enable access to the VDW data. Call this second distinction front-end vs. back-end. The front-end is what defines what an end-user can do with the data.

Right now our user interface is of course SAS. Anything you can express in SAS using the files/variables we have, you can do with the VDW.

Location, Location, Location

Local and remote users have very different abilities and expectations. Local users are bound to learn the local data environment (VDW and non-VDW alike) including any tools supported at the site for accessing their site's data. They can pretty easily eyeball the data, and see when something looks goofy. They will also have higher expectations around performance--they will be the ones actually staring at the hourglass or spinning beachball or "Query executing..." status bar message while their queries run. (They also know exactly who to complain to when these things take up too much of their time.)

By contrast, remote users are at two distinct disadvantages. To begin with, they have to elicit the help of a local user to actually cause their program to run. (This is by design--the VDW was/is fearsome enough to site staff, who so feared a loss of control over their data that it was difficult even communicating what the VDW is. We figured trying to automate the local humans out of the process would be a non-starter.) That involves e-mail and maybe phone calls. The local humans then need to make sure the query is approved (frequently by other local humans), run it, evaluate whether it ran as expected, report any errors, evaluate whether the output is within the scope of the approval, figure out a good means for sending the output (is e-mail sufficient, or do we have PHI or other sensitive data requiring fancier measures?) and then actually exercise those means. The remote user in this scenario should not expect results back in minutes, the way a local user would. Lucky remote users will get results back in a matter of hours or days. Unluckier ones, well...longer. Remote users know this, and tend not to spend too much time staring at their in-boxes waiting for results to come in.

They are also at a disadvantage with respect to learning the local data environment. They don't know what your servers are named, what sorts of pipes exist between where their program runs & where the data actually live. They don't know what operating system you're running, or what server database, or that your DBA has a weakness for chocolate and works from home on Fridays.

No doubt you can see where I'm going with these distinctions. Here's why I don't see VDW moving away from SAS anytime soon.

As a user-interface, it's incredibly expressive, and offers a lot of convenience tools that would be painful to reproduce or do without.

SAS has a decent SQL implementation, an excellent procedural language (data step) and even a capable macro facility (so you can write code that writes code).

Imagine having to implement the functionality of one or all of these in say, Java:

TRANSPOSE
TABULATE or REPORT
UNIVARIATE

I'm no java programmer (my tastes run more to ruby and .net) but I bet you dollars to doughnuts it would be exceedingly unpleasant.

Even MEANS is a significant tool--try figuring out percentiles in say, SQL. No fun.

Even if SAS isn't entirely unique in these abilities (is SPSS a contender here maybe? R?) it's awfully good, even at large data volumes.

Just because SAS is the front-end does not mean you have to use it on the back-end

SAS has for years supported a suite of ACCESS products, which offer binary client access to such diverse products as:

Oracle
Teradata
Netezza
Sybase
Informix
MySQL
OLEDB (which gets you MSSQL, and anything else anybody's written a driver for)
ODBC (see OLEDB)

And (as they say) many more. If you license the right ACCESS module you can store your data pretty much anyplace you like.

Using SAS as the front-end means your remote users don't have to care what your back-end is

Because SAS/ACCESS pretty completely abstracts the back-end away, nobody remote really has to know or care what your back-end is, so long as it's available and has the expected data. It looks like a SAS dataset to SAS--I can feed it to TABULATE. Unless performance is so horrible that run time becomes a significant contributor to my wait (and recall all the waiting I'm having to do on the humans in the system) however many squirrels have to run on their exercise wheels to get me the data is really not my concern.

Now, will remote users get all the benefits of using e.g., a server database? Alas, no. But they will get many of them--SAS/ACCESS is increasingly sophisticated in which bits of queries it pushes to the db engine for execution.

At least as important, you have the freedom to invest in whatever back-end technology you like and can support. That preserves a domain of site freedom that I think is absolutely crucial to keeping sites engaged and happy. Staff at the sites are reallly the only ones who know what their loads are, and what resources are available to them. Are oracle DBAs hunted to extinction in your area (or just too damned expensive)? Go with Netezza. Licensing for Teradata too dear? Use MySQL. Does your parent org have a site license for DB2, and a ton of excess DB2 DBA capacity? Bob's your uncle.

Or maybe you're not feeling much pain with plain old SAS datasets--that too is an option of course.

So I say, let's communicate and share best practices of course (and we haven't done enough of that) and advise one another about what does/doesn't seem to work. But I don't see any gain to be had by invading the province of the SDMs and mandating some particular architecture. We've all got SAS, SAS is the interface. Apart from that, free to be you and me.

Just because your remote users use SAS to access your VDW data does not mean your local users have to.

As mentioned above, local users are the ones who suffer most from poor query performance (and they likely know where your desk is). If SAS' performance overhead is too costly to your local users, or they just prefer another tool, and you've taken the plunge into a server database, there is no reason not to offer these users access via that tool (apart from cost of course).

So--probably there's something I'm not thinking of, that makes some non-SAS language or environment more of a contender for VDW front-end. Tell me what it is in the comments.

Friday, September 3, 2010

Fundamentals

In which we ponder some very basic questions.

What are the goals of the VDW? Why does it exist?

The primary reason the VDW exists is to ease the burden of multi-site health-data-based research. By putting health data in a common format and specifying standard access methods, the VDW allows programming work done at one site to be used by the staff at most any other VDW-implementing site. The result is significant savings in cost and time spent on programming work.

In addition to that primary goal, the VDW has also had a couple of very nice side effects:

At many sites, because the VDW offers a comparatively simple interface to its data to local programmers, they can be more productive even for local-only work.
It has enabled program code to be reused across sites in addition to over time. This allows an economy of scale that should offer us better code--because it's been used and results scrutinized by many people on many different projects.

That first side effect has a nice side effect of its own: the additional scrutiny that local-only use gives results in additional vetting and defect detection. Sites whose local users use their VDW files almost certainly have better-quality files because bugs are discovered (and we presume, fixed) more quickly.

What sorts of things can be "in the VDW"?

In general, the VDW as a system consists of datasets and program code that manipulates those datasets. Datasets fall into 2 categories: those that hold substantive data (e.g. enrollment; utilization), and lookup or reference datasets (e.g., the datasets of chemotherapy-signifying codes that the CRN has put together.)

Code is generally SAS programs--either macros that get incorporated into the standard macro library, or as freestanding .sas programs that do a specific task.

What does it mean for a dataset to be "in the vdw"? What should it mean?

First and foremost, putting a dataset in the VDW means that we expect the dataset to support ongoing programs of research--not just individual projects. The VDW is infrastructure--at a level over projects.

While we hope and expect that nearly all projects can benefit from the VDW to some extent, and that some portion can even be accomplished exclusively with VDW data, we do not expect that every project will be accomplishable using just VDW data. We do not consider striving to be the end-all-be-all of research data to be an attainable goal. Implementing sites should not discard their indigenous data (or programmers ;-).

More practicaly, having a dataset in the VDW means that some group of people (technical and scientific) want the dataset to exist enough to form a workgroup, and:

articulate the sorts of uses they imagine for the data,
hammer out a spec (and optimally full-on implementation guidelines)
answer questions from implementers (e.g., construe the spec)
write qa program(s) and verify implementations, and
Sites are able and willing to implement, update, document and support the files.

If we knew that a given dataset (or variable) was not getting used, would it make sense to drop it from the VDW?

That is certainly worth thinking about. I'm looking at you, Census.

What makes a good vdw spec?

A good VDW dataset specification is implementable. The best indication of this is that it has in fact been implemented--preferably at more than one site. You never really know what issues and questions you will run into until you actually go to implement. Many directions that seem clear and complete from your armchair raise knotty questions once you dig in to your indigenous data and see exactly what you have.

A good VDW spec is specific enough to give clear guidance to implementers as to how to wrangle their indigenous data into the spec, and to users as to what they should expect to find.

On the other hand, a good VDW spec is not unnecessarily specific--does not add requirements that do not serve the intended use of the data. There are many details of a dataset that do not really bear on its use. Variable labels and formats are easily in this category. I would argue that in most cases variable lengths are similarly of no import to end-users. In some cases (specifically, MRN, which is designed never to leave an implementing site) even variable type may not matter to users.

A good VDW spec preserves detail available in indigenous data without making it impossible for sites with less available detail to implement. So for examplethe old-timers among us will recall the time when the enrollment spec called for one record per member per month, and bore only the year and month. If a record existed, it meant that Joe Blow was enrolled for some portion of that month, and that was it.

Compare the new spec, which allows for pinpoint dates of enrollment & disenrollment. Now, sites that have that level of detail can include it, and those who only have month-level granularity can document that, and use first/last day of month in the enr_start and enr_end variables.

This can be a neat trick of course--there's a real tension between accomodating full detail and still giving a uniform interface across sites w/different levels of detail.

A good VDW spec uses relevant industry standards as much as possible--for example CPT, LOINC, NDCs, etc. Coming up with our own codesets for these sorts of things is a burden that we should try to avoid wherever possible.

A good VDW spec follows a coherent set of conventions for things like variable names (CamelCase or under_scores?) and missing value handling. This is an area where we could use some work, frankly--we have not paid enough attention to it in the past.
What are your thoughts on these questions? Please comment!

Friday, August 27, 2010

Shoulding all over the Collaboratory

There’s tons of talk about the NIH’s Collaboratory initiative among HMORNsters. It’s still pretty undefined (RFA comes in early October last I heard) but that doesn’t stop us from having fun speculating on what it will/should entail in terms of scope and funding. Here’s my take (from my data-centric and generally geeky point of view) on what an optimal Collaboratory would look like.

I should maybe note that even though I’m located at Group Health where the cabal leading the proposal development is, I do not have any inside information about what the people actually in control are formulating—I am not part of the cabal. This is just me spouting off.

I start from the premise that NIH’s main goals are to increase the research bandwidth of the HMORN and to increase the access that non-HMORN investigators have to HMORN scientific staff and data. That may be incomplete or flawed, but that’s the impression I have. Those goals should be easy to objectively evaluate: if we did X studies/year in a pre-collaboratory world, we should be doing X + Y post-collaboratory. If our research involved people from A different external organizations pre-collab, it should involve A + B external orgs post-collab. The larger Y and B are, the more successful we have been.

Like crime, collaboration requires motive, means, and opportunity. The players who have resources (scientific & industry expertise; data) have to be motivated to collaborate. There has to be something in it for them—some good they can acheive by collaborating that they cannot acheive without. I don’t have much to say about motivation other than it is absolutely crucial, and is probably the toughest nut to crack.

So assume motivation away. If we can assume motivated players, how could a collaboratory enable their collaboration?

By creating an environment where Investigators can get to know and trust one another

It’s very temping—particularly for us technical folk—to focus on the technical issues involved in collaboration—how do we develop the dataset we need to address our questions, and once we have it, how do we get it from point a to point b? This is because these are by and large, fun problems with acheivable technical solutions. But we humans are primitive creatures, and way before any of the fun technical issues can come up, we have to establish a basic level of trust between people at the data sources, and those at the destinations. That requires a lot of good old fashioned schmoozing, and for that, there’s nothing like in-person meetings.

That’s particularly challenging for a distributed, virtual organization like the HMORN. Staff come together at the HMORN and other scientific conferences during the year, but those contacts are too few and far between to foster real trust relationships as quickly as we will want them to form. So we need to supplement these contacts with second-best, electronic contacts.

Specifically, the collaborative should include a Social Networking component—a means for the community of Investigators to discover and engage with one another informally. We need a FaceBook for investigators, where they can describe their interests, expertise, acheivements and goals, and communicate informally with one another about their current professional ideas and activities, and solicit/offer each others’ cooperation.

By creating a defined, discoverable and transparent process by which new proposals are evaluated and passed on

Right now there’s a lot of “you gotta know a guy” to playing in the HMORN. How do you get included in an HMORN grant application? You gotta know a guy. How do you get your foot in the door to initiate your own project idea? You gotta know a guy on the inside. (And I use the non-gender-specific sense of the word ‘guy’ here.)

A large part of that is the abovementioned trust issue. If only people we already know & trust can play in our reindeer games, then we don’t have to work with anybody we consider untrustworthy. Once someone is part of the informal inner circle, we can make the proper introductions and grease the skids for collaborations with them.

But some ideas are too time-sensitive or good to put them off. We can’t always wait for trust relationships to form naturally. For those ideas to have any hope of bearing fruit, we need to have an articulated process that outsiders can use to engage with the HMORN.
I quite like the way the Cancer Research Networkhandles this—they have a process laid out with an inquiry form to fill out on their public-facing website. The form asks who you are, what organization you’re with, what your idea is, whether you’re close to a CRN site, etc. Filling out that form results in an e-mail to CRN’s project manager (and an entry in a tracking database). The PM assigns each inquiry a “Collaboration Navigator” to it who is charged with helping the inquirer through the process of engaging with the CRN.

The collaboratory should use a similar process to help qualified candidates engage with the HMORN.

By making the HMORN and its scientific and data assets discoverable and documented

If we intend to collaborate with people from external organizations, those people are going to need to know what they would be getting into. Exactly what sorts of data do we have? Over what periods of time? How big is the population? Are there outstanding usablility issues w/the data, or is it ready to go? What sorts of uses has the data been put to in the past? Can I use my own programmer for the wrangling, or do I have to get the time of someone on the inside?

Potential collaborators will have these, and a host of other questions. The more of these we can document in writing the fewer we will have to answer (again and again and again) verbally. Because the answers to these questions will change over time, the best means for documenting them is on an easily edited, publicly accessible website. The collaboratory should fund the creation of such a website.

Thursday, July 1, 2010

There are bugs in the VDW datasets. All of them. Also: water is wet.

One of the signs of a mature software product is an explicit process for reporting bugs and issues. Once a bit of software gets beyond the scratching-your-own-itch phase, and other people start using it, you need at least two things for the project to flourish: documentation, and a process for reporting bugs. The VDW is very like a software product in this regard.

We've spent a bunch of time and energy discussing and generating documentation (we need to do a lot more documenting, but at least the need is widely recognized). I'm writing today to make the case for an explicit bug tracking process, and to suggest one.

Your data are ugly and your humans are fallible

The first hump for us to get over is owning that our VDW datasets--like every human-generated dataset too big to hand-scrutinize--do indeed have bugs in them. This can be a bit distressing to contemplate, particularly for people removed enough from the realities of legacy data & its wrangling that they are simply ignorant of the warts you run into when you sift through the stuff directly. Pregnant men. The deceased making clinic visits & filling prescriptions. Months where noone seems to have drug coverage. It's awfully tempting to put bad data into the "things that happen to other people" category, and convince yourself that the data that you are depending on are holy-writ valid. That would be lovely, but it's not realistic. Even data expressly collected for the purpose of answering a specific research question--like say prospectively generated for a randomized controlled trial--will frequently contain errors. So you can imagine that our indigenous data--collected by healthcare organizations for the purpose of providing good care and possibly to generate bills--is likewise going to contain errors. Humans are fallible, and the process by which data flows from clinic to research data warehouse is shot-through with human touchpoints.

I don't mean to overstate the problems--our data are incredibly useful, and absolutely valid for a wide range of uses. And of course you've got to consider the available alternatives--it's not like non-VDW datasets are bug free either. It's no use concluding "well, if VDW datasets have bugs, we should just up the budget on our project and do custom coding at all the sites." Because those datasets will also have problems--they're using the same warty source data, and are also wrangled by humans (custom coders are just as fallible as VDW data managers). You can maybe enjoy ignorance of particular bugs w/custom programmed datasets because those typically don't get the repeated scrutiny that VDW datasets get--you develop them, answer your questions, publish your papers & then throw them away. But just because you don't know they're there does not mean they aren't. (Cue scary background music.)

But it's crucial for us to acknowledge to ourselves and to our users that there is some bad data in our VDW datasets. Because the users already know it (or they will soon discover it), and putting bugs "out on the table" will do four very important things:

increase their respect for us and our product,
enlist their help in identifying and fixing problems,
reveal misunderstandings of the VDW and its specifications, and give us an opportunity to educate users / improve documentation, and
focus the discussion of the problems in VDW data.

Right now, without an articulated process for reporting bugs, users and their managers who run into problems may not in fact report problems--or, not in a constructive way anyway. Obviously, the people who need to know about bugs are the data managers at the relevant sites. But if the effort to make problems known to them is too great, users will be tempted into things like site-specific coding; ad-hoc workarounds (which will have to be repeated by future projects until the problem is finally fixed); and of course complaining to to collegues about how the VDW is not at all what it's cracked up to be. Some of that latter is absolutely justified--it's easy to oversell the VDW, and it's not always easy to distinguish VDW promotional/sales type material from VDW documentation.

Those things are bad for us. It's inefficient to have different users discover and fix the same warts with workarounds over and over again. It's bad when people think the VDW will answer every possible research question, and waste time and money going after things we don't have. It's bad when there are back-channels of negative discussion of the VDW.

What is a 'bug'?

This is simple: a bug is anything that frustrates a user's expectations. Any time a user can make a statement of the form "hey, you said that if I did X I would get Y, but instead I got Z" that is a bug in the VDW process. Maybe it's more accurate to call these 'issues', since there is probably a significant class of them that arise from misunderstanding of the VDW generally or one of its specifications or standard methods specfically.

What should a good bug reporting process entail?

Here's what I think needs to happen when someone discovers a bug in a VDW dset implementation.

The relevant SDM(s) need to be informed that a user believes there's a bug in their implementation.
Those SDMs need to respond to the report. Do they agree it's a bug? Is it fixable, and if so, on what timeframe? Can they recommend a good workaround in the meantime?
During the time between the bug report being accepted by the SDM and the SDM implementing a fix, prospective end-users should have a means for discovering the issue. To my way of thinking, the best place to note this is right on the dataset spec itself.
Optimally, someone will ride herd on the bugs, nagging the relevant people to complete fixes & record them. (Having the reports be very visible should hopefully exert pressure on SDMs to get fixes done.)
Once the fix is completed, that fact should be recorded, the reporting user should be informed, and the bug should come off the list of outstanding issues for that site/dataset.

Proposed Process 1: First-Class

I have long believed that to really do this right, you need a purpose-built custom web application to hold both the authoritative VDW dataset specifications (optimally, all VDW educational/promotional material, really) as well as the bug/issue reports. This information is by its nature volatile. What was an open bug two weeks ago is a fixed issue today. What was a no-known-bug-having dataset implementation yesterday is a nasty-bug-having thing today. Users need to know these things in pretty much real time, and nobody wants to have their efforts fixing a bug go unrecognized. If I fix something, I don't want to hear "oh don't use GH data for X--they don't have good data on that".

A custom web application would do things like:

Track who submitted the bug, to what dataset(s)/variables(s) it is directed, and have a narrative description of just how the results differed from expectations.
Track the responses of the relevant SDM(s)
Attach currently-open (that is, accepted, fixable and not yet fixed) and unfixable issues to the relevant dataset specs, so that prospective users will know right away what they're getting themselves into.
Maintain by-dataset and by-site lists of outstanding, fixed and not-fixable issues so that sites can prioritize fixes.
Maintain by-site lists of fixed issues to document our accomplishments.
Provide a dashboard view of implementations by site, showing their extent in time and their relative 'health', as determined by the number of issues currently open against them.

I believe I got pretty far on these goals with the prototype I put together back in February 2010. I think we could execute on this vision (or get even closer) given the right contractor (which I would be delighted to help hire) without breaking the bank.

Proposed Process 2: What we can do in the meantime

So--maybe you agree with my vision above, maybe not. But regardless--we can't be with the one we love, so we've got to love the one we're with. And that one is the CRN Portal. Here's what I propose for a good-enough process for now.

Report your bug by posting a comment to the relevant dataset spec page. Include information like:

What site(s)' implementation you're reporting the bug against
Exactly what you found (including preferably information on how the SDM could find the affected data, BUT NO PHI)
A description of how this was contrary to your expectations (if that's not obvious)

Send an e-mail to Dan and Roy informing them of the new bug.
Dan and Roy will then e-mail the relevant SDMs informing them of the bug report, and ask them to respond.
SDM(s) respond to the report comment on the data spec page, acknowleging or disputing the issue, asking questions, etc.
When an SDM makes a fix, (s)he notes that in a new response to the comment.
After some some period of time fixed issue comments and their responses will be deleted.

I believe the current website will accomodate this use, and I think it will improve the work we are able to do with the VDW.

Please comment!

Wednesday, June 30, 2010

Bug in SAS' SELECT DISTINCT implementation

Running the following code on SAS version 9.2 (TS1M0) on the XP_PRO and NET_SRV platforms, you get erroneous results on the second SELECT DISTINCT. Dataset with_index should only have a single record in it, but on the above-named platforms you get 2 records. It seems that SAS is ignoring the DISTINCT.


  %let startdt  = 01jan2009;
  %let enddt    = 31dec2009;
  
  data enrollment ;
    input
      @1  mrn       $char8.
      @9  enr_start date9.
      @21 enr_end   date9.
    ;
    format
      enr_: mmddyy10. ;
    ;
  datalines ;
  roy     01jan2008   30jun2009
  roy     01jul2009   30jun2010
  ;
  
  proc sql ;
    create table without_index as
    select distinct mrn
    from enrollment as e
    where e.enr_start <= "&enddt"d and
          e.enr_end   >= "&startdt"d
    ;
  quit ;
  
  
  proc sql ;
    ** This guy queers the DISTINCT result ;
    create index pk on enrollment (mrn, enr_start, enr_end) ;
  quit ;
  
  proc sql ;
    create table with_index as
    select distinct mrn
    from enrollment as e
    where e.enr_start <= "&enddt"d and
          e.enr_end   >= "&startdt"d
    ;
  quit ;

I'm not sure exactly what the conditions are for this bug, but the above definitely ain't right.

Thursday, May 20, 2010

Going to an incremental update strategy on Utilization

We've been having trouble w/our V2 ute job for a while now--crazy long runtimes, running out of scratch space & such. With the work we're doing for V3, we started brainstorming on how we could go from the current complete-rip-and-replace job, to something more surgical & efficient. We arrived at the below steps for a strategy & I thought others might be interested and/or have valuable feedback.

Current V2 Strategy

In general, our V2 strategy has been to reshape each indigenous source of data into a denormalized version of encounter/dx/px--something like:

mrn	adate	provider	dept	dx1	dx2	...	dxN	px1	px2	...	pxN

Then we mash each resulting file together, and process it as a single file (called 'source'):

extract 'encounters' out of it according to the V2 primary key (mrn|adate|enctype|provider)
reconcile logical conflicts & do encounter-level cleanup

sweep non-inpatient encounters up into embracing inpatient stays;
merge contiguous-in-time professional claims in inpatient settings into single inpatient encounters (so e.g., Joe Patient no longer looks like he had 30 individual one-day stays in June).

clean up weirdo dx/px codes, etc.

Now, because we had that PK & encounter-level cleanup to do, the easiest way for us to save prior processing was to store off the reshaped chunks of indigenous data. For inactive sources (e.g., from systems no longer producing new data) we did this once & stored a static chunk. For active sources (claims, until recently a pre-EPIC legacy feed, now a new direct-from-clarity feed) we would create static chunks for stuff back far enough in time that we expect it to essentially be static, and then process more recent data anew. That saved a fair bit of work & shaved time off the job, but we still have issues w/it barfing if the server is particularly loaded at the wrong time. So we're eager to try and do something smarter.

Version 3 + MD5() = Incremental Updates

So now in version 3 of utilization, we don't have to shoehorn the data into a uniform definition of an encounter. We still need to do our best to ensure that the records in the main encounter file are indeed encounters, and I believe we still need to do the encounter-level cleanup steps, but we're no longer tied to a particular definition of encounter. So if one of our sources has, say, the time of the encounter in addition to the date, we're free to take that into account in figuring out which records are dupes or otherwise mergeable and which are indeed distinct encounters. Another nice change is that we will have a single field (enc_id) to use as a unique key--much nicer/more convenient to use than the V2 "gang of four".

So here's what we're thinking of trying for V3.

The Two Horseman of the Utilization Apocalypse

In general, the boogeyman around doing an incremental update is, "what if the already-processed data gets deleted or corrected in some way?". If we just take for granted that the first time we see a record & stash it in the vdw ute files it's good forevermore, we may miss out on subsequent corrections. (I'm not actually sure how much of a problem this is at GH--it may be negligible, but we are too anal-retentive to just take it on faith.) In addition to that boogeyman, you have the basic problem of figuring out which records are indeed already processed. Particularly for claims sources, where claims may bounce back and forth between the originating org & GH a couple of times to work out disputes, it's not uncommon to have new claims show up 3-months post service. Others sail right through in a single month. So claims data is almost always going to be a melange of new-and-old. We can't be adding the same data over & over again, but running through every var to see if anything has changed seems too grim to contemplate.

MD5 To The Rescue (?)

One of the newer functions available in SAS (as of 9.1 maybe?) is md5(). MD5 is a cryptographic hashing algorithm that's wonderful for creating "digests" of documents--that is, a gibberish-looking hexadecimal number that acts as a fingerprint for the document. There are several cool things about cryptographic hashes.

First, tiny changes in the input result in pretty massive changes in the output. Check out the MD5s for the words 'cat' and 'cot' below:

'document'	MD5
cat	d077f244def8a70e5ea758bd8352fcd8
cot	97223fab7b0d4c64c07e6e004c602302

Second, the likelihood of two different inputs resulting in the same output is infinitesimally small. So it really is something like a fingerprint.

Our plan is to compute and store (in an off-spec variable) an md5 based on all the substatntive data fields on their way into the main ute file, and use that to figure out what already-processed data has changed, and what data is not already in the vdw.

In short:

We build a set of files from scratch (status quo), plus we have the enc_id's (defined per-source, in such a way as they are unique across sources)
Pull the last X months of data from each still-productive source (claims, clarity, DRG, OIC, short stay, lab results, & micro results)
For each record read from source data, we compute an md5 hash for the smushing together of let's say all substantive (encounter-defining) fields (and the list can differ across sources if we want them to), plus all dx's and px's (at least for VIST-shaped stuff--it may not make sense to denormalize clarity data--or it might I suppose). Call this variable source_data_hash.
Combine each new chunk of source data into a single dset (right?)
Remove any record from source where the combination of enc_id and source_data_hash are found in vdw.utlilzation (e.g., source left join ute, where ute.enc_id is null)
That done, remove any record from vdw.utilization (and px/dx) whose enc_id is found in source.
Denormalize/clean source--these are the new recs for ute/px/dx
Append & update dset labels to indicate the new extent-in-time of the data therein.

That should get us pretty far. I think the only things it *won't* get us are the encounter-level cleanups we do, for stuff where we've already processed some of the data that need to be involved. To wit: sweeping non-inpatient data that occurs w/in the extent of an inpatient stay into the stay (when the stay in question was already processed), and collapsing contiguous-in-time inpatient claims into a single stay record (when some of those stays were already processed). In both cases we'll have to look at the substantive fields on the already-processed inpatient stay records to evaluate the need to sweep/merge. But godwilling it'll be a small enough amount of data that it won't be too intensive to do.

What do you think--does this hold water?

Implementing the Virtual Data Warehouse