Implementing the Virtual Data Warehouse

Tuesday, November 10, 2015

How we do Quality Assurance for the VDW

Last weekend I attended the most excellent PCORI Data Quality Code-a-thon, hosted by Michael Kahn and his colleagues over at University of Colorado, at which I met some really interesting and smart people doing really interesting work. A couple of them evinced an interest in VDW QA work and I said I'd share the substantive checks that we are doing.

Some Context

This is volunteer work

Like most everything VDW, QA work is largely unfunded and distributed across implementing sites. Volunteers from the data area workgroups (e.g., Utilization, Pharmacy, Enrollment) put together lists of checks pertaining mostly to their data areas, write VDW programs that implement the checks & periodically (generally annually, but sometimes more frequently) make a formal request that implementing sites run the code & submit their results to the program author(s) for collation & reporting out to the VDW Implementation Group.

One big implication of this is that our approach is not nearly as coordinated as an outside user might expect. I'd like to say that we are evolving toward a common approach, and we do have a new(ish) "Cross-File QA" group that's taking on meta-standards for QA work, but there is definitely a long way to go before this is uniform enough to be coherent to anyone not familiar with the history.

QA Is Multi-purpose

We generally try to kill 2 birds with our QA stones. Primarily we want to characterize the quality of our implementations for ourselves, each other, and our user community. But we also love it when our reports are useful to Investigators writing grant applications, who sometimes need to brag about e.g., how many person-years' worth of data we have across the network for people with a drug benefit.

This can be a slippery slope, on occasion leading individual sites to declare that a given measure has strayed from QA (which is generally exempt from IRB approval) into substantive research territory, or else exposes what should be proprietary information. One example that comes to mind on Enrollment was a measure of churn--e.g., in a typical month, how many enrollees does a site tend to lose to disenrollment, and how many do they pick up? It's a constant dance/negotiation.

Roy's QA Prejudices

To my way of thinking the best QA:

Enables implementers to find (and fix) their own errors first, before exposing them to any larger audience. This is a matter of professional courtesy.
Includes as many objective checks as are practical to implement in code, and presents the running user with:

A clear list of what the checks are
What the tolerance is for those checks (e.g., 3% of your patient language records can have nonstandard values, but any more than 5% and we're going to say you failed the check).
Whether the file passed or failed each check.

More general descriptives characterizing the amount and predominant values in the data. These are often most useful when viewed as part of collated output so you can compare sites.
Collated quality/descriptive reports

should be readily available to the user community (we have them up on the HCSRN web portal, behind the password-protected area).
should be easily updated (completely automatically if possible) so that implementers are incentivized to fix whatever issues they can as soon as possible (and get credit for doing so).

Following the lead of the Utilization (encounters) workgroup, we generally refer to the objective checks as "Tier 1" checks and the descriptives as "Tier 2". Like most things, the checks are a matter of negotiation within the workgroup. I've come to think of them as crucial adjuncts to the specs themselves because they sometimes reveal reasonable disagreements on how to interpret the specs.

The Checks

Demographics

Tier 1

All variables listed in the spec exist and are the proper type (character or numeric). I don't personally like to check length and so don't, though there is diversity of opinion on that & so some QA programs do. There is no tolerance on these checks--any missing variable is a fail.
For those variables that have an enumerated domain of permissible values (pretty much everything but MRN and birth_date) that those are the only values found. If > 2% of the values found are off-spec we issue a warning. At 5% or greater we fail the check.
MRN is unique. Zero tolerance here.
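To make the tolerance idea concrete, here is a minimal sketch in Python (the real VDW QA programs are SAS, and this is not their code). The function names and the permitted gender values are made up for illustration; the thresholds follow the prose above (warn above 2% off-spec, fail at 5% or more).

```python
from collections import Counter

# Thresholds as described in the prose: warn above 2% off-spec, fail at 5% or more.
WARN_ABOVE = 0.02
FAIL_AT = 0.05

def check_domain(values, permitted, warn_above=WARN_ABOVE, fail_at=FAIL_AT):
    """Return (result, off_spec_fraction) for one enumerated variable."""
    values = list(values)
    if not values:
        return "pass", 0.0
    off_spec = sum(1 for v in values if v not in permitted)
    frac = off_spec / len(values)
    if frac >= fail_at:
        return "fail", frac
    if frac > warn_above:
        return "warn", frac
    return "pass", frac

def check_unique(mrns):
    """Zero tolerance: any repeated MRN fails the check."""
    dupes = [m for m, n in Counter(mrns).items() if n > 1]
    return ("fail" if dupes else "pass"), dupes

# Made-up records, for illustration only (the permitted set is hypothetical).
genders = ["F"] * 60 + ["M"] * 37 + ["?"] * 3   # 3 of 100 values are off-spec
result, frac = check_domain(genders, permitted={"F", "M", "O", "U"})
print(result, round(frac, 3))
print(check_unique(["001", "002", "002"]))
```

Returning the off-spec fraction along with the verdict lets the report show how close a site is to the next threshold, not just whether it passed.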

Tier 2

Counts of records (not just current enrollees) by

Gender
Race
Ethnicity
Age group
Need for an interpreter

Counts of enrollees over time by those same variables.
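The Tier 2 descriptives are really just tallies. A rough Python sketch of counting records by gender and by age group might look like the following; the 20-year age bands, the field names, and the sample people are all assumptions for illustration, not the workgroup's actual definitions.

```python
from collections import Counter
from datetime import date

def age_group(birth_date, as_of, width=20):
    """Bucket age in whole years into bands such as '20-39' (band width is an assumption)."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    age = as_of.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Made-up demographic records, for illustration only.
people = [
    {"gender": "F", "birth_date": date(1950, 6, 1)},
    {"gender": "M", "birth_date": date(1985, 12, 25)},
    {"gender": "F", "birth_date": date(1990, 1, 15)},
]
as_of = date(2015, 11, 10)

by_gender = Counter(p["gender"] for p in people)
by_age = Counter(age_group(p["birth_date"], as_of) for p in people)

for label, counts in (("gender", by_gender), ("age group", by_age)):
    for value, n in sorted(counts.items()):
        print(label, value, n, f"{n / len(people):.0%}")
```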

Enrollment

Tier 1

All variables listed in the spec exist and are the proper type (character or numeric). Here again zero tolerance.
For those variables that have an enumerated domain of permissible values (all but MRN, the start/stop dates, PCP and PCC) that those are the only values found. Same tolerance here as with demog-- > 2% is a warn, > 5% a fail.
Enr_end must fall after enr_start. Zero tolerance.
Enr_end must not be in the future. Warn at 1%, fail at 3%.
At least one of the plan type flags (which say whether the person/period was enrolled in a PPO, HMO, etc.) must be set. Warn at 2% and fail at 4%.
Ditto for the insurance type flags (e.g., commercial, medicare/caid, etc.).
If any of the Medicare part insurance flags are set, the general Medicare insurance flag must be set. Zero tolerance.
No period prior to 2006 should have the Medicare Part D flag set. 1% warn, 2% fail.
If the Part D flag is set, the drugcov flag must also be set.
If the high-deductible health plan flag is set, either the commercial or the private-pay flag must also be set.
If any of the incomplete_* variable values is 'X' (not implemented) then all values must be 'X'.

(That last check refers to six variables too new to be listed in the publicly available specs. They let implementers surface known problems with data capture, if there are any. For example, at Group Health we only have tumor registry information on people who live in one of the seventeen WA State SEER counties.)
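Several of the enrollment checks above are record-level rules, each with its own tolerance. Here is a hedged Python sketch of how three of them could be expressed (again, not the actual SAS). enr_start and enr_end come from the spec; the Medicare flag names, the 'Y' flag value, and the as_of parameter are assumptions for illustration. "Enr_end must fall after enr_start" is read literally, so equal dates count as a violation.

```python
from datetime import date

# Hypothetical names for the Medicare part flags.
MEDICARE_PARTS = ("ins_medicare_a", "ins_medicare_b", "ins_medicare_c", "ins_medicare_d")

def grade(n_bad, n_total, warn_at, fail_at):
    """Turn a violation count into pass/warn/fail; 0.0 thresholds mean zero tolerance."""
    if n_bad == 0 or n_total == 0:
        return "pass"
    rate = n_bad / n_total
    if rate >= fail_at:
        return "fail"
    if rate >= warn_at:
        return "warn"
    return "pass"

def run_checks(records, as_of):
    # Each rule: (function returning True when a record VIOLATES it, warn_at, fail_at)
    rules = {
        "enr_end after enr_start": (
            lambda r: not r["enr_end"] > r["enr_start"], 0.0, 0.0),
        "enr_end not in the future": (
            lambda r: r["enr_end"] > as_of, 0.01, 0.03),
        "Medicare part implies Medicare": (
            lambda r: any(r.get(f) == "Y" for f in MEDICARE_PARTS)
            and r.get("ins_medicare") != "Y", 0.0, 0.0),
    }
    results = {}
    for name, (violates, warn_at, fail_at) in rules.items():
        n_bad = sum(1 for r in records if violates(r))
        results[name] = grade(n_bad, len(records), warn_at, fail_at)
    return results

# Made-up enrollment periods: 2 of 100 end after the as-of date.
good = {"enr_start": date(2010, 1, 1), "enr_end": date(2014, 12, 31), "ins_medicare": "N"}
future = dict(good, enr_end=date(2016, 1, 31))
records = [good] * 98 + [future] * 2
results = run_checks(records, as_of=date(2015, 11, 10))
for name, verdict in results.items():
    print(f"{verdict:5} {name}")
```

Keeping the rules in one table means the list of checks and their tolerances can be printed straight from the same structure that runs them.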

Tier 2

Counts and percents of enrollees over time by all substantive enrollment variables--plan types, insurance types, drug coverage, etc.
Counts & percents over time by several demographic variables (listed above under demographics).
Counts & percents of enrollees over time by whether the values in primary care clinic (PCC) and primary care physician (PCP) appear to be valid (that is, contain at least one non-space and non-zero character).
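The "appears to be valid" test for PCC and PCP is simple enough to sketch directly. This Python version is only an illustration of the rule as worded above (at least one character that is neither a space nor a zero); the function name and sample values are made up.

```python
def looks_valid(value):
    """True if value has at least one character that is neither a space nor a zero."""
    return value is not None and any(ch not in " 0" for ch in value)

# Made-up PCP values, for illustration only.
pcps = ["12345", "00000", "     ", "", "0A7"]
n_valid = sum(looks_valid(p) for p in pcps)
print(n_valid, len(pcps), f"{n_valid / len(pcps):.0%}")
```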

More Later

As I'm the primary author of the E/D checks, these were the easiest to hand. I'll be back with at least some of the other checks the other groups have implemented.

Wednesday, May 21, 2014

Axis Problem

How can I get the 0 value on this graph's x-axis to stay left-justified, without using a VALUES= option on the XAXIS statement?

The code for this is:

data gnu ;
  input
    @1  clinic       $char5.
    @7  report_date  date9.
    @19 readmit_rate 3.1
  ;
  format report_date monyy7. ;
datalines ;
north 31may2012   0.8
north 30jun2012   0.2
north 31jul2012   0.3
west  31may2012   0.0
west  30jun2012   0.0
west  31jul2012   0.0
;
run ;

options orientation = landscape ;

ods html path = "c:\temp" (URL=NONE)
body = "deleteme.html"
(title = "Axis Problems")
;

proc sgplot data = gnu ;
hbar report_date / response = readmit_rate ;
by clinic ;
xaxis grid /* values = (0 to .1 by .1) */ ;
yaxis grid ;
run ;

ods _all_ close ;

Friday, January 17, 2014

Why aren't these colors the same?

When I run:


proc format ;
  value $sub
    "bibbity" = "5"
    "bobbity" = "30"
    "boo" = "80"
    "baz" = "120"
    "foo" = "150"
    "zoob" = "180"
  ;
quit ;

data test ;
  do subject = 'bibbity', 'bobbity', 'boo', 'baz', 'foo' ;
    do obs_date = '01-jan-2010'd to '31-dec-2013'd by 30 ;
      num_widgets = input(put(subject, $sub.), best.) + floor(uniform(4) * 30) ;
      proportion_blue = uniform(4) ;
      output ;
    end ;
  end ;
  format obs_date mmddyy10. proportion_blue percent9.2 ;
run ;

%let out_folder = c:/temp/ ;

ods graphics / height = 6in width = 10in ;

ods html path = "&out_folder" (URL=NONE)
         body   = "deleteme.html"
         (title = "Why are bubble color and line color not coordinated?")
          ;

  proc sgplot data = test ;
    loess  x = obs_date y = num_widgets / group = subject ;
    bubble x = obs_date y = num_widgets size = proportion_blue / group = subject transparency=0.5 ;
    xaxis grid ;
    yaxis grid ;
  run ;

ods _all_ close ;

I get:

I think it's confusing that the bubble colors and the loess line colors don't match.

If I change from loess to a series plot, the colors match.

Anybody know how I can get the colors to match?

Thanks!

Sunday, November 10, 2013

Loving R's ggplot2!

So I'm messing around with R for a Coursera course I'm taking, and totally loving the ggplot2 library. Check out this lovely plot of some loan data we got from the course instructor:

That's a scatterplot with loess smoother + confidence intervals, all done with this simple call:


qplot(x     = FICO
    , y     = ir
    , data  = loansData
    , color = Loan.Length
    , geom  = c('point', 'smooth')
    , ylab  = "Interest Rate"
    , xlab  = "FICO Score"
    )

How awesome is that?

Saturday, September 21, 2013

Launchy + PowerShell = easy navigation between project folders

Like most programmers at GHRI, I do work in a multitude of different directories. Different projects store their programs & data in different folders, and there are numerous other folders that are important to my data infrastructure work.

When I'm called upon to navigate to these different folders I typically have to remember where they are & then 'cd' over to them (if I'm at a command line) or type into explorer's address bar one component at a time, waiting for the auto-complete (or attempting tab completion at the command line). This can be cumbersome--especially when I'm not physically connected to the network.

At some point I decided to set some environment variables for myself so I could just type, e.g., %myproj% into the Run window or Explorer's address bar or an Open File dialog & be taken there instantly. I found this very helpful--no more having to remember where things lived, just my nicknames for them.

Then after adopting powershell as my preferred command-line & discovering functions, I created a parallel set of functions that just did a 'cd' into the proper directory.

Then my machine was repaved & upgraded to Windows 7, and I lost my environment variables. Around the same time I read a lifehacker article on the Launchy utility & decided to try that. So rather than set up the environment vars I just created a special folder called Shortcuts into which I put shortcut files pointing to the various folders, named after my nicknames for the projects. I like Launchy quite a bit, but I did miss my environment variables for the odd Open File dialog.

So today I decided to delve into powershell scripting a little so that I could put all the information in a script, and have it generate:

The environment variables I missed,
the ps functions I wanted, and
the Launchy shortcuts I wanted.

Here's what I came up with--it seems to work pretty well.

$WinShell = New-Object -comObject WScript.Shell
$shrt_dir = "C:\Users\Roy\Desktop\shortcuts"
# nicknames and locations of my projects
$projects = @{"grif"  = "\\some_server\griffin\stupid name" ;
              "cupid" = "\\other_server\projects\cupid" ;
              "prod"  = "\\data_server\management\programs\"
}

foreach($prj in $projects.GetEnumerator()) {
  # Create a shortcut named for the nickname that points to the dir in the value.
  $shrtfile = $shrt_dir + '\' + $prj.key + '.lnk'
  $shrt = $WinShell.CreateShortcut($shrtfile)
  $shrt.TargetPath        = $prj.value
  $shrt.WorkingDirectory  = $prj.value
  $shrt.Save()

  # Create an environment variable for each.
  [Environment]::SetEnvironmentVariable($prj.key, $prj.value, "User")

  # Create a function for each nickname
  $this_func = "function " + $prj.key + "() {Set-Location '" + $prj.value + "'}"
  $this_func
  Invoke-Expression $this_func
}

Next up I want to change my prompt function so that those project folders show up as e.g., ::cupid:: in the prompt rather than the whole long thing. I'm already replacing $env:home with a '~', so I should just be able to loop through that $projects dictionary to make similar substitutions.
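The substitution itself is easy to sketch. This is in Python rather than PowerShell, purely to show the logic; the function names are made up, and the matching rules (case-insensitive, whole path components only, longest folder first) are my assumptions, not something I have running in a prompt yet.

```python
def is_under(path, root):
    """True if path is root itself or sits below it (case-insensitive, whole components)."""
    p, r = path.lower(), root.lower()
    return p == r or p.startswith(r + "\\")

def shorten(path, projects, home=None):
    """Replace a leading project folder with ::nickname::, or the home folder with ~."""
    # Try the longest roots first so a nested project folder wins over its parent.
    for nick, root in sorted(projects.items(), key=lambda kv: len(kv[1]), reverse=True):
        root = root.rstrip("\\")
        if is_under(path, root):
            return "::" + nick + "::" + path[len(root):]
    if home and is_under(path, home):
        return "~" + path[len(home):]
    return path

# Two of the nicknames from the script above.
projects = {
    "grif": r"\\some_server\griffin\stupid name",
    "cupid": r"\\other_server\projects\cupid",
}
print(shorten(r"\\other_server\projects\cupid\programs", projects))
print(shorten(r"C:\Users\Roy\Desktop", projects, home=r"C:\Users\Roy"))
```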

Tuesday, July 9, 2013

Principles of VDW development

Spitballing for a paper I'm (hopefully) helping to write.

The principles informing development of the VDW are:

The VDW exists to facilitate substantive health services, epidemiological, and health economics research, rather than advance the fields of medical informatics or computer science. Thus we take a very pragmatic (and often not very sexy) approach to data sharing. Results are always favored over methods.
Participation must not impair organizations' ability to protect patients/insureds as the human subjects of research, or abide by applicable laws and regulations (HIPAA, etc.). Participation must also not impair organizations' ability to protect the interests of both local researchers and their parent organizations (i.e., the health care/insurance providers).
The data standards must be open and publicly available so that any interested organization with relevant data can implement and potentially collaborate with other implementing organizations.
The data standards represent a "floor" rather than a "ceiling"—that is, implementers are free to embellish or enhance data structures in ways that do not break compatibility with the base VDW specifications, in order to increase the value to local researchers.
No "least common denominator" specifications. Participating organizations have varying levels of detail in their local data. Rather than discard all but the lowest level of detail available across all implementing sites, the best data specifications accommodate detail and make it optional. For example, if some organizations record the precise day that an insured disenrolls, but others always pad this out to last day of the month, the best specification will allow both organizations to put their data in VDW form.
Participation can be partial—an organization can choose to implement 5 out of the 11 data areas for example. Their attractiveness as a collaborator is thus diminished of course, but they are still participants in the VDW process.
Central coordination, but not authority. Participation is voluntary by all organizations. There is no central source of funding, and therefore no central authority directing the development of the VDW. Decisionmaking is by rough consensus with the goal of serving the greatest number of existing and foreseeable projects, while not overtaxing resources available at the sites.

The problem w/the 3rd bullet there is that there's no central clearinghouse where a non-HMORN member org can advertise their implementation & hold themselves out as potential collaborators.

I might add:

The data is never going to be perfect. Like all data collected for any reason, VDW data has problems, not all of which we know about. Problems can be introduced both in the local operational business systems from which it originates, and in the work implementers do to transform it into VDW data. We are committed to a transparent process by which problems are reported, prioritized, and fixed. But users should not assume that the data is pristine and suitable for their intended use--caveat user, and do please report issues.
VDW is not the end-all/be-all of research data warehouses. We are eager to (strategically) expand the domain of projects for which VDW data alone are sufficient--where current or future projects stand to benefit. But potential users should not assume that a given study can be carried out using nothing but VDW data. Custom programming is sometimes necessary, and some projects can't be carried out at all. Also, "my project needs new variable X" is not a reason for us to drop everything we're doing and implement variable X.

What am I forgetting? What could be said better?

Wednesday, May 8, 2013

Managing the standard macros

Current State

GHRI maintains an FTP server at vdw.hmoresearchnetwork.org, which we use to keep/maintain the VDW standard macros, and several associated data files (e.g., the format data for %vdw_formats).

There are 2 main accounts for this server, one with just read access (which is what we use in the FTP FILENAME statement in stdvars.sas) and one with write and read (which is what I use for updating the macros).

Advantages

SAS users at every site can %include the FTP-hosted file directly into their programs, ensuring they have the most recent version of the code at all times.
Having the code in a single place ensures that all users benefit from additions and bugfixes.
FTP is easy to script & we have an easy method of pushing out updates (e.g., a ruby script moves the file).

Disadvantages

FTP is rightly viewed as an outdated and insecure protocol.
One VDW site disables FTP on one of their servers, requiring local staff to manually download (and then selectively alter) the standard macros program file.
FTP has made it a pain for end-users to just browse the code, which has probably hindered use.

On at least two occasions over the last 5 or so years we have tested http access at the various sites. That worked every place that tested it (I don't recall who all reported in the last time, though it was a bunch of sites)--including Group Health. In that last round we also tested https access, which everybody could do except Group Health. (Our problem is our proxy server, which requires NTLM authentication, which not many non-browser clients know how to pass.)

Impending Change

KP Northern California is to take over the hmoresearchnetwork.org domain, and will implement a new web presence for the HMORN based on its Alfresco system.

This raises the questions:

Will KPNC be amenable to maintaining an FTP server as GHRI has lo these many years so that we can keep the status quo?
Do we want to keep the status quo, or is it time to either move to http(s), or to another model of disseminating this code altogether?
If http(s) do we want to keep the code on an hmorn.org domain or do we want to move things over to say, github?

The main candidate for "another model of disseminating this code altogether" is I think, the CESR convention of making static copies of macros which are included in the .zip file packages that represent CESR VDW application jobs. Even in that case though I think we'll want to have some 'canonical' repository of the code, to which updates should be directed & from which new copies should be had. I don't know how CESR is dealing with that currently.

If the answer to question 1 is 'no' and 3 is 'hmorn.org', then I think either KPNC staff will have to take charge of the macros, or else provide Roy with a means of pushing updates to the file (preferably easily automated--something git or ssh-based would be my preference).

The case for github

I for one, would like to see us move the whole schmeer over to a public github repository, for the following reasons.

It makes our macros capital-P Public, and I think many SAS programmers could benefit from them.
It will raise the profile of the HMORN on what has become the de-facto standard for open source projects. Note that we already have an 'organization' on github for the HMORN, which Christine Bredfeldt and I put together in order to share the PHI macros we wrote, and which Christine has published a paper on.
It makes available not just the macros as they exist now, but the complete history of edits back to 2009. All versions will be easily browsable.
Github has a lovely collaboration model, and should make it easy for any minimally competent git user to update a macro and share it back with the group, either via pull request, or by direct commit (for people given commit permission).
It is incredibly secure. In particular:

All private data exchanged with GitHub is always transmitted over SSL (which is why your dashboard is served over HTTPS, for instance). All pushing and pulling of private data is done over SSH authenticated with keys, or over HTTPS using your GitHub username and password. The SSH login credentials used to push and pull can not be used to access a shell or the filesystem. All users are virtual (meaning they have no user account on our machines) and are access controlled through the peer reviewed, open source git-shell.

The one down-side is that would-be collaborators would need to get comfortable with git before they could easily contribute. But that's not much of a problem relative to the status quo--people would still be able to e-mail me (or Christine, or any other person who creates an account & is given commit permission) changes that we could commit and push.
