Monday, November 5, 2012

Cool new lab results data visualization

I've been trying to figure out how to put different graphs together in the same picture for a while now--finally got it working & could not contain my glee (thus, the blog post ;-). My main motivation is identifying local_cd's that I shouldn't be using (e.g., results should be character, but are numeric; units are wrong; values are way out of range, etc.). Check it out--I've got a VDW program that spits one of these out per value of test_type:

The graphs depict:
  • The number of result records over time.
  • The number of result values that are numeric vs. character, stratified by the local lab codes used.
  • The distributions of numeric result values, by unit.
  • The number of numeric result values, by unit and local lab code.
  • The values of character results by local lab code.
I just this morning submitted an abstract for the 2013 HMO Research Network conference & hope to present the graphs & program there.

Sunday, October 14, 2012

New Sublime Text package available for SAS Programming

Please use my SAS Package for Sublime Text

I started writing a crazy long treatise on why I like text editors, but really the information I want to put out is this: I've been working on a package for SAS programming with the Sublime Text 3 editor, which I believe is really functional & I would love to get feedback on how to make it better.

I'm using this as my main editor now--have not opened up UltraEdit in literally weeks and could not be happier.


You will find the package in this github repository.  Installing it amounts to copying the files in that repo into a new subdirectory under the location where Sublime Text keeps your packages (which you can find by choosing the 'Browse Packages...' item off the 'Preferences' menu).  There are 2 good ways to do this:

If you've got git

Open a command prompt, cd over to the location where your Sublime packages are stored, and type:
git clone <repository URL> SAS
The nice thing about doing it that way is that you can update the repo anytime thereafter with a simple
git pull
in the package subdirectory.

If you don't have git

The easiest thing would be to download this zip file and extract it to a new SAS subdirectory that you create in your Packages directory.

What do I get?  What's working?

The package includes:
  1. A mostly written-from-scratch SAS program syntax definition file, so syntax highlighting works (including SAS' wonky comment style) and the symbol list gets populated w/dataset & macro names.
  2. Likewise, a SAS log syntax definition file, which highlights NOTEs, INFO messages and of course ERRORs.
  3. A build system (windows-only at the moment) which:
    1. Batch-submits the currently-showing file,
    2. Waits for the program to finish,
    3. Scans the resulting .log file for errors/warnings w/a fairly sophisticated regex (which, e.g., ignores license expiration WARNINGs), popping a dialog indicating whether the log is clean, and then
    4. Opens the log file for viewing.
  4. A command to skim through log files, jumping right to ERRORs and WARNINGs, by default bound to ctrl-e.  So just skip from trouble spot to trouble spot by hitting ctrl-e!
  5. A bunch of very useful snippets.
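For a rough idea of what that log-scanning step does, here's a sketch in Python.  The package's actual regex is more elaborate than this; the patterns and sample log below are purely illustrative:

```python
import re

def scan_sas_log(log_text):
    """Return the ERROR/WARNING lines from a SAS log, skipping
    benign license-expiration WARNINGs (patterns are illustrative)."""
    benign = re.compile(r"WARNING: .*(license|expire)", re.IGNORECASE)
    trouble = re.compile(r"^(ERROR|WARNING)[ :]")
    hits = []
    for line in log_text.splitlines():
        if trouble.match(line) and not benign.search(line):
            hits.append(line)
    return hits

log = """NOTE: The data set WORK.FOO has 10 observations.
WARNING: The Base SAS Software product will expire soon.
ERROR: File WORK.BAR does not exist.
"""
print(scan_sas_log(log))  # -> ['ERROR: File WORK.BAR does not exist.']
```

A clean log would come back as an empty list, which is what drives the "log is clean" dialog.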

What's not working yet or is still to be done?

My to-do is:
  1. Rewrite the build script (currently a windows script file) as a python command to make it portable across platforms. SORT OF DONE
  2. Incorporate this package into Package Control, to make installation & updating that much easier.
  3. Move the Roy- and Group Health-specific snippets out of the package & into my User package.
  4. Moar snippets!


Many many thanks to Jiangtang Hu, whose fork of Jakob Stoeck's SAS bundle was the beginning of my hope that I could profitably use Sublime Text for SAS programming!


Looks like there have been late comments to this post, with error reports.  Terribly sorry not to reply--I did not get notification about the comments.

Would anyone who's having trouble please log an issue at the github repository?

Also--please note that I am no longer supporting Sublime Text 2 (even though it's nominally the current version).  ST3 has been out for more than a year (?) now, is very stable, and that's all I have installed anywhere, so I'm not in a good position to support ST2.  Sorry!

Thank you!

Monday, July 2, 2012

You're telling me what you need--I'm telling you what we have.

Listening in on some of the PCORI Workshop To Advance Use of Electronic Data, I can't help but be reminded of this bit of the movie Apollo 13:

For organizations like the HMORN, it's crucial to remember that we are data recyclers--we don't get to say what the health plans/providers record in the conduct of their business. We get what we get.

A big part of the valuable expertise that we've built up over the years is how to wring the greatest amount of information possible out of the data that we do have.

Friday, June 22, 2012

A Design Pattern for Production SAS jobs

Okay, that title is probably putting on airs.  But I have been thinking about what a good set of conventions might be for us to use at GH, and I think I've happened upon one, and wanted to flesh it out a bit to see if it looked reasonable all in one place.

So here it is.


Lots of this relies on my personal preferences (e.g., many small files vs. one single monolithic job) and what may be unusual bits of the GH (or my personal) computing environment.  I do not intend to prescribe these as The One True Way for any and all people who may eventually read this.  I'll try to state my reasons for preferring something when I have reasons.  If I don't state any, feel free to assume that I don't have any & it's just arbitrary preference (or ask in the comments & I'll respond).

Some Context

Couple of salient details about the environment I'm working in.
  • (As of this writing anyway) the primary destination for the data marshaled in these jobs is SAS datasets sitting on a central "data warehouse" server.  That's a windows server & is a SAS/CONNECT rsubmit target.
  • Source data come from any or all of:
    • SAS datasets sitting on the same destination server.
    • SAS datasets sitting on a unix server (which is also a SAS/CONNECT rsubmit host).
    • GH's Enterprise Data Warehouse--an RDBMS.
    • GH's Cost Management Data system--a different RDBMS.
    • Clarity--on yet a third type of RDBMS.
  • My primary machine is a laptop, which often has only a poor connection to the network.
  • There are 3 of us working on "data infrastructure".
  • We use git for source code control. 


We want:

  • all jobs to run on a schedule, so the humans don't have to remember to kick things off.
  • to be able to roll back to a previous version of a file quickly in case there are errors or bad updates.
  • to be able to make and test fixes to jobs quickly, and without disturbing the user-facing (production) data.
  • it to be reasonably easy for a new person to understand/maintain a job they didn't create.
That last guy is the reason for the conventions.

The Conventions 

Destination Data


Datasets should live in folders that group them by content area.  (Right now we have all VDW datasets in a single folder, but the idea of putting each one in its own subfolder sometimes appeals to me.)  So encounters/pxs/dxs for instance should all be in a single folder.
  • Current production versions of each set of datasets should live in a top level folder, whose name signifies what datasets should be found therein (e.g., "ambulatory_visits").
  • Only production data (and associated views thereon) should live in these folders--there should be no ancillary or interim datasets, documents, programs, or other files in there.
  • Any ancillary and end-user-useful interim files should live in a subfolder called support.
  • Prior versions of datasets should live in a subfolder called past.  Space permitting, the immediate prior version of the dataset should be stored unzipped.
  • New versions of datasets should be stored in a subfolder called next, for whatever duration of time it takes for staff and/or users to validate them. 
So jobs build datasets in the /next subfolder, whatever QA and/or user validation is going to happen happens, and then dsets migrate from prod -> past, and from next -> prod.  This helps cut down on runtime errors due to users holding locks on the production data.
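The migration itself is just a folder rotation.  A minimal sketch in Python terms (the root folder and dataset name here are made up):

```python
import shutil
from pathlib import Path

# Folder layout follows the conventions above; the root folder and
# dataset name are illustrative stand-ins.
root = Path("ambulatory_visits")
for sub in ("next", "past", "support"):
    (root / sub).mkdir(parents=True, exist_ok=True)

dset = "enroll.sas7bdat"
(root / dset).touch()            # stand-in for the current prod dataset
(root / "next" / dset).touch()   # stand-in for the freshly built dataset

# prod -> past, then next -> prod.
shutil.move(str(root / dset), str(root / "past" / dset))
shutil.move(str(root / "next" / dset), str(root / dset))
```

In practice the older copies under /past would also get zipped, per the space-permitting note above.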


  • Dataset names should clearly convey what data is contained therein.
  • Datasets should be labeled to indicate their contents and their expected-complete-through date (e.g., "Professional claims paid through 31oct2012").
  • Variables should similarly be labeled descriptively, and include important caveats if necessary and/or aids to interpretation (e.g., 02 = breast, 03 = colorectal).

Source Code


In case it doesn't go without saying, all source code needed to produce production data, as well as any associated notes, documents, ancillary data, etc. should be under source code control.

The 'production' repository is the one named in the scheduled job.  At GH we store source in a separate folder structure that largely mirrors the datasets that the jobs create.

Substantive edits to source code must not be made directly to files in the production repo--you should always do real work on a clone whose changes you then pull into prod once they are tested and ready.  The only edits you should do in prod are the minor alterations necessary to have the job run in production mode (that is, so it runs successfully under a service account & replaces the prod data).  Unless you're feeling lucky. ;-)

In my case I usually have 3 repos/job: production, a working copy that lives on my laptop, and a 'staging' copy that lives on the network in a non-production area.  I typically edit on my laptop, pull changes into staging, run/test them from there (usually from an ad-hoc task scheduled on a machine that stays connected to the network).  I use this script for that scheduling, so I get an e-mail when the job finishes.  I do this because the jobs often run longer than I'm connected to the network and I want to run them without interruption.  So the sequence is usually:
  1. Clone prod onto my laptop.  Work there.
  2. Clone laptop into staging.  Test there.
  3. Pull from staging into prod.
  4. Pull from prod into laptop to bring that repo up-to-date.
Round-and-round she goes.  Git makes this muy easy.


Here's where the real fun is.

Like datasets, every job gets its own folder.  The programs necessary for the job all live in the top-level.  I have the following subfolders:
  • /tests: for holding unit test type programs--optimally one per macro in the job.
  • /supporting files: reference data (usually mdbs, for holding things like our native lab result codes & the VDW test_types they should be translated into), reconnaissance programs, ad-hoc fix programs; sketches for new approaches to things
  • /qa: data QA
  • /output: QA programs spit results out here.
  • /notes: various text files, word docs, etc. describing the job, or questions I asked/answered in the course of the programming.
Also like with datasets, the top-level folder has nothing but the .sas files necessary for the job.  (Because this stuff is all under SCC, we don't have to be sentimental/anal-retentive about obsolete code--we can go back in time anytime we want.)

There are two programs that are part of every job:
  1. A driver program, which is the entry-point for the job--that's what controls execution.
  2. An environment-setup program, which--wait for it--sets up the environment for the job, making it possible/easy to switch the job from "dev" mode (runs as me; maybe on subset data; does not disturb production data) to "prod" mode (runs under a service account; puts data in public-facing places).
By 'environment', I mean things like:
  • Where the job will write out any interim datasets.
  • Where it looks for ancillary datasets (e.g., our enrollment job has a separate dset of DRUGCOV information that lives in a file that only gets built once in a while--that type of thing).
  • Where it looks for files of account credentials/rlogin script files (these are different if I'm running in dev mode (runs as me) vs. production mode (runs as a service account). 
  • Crucially--where the job will write out its final datasets.
  • Whether any global subsetting parameters or "obs = X" type conditions should be in effect during the job.
  • Sometimes--the locations of the source data for the job.
Usually I set all these various libs/parameters up to prod-mode values, and then define a macro that sets them to dev values.

I would like to say that the environment-setup program is the only place you'll find a libname statement in my jobs.  I aspire to that, but it's sometimes tough when I have to rsubmit into server1 to do some stuff, and then rsubmit from that remote session into server2 to do some other stuff.  I'd love to hear ideas for overcoming that.

In addition to these two programs, there are a series of files that define the macros that actually get the work done.  There's typically nothing fancy about these macros--they just take a few parameters (usually naming datasets they should accept as input and spit out as output) and do a chunk of work, assuming that their environment is set up for them.  The macro-fication is strictly about being able to control which bits of the program run when.

So the driver program typically looks something like this (file names here are illustrative--yours will vary):
** Start a remote session on the server. ;
%include "//server/share/open_remote_session.sas" ;
** Make sure only one of these is in effect (or that the last definition is ;
** the one you want). ;
%**let codehome = //server/share/job_name/ ;       ** <-- prod. ;
%**let codehome = //other_server/share/job_name/ ; ** <-- staging ;
%let codehome = //roys_laptop/share/job_name/ ;    ** <-- laptop ;

%include "&codehome.environment.sas" ;
** Put us into development mode (comment this out to go prod) ;
%set_dev ;
%include "&codehome.gather_mbhist.sas" ;
%include "&codehome.gather_retro.sas" ;
%include "&codehome.interleave.sas" ;
%gather_mbhist(outset = mbh, obslim = &obs) ;
%gather_retro(outset = retr, obslim = &obs) ;
%interleave(mbhist = mbh, retro = retr, obslim = &obs, outset = retro_enrollment) ;

I've seen jobs where the files are named with leading numbers to indicate the order in which they are run.  That can be nice, but I don't tend to follow that convention myself.

There are a couple of advantages to setting up jobs this way:
  • It is very easy to see what mode a job is currently in, and to change it to a different mode.
  • Because the driver program is the only one actually executing code, I can use it to easily control the job.  If something barfs in a middle step I can comment-out the bits that ran properly and manually submit the rest.
  • If I'm smart about how I name the macros that make up a job, the driver program can read like a description of the high-level steps of a job.
  • Having code split up into discrete chunks makes it easier to test.  I can have, e.g., a program under job_name/tests/ that just does minimal environment setup & then %includes/calls that one bit of the job. 
  • Putting the macros in different files lets me navigate to the code of a given step more easily than if I just had a single monolithic file with everything in it (I use an editor w/a multi-tab interface, so it's easy to jump around). 
So there it is.  What do you think?

Lives Schmives

We and other networks like to toss around "covered lives" as a statistic with which to impress funders. I think this probably claims too much and too little at the same time.

It's too much because we don't literally have e.g., 13-million complete birth-to-death life stories in our data. But it's too little in that looking at a snapshot of who's enrolled over any particular period of time understates the *depth* of coverage over time. A lot of our members stay members over long periods of time. Not all, to be sure, but plenty do. They retire; change employers; move from place to place, but they very often stay with our orgs.

So what would be a better statistic for us to brag on? Summing up person-years goes some distance to address the issue, though that too undersells the longevity. Here are some candidate statistics we might use to give a better sense of what we've got:

  • total person/years
  • the median length of terminated enrollments (or maybe show percentiles on that distribution)
  • the proportion of terminations that are due to death.
  • Average proportion of life covered (that is, # covered days / (min(DOD, &sysdate) - DOB))
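To make these concrete, here's a sketch in Python that computes the candidate statistics from a toy table of enrollment periods.  The records, dates, and field layout are entirely made up:

```python
from datetime import date
from statistics import median

# Each tuple: (DOB, death date or None, enrollment start, enrollment end).
# All values are made-up illustrations, one enrollment period per person.
periods = [
    (date(1950, 1, 1), None,             date(2000, 1, 1), date(2010, 1, 1)),
    (date(1940, 6, 1), date(2008, 3, 1), date(1990, 1, 1), date(2008, 3, 1)),
    (date(1980, 5, 5), None,             date(2005, 1, 1), date(2006, 1, 1)),
]
today = date(2012, 7, 2)

# Total person-years of coverage.
person_years = sum((end - start).days for _, _, start, end in periods) / 365.25

# Median length (in days) of terminated enrollments.
med_termed = median((end - start).days
                    for _, _, start, end in periods if end < today)

# Proportion of terminations due to death (end date == death date).
pct_death = (sum(1 for _, dod, _, end in periods if dod == end)
             / sum(1 for *_, end in periods if end < today))

# Average proportion of life covered: covered days / (min(DOD, today) - DOB).
props = [(end - start).days / (min(dod or today, today) - dob).days
         for dob, dod, start, end in periods]
avg_prop = sum(props) / len(props)

print(round(person_years, 1), med_termed, round(pct_death, 2), round(avg_prop, 2))
```

Real enrollment data would of course have multiple periods per person, so the proportion-of-life calculation would sum covered days within person first.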
What other statistics can you think of?  Leave a comment & make your case!