Friday, June 22, 2012

A Design Pattern for Production SAS jobs

Okay, that title is probably putting on airs.  But I have been thinking about what a good set of conventions might be for us to use at GH, and I think I've happened upon one, and wanted to flesh it out a bit to see if it looked reasonable all in one place.

So here it is.


Lots of this relies on my personal preferences (e.g., many small files vs. one single monolithic job) and what may be unusual bits of the GH (or my personal) computing environment.  I do not intend to prescribe these as The One True Way for any and all people who may eventually read this.  I'll try to state my reasons for preferring something when I have reasons.  If I don't state any, feel free to assume that I don't have any & it's just arbitrary preference (or ask in the comments & I'll respond).

Some Context

Couple of salient details about the environment I'm working in.
  • (As of this writing anyway) the primary destination for the data marshaled in these jobs is SAS datasets sitting on a central "data warehouse" server.  That's a windows server & is  a SAS/CONNECT rsubmit target.
  • Source data come from any or all of:
    • SAS datasets sitting on the same destination server.
    • SAS datasets sitting on a unix server (which is also a SAS/CONNECT rsubmit host).
    • GH's Enterprise Data Warehouse--an RDBMS.
    • GH's Cost Management Data system--a different RDBMS.
    • Clarity--on still a third type of RDBMS.
  • My primary machine is a laptop, which often has only a poor connection to the network.
  • There are 3 of us working on "data infrastructure".
  • We use git for source code control. 


We want:

  • all jobs to run on a schedule, so the humans don't have to remember to kick things off.
  • to be able to roll back to a previous version of a file quickly in case there are errors or bad updates.
  • to be able to make and test fixes to jobs quickly, and without disturbing the user-facing (production) data.
  • it to be reasonably easy for a new person to understand/maintain a job they didn't create.
That last guy is the reason for the conventions.

The Conventions 

Destination Data


Datasets should live in folders that group them by content area.  (Right now we have all VDW datasets in a single folder, but the idea of putting each one in its own subfolder sometimes appeals to me.)  So encounters/pxs/dxs for instance should all be in a single folder.
  • Current production versions of each set of datasets should live in a top level folder, whose name signifies what datasets should be found therein (e.g., "ambulatory_visits").
  • Only production data (and associated views thereon) should live in these folders--there should be no ancillary or interim datasets, documents, programs, or other files in there.
  • Any ancillary and end-user-useful interim files should live in a subfolder called support.
  • Prior versions of datasets should live in a subfolder called past.  Space permitting, the immediate prior version of the dataset should be stored unzipped.
  • New versions of datasets should be stored in a subfolder called next, for whatever duration of time it takes for staff and/or users to validate them. 
So jobs build datasets in the /next subfolder, whatever QA and/or user validation is going to happen happens, and then dsets migrate from prod -> past, and from next -> prod.  This helps cut down on runtime errors due to users holding locks on the production data.
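That migration for a single dataset might be sketched like so, using PROC DATASETS' MOVE option (the libname paths & dataset name are invented):

```sas
** Illustrative only--paths & dataset name are made up. ;
libname prod "//server/share/ambulatory_visits" ;
libname past "//server/share/ambulatory_visits/past" ;
libname next "//server/share/ambulatory_visits/next" ;

** Retire the current production version. ;
proc datasets nolist ;
  copy in = prod out = past move ;
    select encounters ;
quit ;

** Promote the validated new version. ;
proc datasets nolist ;
  copy in = next out = prod move ;
    select encounters ;
quit ;
```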


  • Dataset names should clearly convey what data is contained therein.
  • Datasets should be labeled to indicate their contents and their expected-complete-through date (e.g., "Professional claims paid through 31oct2012").
  • Variables should similarly be labeled descriptively, and include important caveats if necessary and/or aids to interpretation (e.g., 02 = breast, 03 = colorectal).
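For instance, dataset and variable labels might get applied like this (all names here are invented):

```sas
** Illustrative only--dataset & variable names are made up. ;
data next.px_claims (label = "Professional claims paid through 31oct2012") ;
  set work.px_claims_staged ;
  label
    mrn         = "Medical record number (person-level key)"
    proc_date   = "Date the procedure was performed"
    screen_type = "Type of screening (02 = breast, 03 = colorectal)"
  ;
run ;
```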

Source Code


In case it doesn't go without saying, all source code needed to produce production data, as well as any associated notes, documents, ancillary data, etc. should be under source code control.

The 'production' repository is the one named in the scheduled job.  At GH we store source in a separate folder structure that largely mirrors the datasets that the jobs create.

Substantive edits to source code must not be made directly to files in the production repo--you should always do real work on a clone whose changes you then pull into prod once they are tested and ready.  The only edits you should make in prod are the minor alterations necessary to have the job run in production mode (that is, so it runs successfully under a service account & replaces the prod data).  Unless you're feeling lucky. ;-)

In my case I usually have 3 repos/job: production, a working copy that lives on my laptop, and a 'staging' copy that lives on the network in a non-production area.  I typically edit on my laptop, pull changes into staging, run/test them from there (usually from an ad-hoc task scheduled on a machine that stays connected to the network).  I use this script for that scheduling, so I get an e-mail when the job finishes.  I do this because the jobs often run longer than I'm connected to the network and I want to run them without interruption.  So the sequence is usually:
  1. Clone prod onto my laptop.  Work there.
  2. Clone laptop into staging.  Test there.
  3. Pull from staging into prod.
  4. Pull from prod into laptop to bring that repo up-to-date.
Round-and-round she goes.  Git makes this muy easy.


Here's where the real fun is.

Like datasets, every job gets its own folder.  The programs necessary for the job all live in the top-level.  I have the following subfolders:
  • /tests: for holding unit test type programs--optimally one per macro in the job.
  • /supporting files: reference data (usually mdbs, for holding things like our native lab result codes & the VDW test_types they should be translated into), reconnaissance programs, ad-hoc fix programs; sketches for new approaches to things
  • /qa: data QA
  • /output: QA programs spit results out here.
  • /notes: various text files, word docs, etc. describing the job, or questions I asked/answered in the course of the programming.
Also like with datasets, the top-level folder has nothing but the .sas files necessary for the job.  (Because this stuff is all under SCC, we don't have to be sentimental/anal-retentive about obsolete code--we can go back in time anytime we want.)

There are two programs that are part of every job:
  1. The entry-point program, which is what controls execution of the job.
  2. The environment program, which--wait for it--sets up the environment for the job, making it possible/easy to switch the job from "dev" mode (runs as me; maybe on subset data; does not disturb production data) to "prod" mode (runs under a service account; puts data in public-facing places).
By 'environment', I mean things like:
  • Where the job will write out any interim datasets.
  • Where it looks for ancillary datasets (e.g., our enrollment job has a separate dset of DRUGCOV information that lives in a file that only gets built once in a while--that type of thing).
  • Where it looks for files of account credentials/rlogin script files (these are different if I'm running in dev mode (runs as me) vs. production mode (runs as a service account). 
  • Crucially--where the job will write out its final datasets.
  • Whether any global subsetting parameters or "obs = X" type conditions should be in effect during the job.
  • Sometimes--the locations of the source data for the job.
Usually I set all these various libs/parameters up to prod-mode values, and then define a macro that sets them to dev values.
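A stripped-down sketch of what that setup might look like (all paths & names here are invented; the defaults are prod-mode values, and %set_dev overrides them):

```sas
** Environment setup--illustrative paths & names throughout. ;
** Default (prod-mode) values. ;
libname final "//server/share/ambulatory_visits/next" ;     ** final datasets ;
libname anc   "//server/share/ambulatory_visits/support" ;  ** ancillary data ;
%let credential_file = //server/secure/service_account.sas ;
%let obs = max ;  ** no subsetting in prod ;

%macro set_dev ;
  ** Point everything at dev locations & subset the data. ;
  libname final "//dev_server/scratch/job_name" ;
  libname anc   "//dev_server/scratch/job_name/support" ;
  %let credential_file = //roys_laptop/secure/my_account.sas ;
  %let obs = 5000 ;
%mend set_dev ;
```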

I would like to say that the environment program is the only place you'll find a libname statement in my jobs.  I aspire to that, but it's sometimes tough when I have to rsubmit into server1 to do some stuff, and then rsubmit from that remote session into server2 to do some other stuff.  I'd love to hear ideas for overcoming that.

In addition to these two programs, there are a series of files that define the macros that actually get the work done.  There's typically nothing fancy about these macros--they just take a few parameters (usually naming datasets they should accept as input and spit out as output) and do a chunk of work, assuming that their environment is set up for them.  The macro-fication is strictly about being able to control which bits of the program run when.
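Each of those files defines a single macro along roughly these lines (the file, libname, and dataset names here are invented):

```sas
** gather_retro.sas--defines one chunk of the job's work.  Names illustrative. ;
%macro gather_retro(outset =, obslim = max) ;
  ** Assumes the environment program has already defined libname raw. ;
  data &outset ;
    set raw.retro_enroll (obs = &obslim) ;
    ** ...recoding/cleaning work for this chunk goes here... ;
  run ;
%mend gather_retro ;
```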

So the entry-point program typically looks something like this (the file names here are illustrative):
** Start a remote session on the server. ;
%include "//server/share/remote_session.sas" ;
** Make sure only one of these is in effect (or that the last definition is ;
** the one you want). ;
%**let codehome = //server/share/job_name/ ;       ** <-- prod. ;
%**let codehome = //other_server/share/job_name/ ; ** <-- staging ;
%let codehome = //roys_laptop/share/job_name/ ;    ** <-- laptop ;

%include "&codehome.environment.sas" ;
** Put us into development mode (comment this out to go prod) ;
%set_dev ;
%include "&codehome.gather_mbhist.sas" ;
%include "&codehome.gather_retro.sas" ;
%include "&codehome.interleave.sas" ;
%gather_mbhist(outset = mbh, obslim = &obs) ;
%gather_retro(outset =  retr, obslim = &obs) ;
%interleave(mbhist = mbh, retro = retr, obslim = &obs, outset = retro_enrollment) ; 

I've seen jobs where the files are named with leading numbers to indicate the order in which they are run.  That can be nice, but I don't tend to follow that convention myself.

There are a couple of advantages to setting up jobs this way:
  • It is very easy to see what mode a job is currently in, and to change it to a different mode.
  • Because the entry-point program is the only program actually executing code, I can use it to easily control the job.  If something barfs in a middle step I can comment-out the bits that ran properly and manually submit the rest.
  • If I'm smart about how I name the macros that make up a job, the entry-point program can read like a description of the high-level steps of the job.
  • Having code split up into discrete chunks makes it easier to test.  I can have, e.g., a job_name/tests/ program that just does minimal environment setup & then %includes/calls that one bit of the job. 
  • Putting the macros in different files lets me navigate to the code of a given step more easily than if I just had a single monolithic file with everything in it (I use an editor w/a multi-tab interface, so it's easy to jump around). 
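So a unit-test program for, say, the %interleave step might look like this (file & dataset names invented):

```sas
** tests/test_interleave.sas--illustrative names throughout. ;
%let codehome = //roys_laptop/share/job_name/ ;
%include "&codehome.environment.sas" ;
%set_dev ;
%include "&codehome.interleave.sas" ;

** Hand-built inputs with known right answers would be created here... ;
%interleave(mbhist = test_mbh, retro = test_retr, obslim = max, outset = test_out) ;

** ...and then compared against expectations. ;
proc compare base = expected_out compare = test_out ;
run ;
```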
So there it is.  What do you think?

Lives Schmives

We and other networks like to toss around "covered lives" as a statistic with which to impress funders. I think this probably claims too much and too little at the same time.

It's too much because we don't literally have e.g., 13-million complete birth-to-death life stories in our data. But it's too little in that looking at a snapshot of who's enrolled over any particular period of time understates the *depth* of coverage over time. A lot of our members stay members over long periods of time. Not all, to be sure, but plenty do. They retire; change employers; move from place to place, but they very often stay with our orgs.

So what would be a better statistic for us to brag on? Summing up person-years goes some distance to address the issue, though that too undersells the longevity. Here are some candidate statistics we might use to give a better sense of what we've got:

  • total person/years
  • the median length of terminated enrollments (or maybe show percentiles on that distribution)
  • the proportion of terminations that are due to death.
  • average proportion of life covered (so--# covered days / (min(DOD, &sysdate) - DOB))
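That last one might be computed like so (dataset & variable names invented; assumes one record per person with dob, dod--missing if alive--and a precomputed count of covered days):

```sas
** Illustrative only--names are made up. ;
data props ;
  set work.people ;  ** one rec/person: dob, dod, covered_days ;
  ** Days of life observed so far--through death, or through today if alive. ;
  life_days = min(dod, "&sysdate."d) - dob ;
  if life_days > 0 then prop_covered = covered_days / life_days ;
run ;
proc means data = props mean ;
  var prop_covered ;
run ;
```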
What other statistics can you think of?  Leave a comment & make your case!