Implementing the Virtual Data Warehouse: May 2013

Current State

GHRI maintains an FTP server at vdw.hmoresearchnetwork.org, which we use to host the VDW standard macros and several associated data files (e.g., the format data for %vdw_formats).

There are two main accounts for this server: one with read-only access (which is what we use in the FTP FILENAME statement in stdvars.sas) and one with read and write access (which is what I use for updating the macros).
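
For readers who have not looked inside stdvars.sas, the read-only pattern looks roughly like the sketch below. Only the host name comes from the description above; the file name, directory, account name, and password are placeholders, not the real values.

```sas
/* Sketch only: file name, user, pass, and cd values are placeholders */
filename vdw_mac ftp "standard_macros.sas"
  host = "vdw.hmoresearchnetwork.org"
  user = "READ_ONLY_ACCOUNT"
  pass = "READ_ONLY_PASSWORD"
  cd   = "/" ;

/* Pull the current macro definitions into the running SAS session */
%include vdw_mac ;
```

Because the fileref is resolved when the %include runs, each job picks up whatever version of the file is on the server at that moment.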

Advantages

SAS users at every site can %include the FTP-hosted file directly into their programs, ensuring they have the most recent version of the code at all times.
Having the code in a single place ensures that all users benefit from additions and bugfixes.
FTP is easy to script, so we have an easy method of pushing out updates (e.g., a Ruby script moves the file).

Disadvantages

FTP is rightly viewed as an outdated and insecure protocol.
One VDW site disables FTP on one of their servers, requiring local staff to manually download (and then selectively alter) the standard macros program file.
FTP has made it a pain for end users to simply browse the code, which has probably hindered use.

On at least two occasions over the last five or so years we have tested http access at the various sites. It worked at every site that tested it (I don't recall exactly who reported in the last time, though it was a bunch of sites), including Group Health. In that last round we also tested https access, which everybody could do except Group Health. (Our problem is our proxy server, which requires NTLM authentication, and not many non-browser clients know how to supply that.)
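
On the SAS side, an http test of this kind amounts to swapping the FTP access method for the URL access method. A minimal sketch, assuming a hypothetical file location and a placeholder proxy address (neither is a real published address):

```sas
/* Hypothetical URL: no file is known to be published at this address */
filename vdw_mac url "http://vdw.hmoresearchnetwork.org/standard_macros.sas" ;
%include vdw_mac ;

/* Sites that must go through a proxy can name it with PROXY= */
/* (placeholder proxy address) */
filename vdw_mac url "http://vdw.hmoresearchnetwork.org/standard_macros.sas"
  proxy = "http://proxy.example.org:8080" ;
%include vdw_mac ;
```

The calling programs would not otherwise change, since they only refer to the fileref.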

Impending Change

KP Northern California is to take over the hmoresearchnetwork.org domain, and will implement a new web presence for the HMORN based on its Alfresco system.

This raises the questions:

1. Will KPNC be amenable to maintaining an FTP server as GHRI has lo these many years so that we can keep the status quo?
2. Do we want to keep the status quo, or is it time either to move to http(s) or to adopt another model of disseminating this code altogether?
3. If http(s), do we want to keep the code on an hmorn.org domain, or do we want to move things over to, say, github?

The main candidate for "another model of disseminating this code altogether" is, I think, the CESR convention of making static copies of macros, which are included in the .zip file packages that represent CESR VDW application jobs. Even in that case, though, I think we'll want to have some 'canonical' repository of the code, to which updates should be directed and from which new copies should be had. I don't know how CESR is dealing with that currently.

If the answer to question 1 is 'no' and the answer to question 3 is 'hmorn.org', then I think KPNC staff will either have to take charge of the macros or provide Roy with a means of pushing updates to the file (preferably easily automated--something git or ssh-based would be my preference).

The case for github

I, for one, would like to see us move the whole schmeer over to a public github repository, for the following reasons.

It makes our macros capital-P Public, and I think many SAS programmers could benefit from them.
It will raise the profile of the HMORN on what has become the de facto standard for open source projects. Note that we already have an 'organization' on github for the HMORN, which Christine Bredfeldt and I put together in order to share the PHI macros we wrote, and which Christine has published a paper on.
It makes available not just the macros as they exist now, but the complete history of edits back to 2009. All versions will be easily browsable.
Github has a lovely collaboration model, and should make it easy for any minimally competent git user to update a macro and share it back with the group, either via pull request or by direct commit (for people given commit permission).
It is very secure. In particular, GitHub's own description of its security says:

"All private data exchanged with GitHub is always transmitted over SSL (which is why your dashboard is served over HTTPS, for instance). All pushing and pulling of private data is done over SSH authenticated with keys, or over HTTPS using your GitHub username and password. The SSH login credentials used to push and pull can not be used to access a shell or the filesystem. All users are virtual (meaning they have no user account on our machines) and are access controlled through the peer reviewed, open source git-shell."

The one downside is that would-be collaborators would need to get comfortable with git before they could easily contribute. But that's not much of a problem relative to the status quo: people would still be able to e-mail changes to me (or Christine, or any other person who creates an account and is given commit permission), and we could commit and push them.

Wednesday, May 8, 2013

Managing the standard macros