Wednesday, September 8, 2010

1970 called--they want their programming language back

I've had a couple of conversations over the past few months in which people have questioned whether SAS is really still our best bet for the VDW. That is a question worth asking--SAS is not cheap, and has some significant shortcomings. The amount of disk space (and I/O work) required for long text fields can be close to heartbreaking, for example. The SQL optimizer can be maddeningly stupid (and it's pretty easy to write programs that have no hope of taking advantage of indexes, even if you have a decent indexing strategy). There's no persistent data-serving process to cache data across requests--unless you have some smarts & memory devoted to disk access (e.g., SAN/NAS type stuff), each SAS job does its own I/O tabula rasa.

I'm writing to defend SAS as a data-access technology for the VDW. I believe it is the one bit of software that we should require sites to have licensed in order to play in our VDW reindeer games. The only bit.

Some Distinctions

To start out I need to make a couple of distinctions. The first distinction is user location--is the user for a given VDW query local (from my site) or remote (from a sister site)? The second distinction is between the VDW's user interface--the parts of the system that an end-user actually touches/interacts with (by, e.g., pointing/clicking around a graphical interface, or writing commands, or speaking into a futuristic microphone of some kind ("diabetics please")), and the parts of the system that actually store and enable access to the VDW data. Call this second distinction front-end vs. back-end. The front-end is what defines what an end-user can do with the data.

Right now our user interface is of course SAS. Anything you can express in SAS using the files/variables we have, you can do with the VDW.

Location, Location, Location

Local and remote users have very different abilities and expectations. Local users are bound to learn the local data environment (VDW and non-VDW alike) including any tools supported at the site for accessing their site's data. They can pretty easily eyeball the data, and see when something looks goofy. They will also have higher expectations around performance--they will be the ones actually staring at the hourglass or spinning beachball or "Query executing..." status bar message while their queries run. (They also know exactly who to complain to when these things take up too much of their time.)

By contrast, remote users are at two distinct disadvantages. To begin with, they have to enlist the help of a local user to actually cause their program to run. (This is by design--the VDW was/is fearsome enough to site staff, who so feared a loss of control over their data that it was difficult even communicating what the VDW is. We figured trying to automate the local humans out of the process would be a non-starter.) That involves e-mail and maybe phone calls. The local humans then need to make sure the query is approved (frequently by other local humans), run it, evaluate whether it ran as expected, report any errors, evaluate whether the output is within the scope of the approval, figure out a good means for sending the output (is e-mail sufficient, or do we have PHI or other sensitive data requiring fancier measures?), and then actually exercise those means. The remote user in this scenario should not expect results back in minutes, the way a local user would. Lucky remote users will get results back in a matter of hours or days. Unluckier ones, well...longer. Remote users know this, and tend not to spend too much time staring at their in-boxes waiting for results to come in.

They are also at a disadvantage with respect to learning the local data environment. They don't know what your servers are named, what sorts of pipes exist between where their program runs & where the data actually live. They don't know what operating system you're running, or what server database, or that your DBA has a weakness for chocolate and works from home on Fridays.

No doubt you can see where I'm going with these distinctions. Here's why I don't see VDW moving away from SAS anytime soon.

As a user interface, it's incredibly expressive, and offers a lot of convenience tools that would be painful to reproduce or do without.

SAS has a decent SQL implementation, an excellent procedural language (data step) and even a capable macro facility (so you can write code that writes code).
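
To make "code that writes code" concrete, here's a minimal sketch of the kind of thing the macro facility buys you--a little frequency-count macro you can point at any dataset/variable pair (the dataset and variable names below are made up for illustration):

  %macro count_by(ds, var);
    proc sql;
      select &var, count(*) as n
      from &ds
      group by &var
      order by n desc;
    quit;
  %mend count_by;

  * One line per question, against whatever file you care about. ;
  %count_by(vdw.enrollment, enrollment_basis)
  %count_by(vdw.utilization, enctype)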

Imagine having to implement the functionality of one or all of these in say, Java:
  • TRANSPOSE
  • TABULATE or REPORT
  • UNIVARIATE
I'm no java programmer (my tastes run more to ruby and .net) but I bet you dollars to doughnuts it would be exceedingly unpleasant.
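
Just for flavor, here's roughly all it takes in SAS to flip a long, one-row-per-diagnosis file into one row per person with the diagnoses strung out across columns (the dataset and variable names are invented for the example):

  proc sort data=dx_long; by mrn; run;

  proc transpose data=dx_long out=dx_wide prefix=dx_;
    by mrn;
    var dx_code;
  run;

Now picture writing that from scratch in Java.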

Even MEANS is a significant tool--try figuring out percentiles in say, SQL. No fun.
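
In SAS it's a couple of lines (the file and variable names here are just illustrative):

  proc means data=vdw.pharmacy noprint;
    var rxsup;  * days supply ;
    output out=supply_pctiles p25=p25 p50=p50 p75=p75 p95=p95;
  run;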

Even if SAS isn't entirely unique in these abilities (is SPSS a contender here maybe? R?) it's awfully good, even at large data volumes.

Just because SAS is the front-end does not mean you have to use it on the back-end

SAS has for years supported a suite of ACCESS products, which offer binary client access to such diverse products as:
  • Oracle
  • Teradata
  • Netezza
  • Sybase
  • Informix
  • MySQL
  • OLEDB (which gets you MSSQL, and anything else anybody's written a driver for)
  • ODBC (see OLEDB)
And (as they say) many more. If you license the right ACCESS module you can store your data pretty much anyplace you like.
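
In practice, swapping back-ends is mostly a matter of swapping a LIBNAME statement--something along these lines (the connection details here are invented; check the options for your particular engine):

  * Plain old SAS datasets... ;
  libname vdw "\\some_server\vdw\data";

  * ...or the very same tables living in Oracle... ;
  libname vdw oracle user=vdw_reader password="XXXXXXXX" path=prod schema=vdw;

  * ...or in SQL Server (or whatever) by way of ODBC. ;
  libname vdw odbc dsn=vdw_sql schema=dbo;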

Using SAS as the front-end means your remote users don't have to care what your back-end is

Because SAS/ACCESS pretty completely abstracts the back-end away, nobody remote really has to know or care what your back-end is, so long as it's available and has the expected data. It looks like a SAS dataset to SAS--I can feed it to TABULATE. Unless performance is so horrible that run time becomes a significant contributor to my wait (and recall all the waiting I'm having to do on the humans in the system), however many squirrels have to run on their exercise wheels to get me the data is really not my concern.
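
So a remote program like this one runs unchanged whether vdw.utilization is a .sas7bdat file sitting on a share or a table off in Teradata somewhere:

  proc tabulate data=vdw.utilization;
    class enctype;
    table enctype, n;
  run;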

Now, will remote users get all the benefits of using, e.g., a server database? Alas, no. But they will get many of them--SAS/ACCESS is increasingly sophisticated about which bits of a query it pushes down to the db engine for execution.
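
For example, in a query like this one (made-up names again), SAS will generally hand the WHERE clause down to the database, so only the qualifying rows travel back over the wire--no special effort required on the programmer's part:

  proc sql;
    create table diabetes_dx as
    select mrn, adate, dx
    from vdw.dx
    where dx like '250%'
    ;
  quit;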

At least as important, you have the freedom to invest in whatever back-end technology you like and can support. That preserves a domain of site freedom that I think is absolutely crucial to keeping sites engaged and happy. Staff at the sites are really the only ones who know what their loads are, and what resources are available to them. Are Oracle DBAs hunted to extinction in your area (or just too damned expensive)? Go with Netezza. Licensing for Teradata too dear? Use MySQL. Does your parent org have a site license for DB2, and a ton of excess DB2 DBA capacity? Bob's your uncle.

Or maybe you're not feeling much pain with plain old SAS datasets--that too is an option of course.

So I say, let's communicate and share best practices of course (and we haven't done enough of that) and advise one another about what does/doesn't seem to work. But I don't see any gain to be had by invading the province of the SDMs and mandating some particular architecture. We've all got SAS, SAS is the interface. Apart from that, free to be you and me.

Just because your remote users use SAS to access your VDW data does not mean your local users have to.

As mentioned above, local users are the ones who suffer most from poor query performance (and they likely know where your desk is). If SAS' performance overhead is too costly for your local users, or they just prefer another tool, and you've taken the plunge into a server database, there is no reason not to offer those users direct access to that back-end (apart from cost, of course).

So--probably there's something I'm not thinking of, that makes some non-SAS language or environment more of a contender for VDW front-end. Tell me what it is in the comments.

Friday, September 3, 2010

Fundamentals

In which we ponder some very basic questions.

What are the goals of the VDW? Why does it exist?

The primary reason the VDW exists is to ease the burden of multi-site health-data-based research. By putting health data in a common format and specifying standard access methods, the VDW allows programming work done at one site to be used by the staff at most any other VDW-implementing site. The result is significant savings in cost and time spent on programming work.

In addition to that primary goal, the VDW has also had a couple of very nice side effects:
  • At many sites, because the VDW offers local programmers a comparatively simple interface to its data, they can be more productive even on local-only work.
  • It has enabled program code to be reused across sites as well as over time. This allows an economy of scale that should give us better code--because it's been used and its results scrutinized by many people on many different projects.

That first side effect has a nice side effect of its own: the additional scrutiny that local-only use brings results in additional vetting and defect detection. Sites whose local users use their VDW files almost certainly have better-quality files, because bugs are discovered (and, we presume, fixed) more quickly.

What sorts of things can be "in the VDW"?

In general, the VDW as a system consists of datasets and program code that manipulates those datasets. Datasets fall into two categories: those that hold substantive data (e.g., enrollment, utilization), and lookup or reference datasets (e.g., the datasets of chemotherapy-signifying codes that the CRN has put together).

Code is generally SAS programs--either macros that get incorporated into the standard macro library, or freestanding .sas programs that do a specific task.

What does it mean for a dataset to be "in the VDW"? What should it mean?

First and foremost, putting a dataset in the VDW means that we expect the dataset to support ongoing programs of research--not just individual projects. The VDW is infrastructure--at a level over projects.

While we hope and expect that nearly all projects can benefit from the VDW to some extent, and that some portion can even be accomplished exclusively with VDW data, we do not expect that every project will be accomplishable using just VDW data. We do not consider being the be-all and end-all of research data an attainable goal. Implementing sites should not discard their indigenous data (or programmers ;-).

More practically, having a dataset in the VDW means that some group of people (technical and scientific) want the dataset to exist badly enough to form a workgroup that will:
  • articulate the sorts of uses they imagine for the data,
  • hammer out a spec (and optimally full-on implementation guidelines),
  • answer questions from implementers (e.g., construe the spec), and
  • write QA program(s) and verify implementations;
and it means that sites are able and willing to implement, update, document, and support the files.

If we knew that a given dataset (or variable) was not getting used, would it make sense to drop it from the VDW?

That is certainly worth thinking about. I'm looking at you, Census.

What makes a good VDW spec?

A good VDW dataset specification is implementable. The best indication of this is that it has in fact been implemented--preferably at more than one site. You never really know what issues and questions you will run into until you actually go to implement. Many directions that seem clear and complete from your armchair raise knotty questions once you dig in to your indigenous data and see exactly what you have.

A good VDW spec is specific enough to give clear guidance to implementers as to how to wrangle their indigenous data into the spec, and to users as to what they should expect to find.

On the other hand, a good VDW spec is not unnecessarily specific--does not add requirements that do not serve the intended use of the data. There are many details of a dataset that do not really bear on its use. Variable labels and formats are easily in this category. I would argue that in most cases variable lengths are similarly of no import to end-users. In some cases (specifically, MRN, which is designed never to leave an implementing site) even variable type may not matter to users.

A good VDW spec preserves detail available in indigenous data without making it impossible for sites with less available detail to implement. So, for example, the old-timers among us will recall the time when the enrollment spec called for one record per member per month, and bore only the year and month. If a record existed, it meant that Joe Blow was enrolled for some portion of that month, and that was it.

Compare the new spec, which allows for pinpoint dates of enrollment & disenrollment. Now, sites that have that level of detail can include it, and those who only have month-level granularity can document that, and use first/last day of month in the enr_start and enr_end variables.

This can be a neat trick of course--there's a real tension between accommodating full detail and still giving a uniform interface across sites w/different levels of detail.
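
To make the coarse-grained end of that concrete: a site that only knows year and month could fill the new-spec variables with something like this minimal sketch (the source dataset and variable names are hypothetical, and a real implementation would also want to collapse contiguous months into single enrollment periods):

  data vdw.enrollment;
    set staging.enroll_months;  * one rec per member per enrolled month ;
    enr_start = mdy(enr_month, 1, enr_year);
    enr_end   = intnx('month', enr_start, 0, 'end');
    format enr_start enr_end mmddyy10.;
  run;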

A good VDW spec uses relevant industry standards as much as possible--for example CPT, LOINC, NDCs, etc. Coming up with our own codesets for these sorts of things is a burden that we should try to avoid wherever possible.

A good VDW spec follows a coherent set of conventions for things like variable names (CamelCase or under_scores?) and missing value handling. This is an area where we could use some work, frankly--we have not paid enough attention to it in the past.

What are your thoughts on these questions? Please comment!