Wednesday, September 8, 2010

1970 called--they want their programming language back

I've had a couple of conversations over the past few months in which people have questioned whether SAS is really still our best bet for the VDW. That is a question worth asking--SAS is not cheap, and has some significant shortcomings. The amount of disk space (and I/O work) required for long text fields can be close to heartbreaking, for example. The SQL optimizer can be maddeningly stupid (and it's pretty easy to write programs that have no hope of taking advantage of indexes, even if you have a decent indexing strategy). There's no persistant data-serving process to cache data across requests--unless you have some smarts & memory devoted to disk access (e.g., SAN/NAS type stuff) each SAS job does its own I/O tabula rasa.

I'm writing to defend SAS as a data-access technology for the VDW. I believe it is the one bit of software that we should require sites to have licensed in order to play in our VDW reindeer games. The only bit.

Some Distinctions

To start out I need to make a couple of distinctions. The first distinction is user location--is the user for a given VDW query local (from my site) or remote (from a sister site)? The second distinction is between the VDW's user interface--the parts of the system that an end-user actually touches/interacts with (by, e.g., pointing/clicking around a graphical interface, or writing commands, or speaking into a futuristic microphone of some kind ("diabetics please")), and the parts of the system that actually store and enable access to the VDW data. Call this second distinction front-end vs. back-end. The front-end is what defines what an end-user can do with the data.

Right now our user interface is of course SAS. Anything you can express in SAS using the files/variables we have, you can do with the VDW.
Location, Location, Location
Local and remote users have very different abilities and expectations. Local users are bound to learn the local data environment (VDW and non-VDW alike) including any tools supported at the site for accessing their site's data. They can pretty easily eyeball the data, and see when something looks goofy. They will also have higher expectations around performance--they will be the ones actually staring at the hourglass or spinning beachball or "Query executing..." status bar message while their queries run. (They also know exactly who to complain to when these things take up too much of their time.)

By contrast, remote users are at two distinct disadvantages. To begin with, they have to elicit the help of a local user to actually cause their program to run. (This is by design--the VDW was/is fearsome enough to site staff, who so feared a loss of control over their data that it was difficult even communicating what the VDW is. We figured trying to automate the local humans out of the process would be a non-starter.) That involves e-mail and maybe phone calls. The local humans then need to make sure the query is approved (frequently by other local humans), run it, evaluate whether it ran as expected, report any errors, evaluate whether the output is within the scope of the approval, figure out a good means for sending the output (is e-mail sufficient, or do we have PHI or other sensitive data requiring fancier measures?) and then actually exercise those means. The remote user in this scenario should not expect results back in minutes, the way a local user would. Lucky remote users will get results back in a matter of hours or days. Unluckier ones, well...longer. Remote users know this, and tend not to spend too much time staring at their in-boxes waiting for results to come in.

They are also at a disadvantage with respect to learning the local data environment. They don't know what your servers are named, what sorts of pipes exist between where their program runs & where the data actually live. They don't know what operating system you're running, or what server database, or that your DBA has a weakness for chocolate and works from home on Fridays.

No doubt you can see where I'm going with these distinctions. Here's why I don't see VDW moving away from SAS anytime soon.

As a user-interface, it's incredibly expressive, and offers a lot of convenience tools that would be painful to reproduce or do without.

SAS has a decent SQL implementation, an excellent procedural language (data step) and even a capable macro facility (so you can write code that writes code).

Imagine having to implement the functionality of one or all of these in say, Java:
  • TRANSPOSE
  • TABULATE or REPORT
  • UNIVARIATE
I'm no java programmer (my tastes run more to ruby and .net) but I bet you dollars to doughnuts it would be exceedingly unpleasant.

Even MEANS is a significant tool--try figuring out percentiles in say, SQL. No fun.

Even if SAS isn't entirely unique in these abilities (is SPSS a contender here maybe? R?) it's awfully good, even at large data volumes.

Just because SAS is the front-end does not mean you have to use it on the back-end

SAS has for years supported a suite of ACCESS products, which offer binary client access to such diverse products as:
  • Oracle
  • Teradata
  • Netezza
  • Sybase
  • Informix
  • MySQL
  • OLEDB (which gets you MSSQL, and anything else anybody's written a driver for)
  • ODBC (see OLEDB)
And (as they say) many more. If you license the right ACCESS module you can store your data pretty much anyplace you like.

Using SAS as the front-end means your remote users don't have to care what your back-end is

Because SAS/ACCESS pretty completely abstracts the back-end away, nobody remote really has to know or care what your back-end is, so long as it's available and has the expected data. It looks like a SAS dataset to SAS--I can feed it to TABULATE. Unless performance is so horrible that run time becomes a significant contributor to my wait (and recall all the waiting I'm having to do on the humans in the system) however many squirrels have to run on their exercise wheels to get me the data is really not my concern.

Now, will remote users get all the benefits of using e.g., a server database? Alas, no. But they will get many of them--SAS/ACCESS is increasingly sophisticated in which bits of queries it pushes to the db engine for execution.

At least as important, you have the freedom to invest in whatever back-end technology you like and can support. That preserves a domain of site freedom that I think is absolutely crucial to keeping sites engaged and happy. Staff at the sites are reallly the only ones who know what their loads are, and what resources are available to them. Are oracle DBAs hunted to extinction in your area (or just too damned expensive)? Go with Netezza. Licensing for Teradata too dear? Use MySQL. Does your parent org have a site license for DB2, and a ton of excess DB2 DBA capacity? Bob's your uncle.

Or maybe you're not feeling much pain with plain old SAS datasets--that too is an option of course.

So I say, let's communicate and share best practices of course (and we haven't done enough of that) and advise one another about what does/doesn't seem to work. But I don't see any gain to be had by invading the province of the SDMs and mandating some particular architecture. We've all got SAS, SAS is the interface. Apart from that, free to be you and me.

Just because your remote users use SAS to access your VDW data does not mean your local users have to.

As mentioned above, local users are the ones who suffer most from poor query performance (and they likely know where your desk is). If SAS' performance overhead is too costly to your local users, or they just prefer another tool, and you've taken the plunge into a server database, there is no reason not to offer these users access via that tool (apart from cost of course).

So--probably there's something I'm not thinking of, that makes some non-SAS language or environment more of a contender for VDW front-end. Tell me what it is in the comments.

No comments:

Post a Comment