Thursday, July 1, 2010

There are bugs in the VDW datasets. All of them. Also: water is wet.

One of the signs of a mature software product is an explicit process for reporting bugs and issues. Once a bit of software gets beyond the scratching-your-own-itch phase, and other people start using it, you need at least two things for the project to flourish: documentation, and a process for reporting bugs. The VDW is very like a software product in this regard.

We've spent a bunch of time and energy discussing and generating documentation (we need to do a lot more documenting, but at least the need is widely recognized). I'm writing today to make the case for an explicit bug tracking process, and to suggest one.

Your data are ugly and your humans are fallible

The first hump for us to get over is owning that our VDW datasets--like every human-generated dataset too big to hand-scrutinize--do indeed have bugs in them. This can be a bit distressing to contemplate, particularly for people removed enough from the realities of legacy data & its wrangling that they are simply ignorant of the warts you run into when you sift through the stuff directly. Pregnant men. The deceased making clinic visits & filling prescriptions. Months where no one seems to have drug coverage. It's awfully tempting to put bad data into the "things that happen to other people" category, and convince yourself that the data you are depending on are holy-writ valid. That would be lovely, but it's not realistic. Even data expressly collected for the purpose of answering a specific research question--like, say, data prospectively generated for a randomized controlled trial--will frequently contain errors. So you can imagine that our indigenous data--collected by healthcare organizations for the purpose of providing good care and possibly to generate bills--are likewise going to contain errors. Humans are fallible, and the process by which data flows from clinic to research data warehouse is shot through with human touchpoints.
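
To make those warts concrete: many of them can be surfaced with very simple plausibility checks. Here is a minimal sketch (in Python, with made-up table and column names like mrn, adate, death_date and dx--none of this is taken from the actual VDW specs) of the kind of check that turns up the impossibilities above:

    # Illustrative only: the table layout, column names, and the diagnosis-code
    # prefix are assumptions, not part of any VDW specification.
    import pandas as pd

    def flag_impossible_records(demog, encounters):
        """Return encounter rows that conflict with demographics."""
        merged = encounters.merge(demog[["mrn", "death_date", "gender"]], on="mrn", how="left")
        # Visits dated after the person's recorded death.
        after_death = merged["death_date"].notna() & (merged["adate"] > merged["death_date"])
        # 'Pregnant men': pregnancy-related diagnosis codes on records flagged male.
        male_pregnancy = (merged["gender"] == "M") & merged["dx"].astype(str).str.startswith("V22")
        return merged[after_death | male_pregnancy]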

I don't mean to overstate the problems--our data are incredibly useful, and absolutely valid for a wide range of uses. And of course you've got to consider the available alternatives--it's not like non-VDW datasets are bug-free either. It's no use concluding "well, if VDW datasets have bugs, we should just up the budget on our project and do custom coding at all the sites." Those datasets will also have problems--they're built from the same warty source data, and are also wrangled by humans (custom coders are just as fallible as VDW data managers). You can maybe enjoy ignorance of particular bugs with custom-programmed datasets because those typically don't get the repeated scrutiny that VDW datasets get--you develop them, answer your questions, publish your papers & then throw them away. But just because you don't know the bugs are there does not mean they aren't. (Cue scary background music.)

But it's crucial for us to acknowledge to ourselves and to our users that there is some bad data in our VDW datasets. Because the users already know it (or they will soon discover it), and putting bugs "out on the table" will do four very important things:
  1. increase their respect for us and our product,
  2. enlist their help in identifying and fixing problems,
  3. reveal misunderstandings of the VDW and its specifications, and give us an opportunity to educate users / improve documentation, and
  4. focus the discussion of the problems in VDW data.
Right now, without an articulated process for reporting bugs, users and their managers who run into problems may not in fact report them--or not in a constructive way, anyway. Obviously, the people who need to know about bugs are the data managers at the relevant sites. But if the effort to make problems known to them is too great, users will be tempted into things like site-specific coding; ad-hoc workarounds (which will have to be repeated by future projects until the problem is finally fixed); and of course complaining to colleagues about how the VDW is not at all what it's cracked up to be. Some of that last complaint is absolutely justified--it's easy to oversell the VDW, and it's not always easy to distinguish VDW promotional/sales-type material from VDW documentation.

Those things are bad for us. It's inefficient to have different users discover and fix the same warts with workarounds over and over again. It's bad when people think the VDW will answer every possible research question, and waste time and money going after things we don't have. It's bad when there are back-channels of negative discussion of the VDW.

What is a 'bug'?

This is simple: a bug is anything that frustrates a user's expectations. Any time a user can make a statement of the form "hey, you said that if I did X I would get Y, but instead I got Z," that is a bug in the VDW process. Maybe it's more accurate to call these 'issues', since there is probably a significant class of them that arise from misunderstanding of the VDW generally, or of one of its specifications or standard methods specifically.

What should a good bug reporting process entail?

Here's what I think needs to happen when someone discovers a bug in a VDW dataset implementation (a minimal sketch of this lifecycle follows the list).
  1. The relevant SDM(s) need to be informed that a user believes there's a bug in their implementation.
  2. Those SDMs need to respond to the report. Do they agree it's a bug? Is it fixable, and if so, on what timeframe? Can they recommend a good workaround in the meantime?
  3. During the time between the bug report being accepted by the SDM and the SDM implementing a fix, prospective end-users should have a means for discovering the issue. To my way of thinking, the best place to note this is right on the dataset spec itself.
  4. Optimally, someone will ride herd on the bugs, nagging the relevant people to complete fixes & record them. (Having the reports be very visible should hopefully exert pressure on SDMs to get fixes done.)
  5. Once the fix is completed, that fact should be recorded, the reporting user should be informed, and the bug should come off the list of outstanding issues for that site/dataset.
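
To make that lifecycle concrete, here is a minimal sketch of how I imagine an issue record moving through those five steps. It's Python for illustration only--the status names, fields, and method are my assumptions, not an existing system:

    # Minimal sketch of the issue lifecycle described above; everything here is assumed.
    from dataclasses import dataclass
    from datetime import date
    from enum import Enum, auto
    from typing import List, Optional

    class IssueStatus(Enum):
        REPORTED = auto()      # step 1: a user files the report
        ACKNOWLEDGED = auto()  # step 2: the SDM agrees it's a bug
        DISPUTED = auto()      # step 2: the SDM disagrees (or it's a misunderstanding)
        FIXED = auto()         # step 5: the fix is in production
        WONT_FIX = auto()      # acknowledged, but not fixable

    @dataclass
    class VdwIssue:
        site: str
        dataset: str
        variables: List[str]
        description: str                        # "I expected Y, but got Z"
        reported_by: str
        status: IssueStatus = IssueStatus.REPORTED
        workaround: Optional[str] = None        # step 2: interim advice for users
        target_fix_date: Optional[date] = None  # step 2: promised timeframe, if any
        fixed_on: Optional[date] = None

        def mark_fixed(self, when: date) -> None:
            """Step 5: record the fix so the issue drops off the open list."""
            self.status = IssueStatus.FIXED
            self.fixed_on = when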

Proposed Process 1: First-Class

I have long believed that to really do this right, you need a purpose-built custom web application to hold both the authoritative VDW dataset specifications (optimally, all VDW educational/promotional material, really) and the bug/issue reports. This information is by its nature volatile. What was an open bug two weeks ago is a fixed issue today. What was a no-known-bug-having dataset implementation yesterday is a nasty-bug-having thing today. Users need to know these things in pretty much real time, and nobody wants to have their efforts fixing a bug go unrecognized. If I fix something, I don't want to hear "oh don't use GH data for X--they don't have good data on that".

A custom web application would do things like:

  • Track who submitted the bug and to what dataset(s)/variable(s) it is directed, and include a narrative description of just how the results differed from expectations.
  • Track the responses of the relevant SDM(s).
  • Attach currently-open (that is, accepted, fixable and not yet fixed) and unfixable issues to the relevant dataset specs, so that prospective users will know right away what they're getting themselves into.
  • Maintain by-dataset and by-site lists of outstanding, fixed and not-fixable issues so that sites can prioritize fixes.
  • Maintain by-site lists of fixed issues to document our accomplishments.
  • Provide a dashboard view of implementations by site, showing their extent in time and their relative 'health', as determined by the number of issues currently open against them.
I believe the prototype I put together back in February 2010 got pretty far toward these goals. I think we could execute on this vision (or at least get a good deal closer) given the right contractor (whom I would be delighted to help hire) without breaking the bank.
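
For a flavor of what the by-dataset and by-site views (and the dashboard's 'health' metric) might compute, here is a small sketch building on the VdwIssue record above. Again, this is illustrative Python, not a design for the actual application:

    # Illustrative dashboard/list helpers over a collection of VdwIssue records;
    # assumes the IssueStatus and VdwIssue sketch from earlier.
    from collections import Counter
    from typing import Dict, Iterable, List

    def open_issue_counts(issues: Iterable[VdwIssue]) -> Counter:
        """Dashboard 'health' metric: currently open issues per (site, dataset)."""
        open_statuses = {IssueStatus.REPORTED, IssueStatus.ACKNOWLEDGED}
        return Counter((i.site, i.dataset) for i in issues if i.status in open_statuses)

    def fixed_issues_by_site(issues: Iterable[VdwIssue]) -> Dict[str, List[VdwIssue]]:
        """By-site lists of completed fixes, to document our accomplishments."""
        out: Dict[str, List[VdwIssue]] = {}
        for i in issues:
            if i.status is IssueStatus.FIXED:
                out.setdefault(i.site, []).append(i)
        return out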

Proposed Process 2: What we can do in the meantime

So--maybe you agree with my vision above, maybe not. But regardless--we can't be with the one we love, so we've got to love the one we're with. And that one is the CRN Portal. Here's what I propose for a good-enough process for now.
  1. Report your bug by posting a comment to the relevant dataset spec page. Include information like the following (a hypothetical example is sketched just after this list):
    • What site(s)' implementation you're reporting the bug against
    • Exactly what you found (preferably including information on how the SDM can find the affected data--but NO PHI)
    • A description of how this was contrary to your expectations (if that's not obvious)
  2. Send an e-mail to Dan and Roy informing them of the new bug.
  3. Dan and Roy will then e-mail the relevant SDMs informing them of the bug report and asking them to respond.
  4. SDM(s) respond to the report comment on the data spec page, acknowledging or disputing the issue, asking questions, etc.
  5. When an SDM makes a fix, (s)he notes that in a new response to the comment.
  6. After some period of time, fixed-issue comments and their responses will be deleted.
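
To make step 1 concrete, here is a hypothetical example of the information such a comment might carry, written out as a Python dictionary purely for illustration; the site, dataset, variable, and finding are all made up:

    # Hypothetical example of a step-1 bug report; nothing here is a required format.
    example_report = {
        "sites": "Site A",
        "dataset_and_variables": "pharmacy / rxdate",
        "what_i_found": "Fills dated after the enrollee's recorded date of death.",
        "how_the_sdm_can_find_it_no_phi": "Join fills to demographics on MRN; compare fill date to death date.",
        "why_contrary_to_expectations": "The spec implies fills occur during the member's lifetime.",
    }
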
I believe the current website will accommodate this use, and I think it will improve the work we are able to do with the VDW.

Please comment!