Thursday, October 27, 2022

Provider specialty data--wrangling and using

In medical research data harvested from insurance/care data it's often important to know something about the providers that appear in the data. In particular we frequently want to know what the type of provider is--were they a:

  • Nurse Practitioner
  • Physician's Assistant
  • Social Worker
  • Technician
and so forth.  It's also frequently important to know what that person's specialty is--for example:
  • Transplant surgery
  • Pediatrics
  • Oncology
Trying to make this data available for research--particularly in hybrid organizations like HMOs, can be a serious challenge.  In general, we'd like to have a lookup table that lists every provider's type and specialty as it existed at the time of an encounter, say. Unfortunately, our sources don't often provide that information, and so we wind up with a pile of types and specialties for each provider that can be confusing to use.

Reasons a single provider might have more than one record in a provider type/specialty table

  1. They are (or were) truly qualified in multiple specialties
    1. simultaneously (say, internal medicine and oncology), or
    2. they re-trained over the course of their career (e.g., started out as a registered nurse, became a nurse-practitioner)
  2. We drew data for the same individual from multiple systems, and the information is inconsistent because:
    1. one or all of the systems are wrong about the type/specialty
    2. one system has a bad identifier & they're not really the same individual
  3. Some of the data is just bad (e.g., indicates additional specialties not actually merited).
Unfortunately our sources (National Provider Identifier Registry, our various Electronic Medical Record systems, our claims processing systems) don't often (ever?) give guidance as to the time progression of various types and specialties. Lacking that information, we are powerless to convey it to users.  So what can we do?

Document the scope of the problem


Looking just at our EMR data, a little more than 94% of providers have just a single listed specialty.  If that's typical, while probably not negligible, it's not a show-stopper either.


Provide Users All The Data We Have, warn them of the issue, and give whatever guidance we can


So if you happen to know from your analysis that the NPI registry data is particularly dodgy, say, let them know that in cases of ambiguities, that they should prefer records coming from sources other than NPI.

Similarly, it's often useful to look at other contextual clues on particular encounters--what procedures/diagnoses bear that provider's identifier? Social workers don't perform Radial Keratotamy surgery, right--so if that's what looks to be happening, one of those fields is lying.

Department (when it's available and not imputed from provider specialty) can also be an important clue.

For Heavy-Hitters, Analyze and Special-Case Particular Providers' Data


In our provider table I like to include off-spec fields that give the number of encounter & rx fill records that bear that provider identifier. That lets me optimize my bang-for-buck in terms of my precious time inspecting/analyzing/researching seeming ambiguities.  Providers that only appear on a handful of records do not merit the same attention that the ones with higher numbers do.

When we do find bad data, I am absolutely not above including lines like:

if provider = 'some_dude' and specialty = 'wrong specialty' then delete ;
in my code.

Maybe Someday--Do something AI-ish to flag inconsistencies or (dare I hope) try to construct time-periods


This should be possible to do in a centrally created Quality Assurance type program, but it would probably be a ton of work and is not feasible at the moment.