Sunday, November 10, 2013

Loving R's ggplot2!

So I'm messing around with R for a Coursera course I'm taking, and totally loving the ggplot2 library.  Check out this lovely plot of some loan data we got from the course instructor:

 That's a scatterplot with loess smoother + confidence intervals, all done with this simple call:

library(ggplot2)

qplot(x     = FICO
    , y     = ir
    , data  = loansData
    , color = Loan.Length
    , geom  = c('point', 'smooth')
    , ylab  = "Interest Rate"
    , xlab  = "FICO Score"
    )

How awesome is that?

Saturday, September 21, 2013

Launchy + PowerShell = easy navigation between project folders

Like most programmers at GHRI, I work in a multitude of different directories.  Different projects store their programs & data in different folders, and there are numerous other folders that are important to my data infrastructure work.

When I'm called upon to navigate to these different folders I typically have to remember where they are & then 'cd' over to them (if I'm at a command line) or type into explorer's address bar one component at a time, waiting for the auto-complete (or attempting tab completion at the command line).  This can be cumbersome--especially when I'm not physically connected to the network.

At some point I decided to set some environment variables for myself so I could just type, e.g., %myproj% into the Run window or Explorer's address bar or an Open File dialog & be taken there instantly.  I found this very helpful--no more having to remember where things lived, just my nicknames for them.

Then after adopting PowerShell as my preferred command line & discovering functions, I created a parallel set of functions that just did a 'cd' into the proper directory.

Then my machine was repaved & upgraded to Windows 7, and I lost my environment variables.  Around the same time I read a lifehacker article on the Launchy utility & decided to try that.  So rather than set up the environment vars I just created a special folder called Shortcuts into which I put shortcut files pointing to the various folders, named after my nicknames for the projects.  I like Launchy quite a bit, but I did miss my environment variables for the odd Open File dialog.

So today I decided to delve into PowerShell scripting a little so that I could put all the information in a script, and have it generate:
  • The environment variables I missed,
  • the ps functions I wanted, and
  • the Launchy shortcuts I wanted.
Here's what I came up with--it seems to work pretty well.

$WinShell = New-Object -comObject WScript.Shell
$shrt_dir = "C:\Users\Roy\Desktop\shortcuts"
# nicknames and locations of my projects
$projects = @{"grif"  = "\\some_server\griffin\stupid name" ;
              "cupid" = "\\other_server\projects\cupid" ;
              "prod"  = "\\data_server\management\programs\"
             }

foreach($prj in $projects.GetEnumerator()) {
  # Create a shortcut named for the nickname that points to the dir in the value.
  $shrtfile = $shrt_dir + '\' + $prj.key + '.lnk'
  $shrt = $WinShell.CreateShortcut($shrtfile)
  $shrt.TargetPath        = $prj.value
  $shrt.WorkingDirectory  = $prj.value
  # Write the .lnk file to disk (nothing is created until Save is called).
  $shrt.Save()

  # Create an environment variable for each.
  [Environment]::SetEnvironmentVariable($prj.key, $prj.value, "User")

  # Create a function for each nickname
  $this_func = "function " + $prj.key + "() {Set-Location '" + $prj.value + "'}"
  Invoke-Expression $this_func
}

Next up I want to change my prompt function so that those project folders show up as e.g., ::cupid:: in the prompt rather than the whole long thing.  I'm already replacing $env:home with a '~', so I should just be able to loop through that $projects dictionary to make similar substitutions.

Tuesday, July 9, 2013

Principles of VDW development

Spitballing for a paper I'm (hopefully) helping to write.

The principles informing development of the VDW are:
  • The VDW exists to facilitate substantive health services, epidemiological, and health economics research, rather than advance the fields of medical informatics or computer science.  Thus we take a very pragmatic (and often not very sexy) approach to data sharing.  Results are always favored over methods.
  • Participation must not impair organizations' ability to protect patients/insureds as the human subjects of research, or abide by applicable laws and regulations (HIPAA, etc.).  Participation must also not impair organizations' ability to protect the interests of both local researchers and their parent organizations (i.e., the health care/insurance providers).
  • The data standards must be open and publicly available so that any interested organization with relevant data can implement and potentially collaborate with other implementing organizations.
  • The data standards represent a "floor" rather than a "ceiling"—that is, implementers are free to embellish or enhance data structures in ways that do not break compatibility with the base VDW specifications, in order to increase the value to local researchers.
  • No "least common denominator" specifications.  Participating organizations have varying levels of detail in their local data.  Rather than discard all but the lowest level of detail available across all implementing sites, the best data specifications accommodate detail and make it optional. For example, if some organizations record the precise day that an insured disenrolls, but others always pad this out to last day of the month, the best specification will allow both organizations to put their data in VDW form.
  • Participation can be partial—an organization can choose to implement 5 out of the 11 data areas for example.  Their attractiveness as a collaborator is thus diminished of course, but they are still participants in the VDW process.
  • Central coordination, but not authority.  Participation is voluntary by all organizations.  There is no central source of funding, and therefore no central authority directing the development of the VDW.  Decisionmaking is by rough consensus with the goal of serving the greatest number of existing and foreseeable projects, while not overtaxing resources available at the sites.
The problem w/the 3rd bullet there is that there's no central clearinghouse where a non-HMORN member org can advertise their implementation & hold themselves out as potential collaborators.

I might add:
  • The data is never going to be perfect.  Like all data collected for any reason, VDW data has problems, not all of which we know about.  Problems can be introduced both in the local operational business systems from which it originates, and in the work implementers do to transform it into VDW data.  We are committed to a transparent process by which problems are reported, prioritized, and fixed.  But users should not assume that the data is pristine and suitable for their intended use--caveat user, and do please report issues.
  • VDW is not the end-all/be-all of research data warehouses.  We are  eager to (strategically) expand the domain of projects for which VDW data alone are sufficient--where current or future projects stand to benefit.  But potential users should not assume that a given study can be carried out using nothing but VDW data.  Custom programming is sometimes necessary, and some projects can't be carried out at all.  Also, "my project needs new variable X" is not a reason for us to drop everything we're doing and implement variable X.
What am I forgetting?  What could be said better?

Wednesday, May 8, 2013

Managing the standard macros

Current State

GHRI maintains an FTP server, which we use to keep and maintain the VDW standard macros and several associated data files (e.g., the format data for %vdw_formats).
There are 2 main accounts for this server: one with read-only access (which is what we use in the FTP FILENAME statement) and one with read and write access (which is what I use for updating the macros).


Advantages of the current setup:
  • SAS users at every site can %include the FTP-hosted file directly into their programs, ensuring they have the most recent version of the code at all times.
  • Having the code in a single place ensures that all users benefit from additions and bugfixes.
  • FTP is easy to script & we have an easy method of pushing out updates (e.g., a ruby script moves the file).


Disadvantages of the current setup:
  • FTP is rightly viewed as an outdated and insecure protocol.
  • One VDW site disables FTP on one of their servers, requiring local staff to manually download (and then selectively alter) the standard macros program file.
  • FTP has made it a pain for end-users to just browse the code, which has probably hindered use.
On at least two occasions over the last 5 or so years we have tested http access at the various sites.  That worked at every site that tested it (I don't recall who all reported in the last time, though it was a bunch of sites)--including Group Health.  In that last round we also tested https access, which everybody could do except Group Health.  (Our problem is our proxy server, which requires NTLM authentication, which not many non-browser clients know how to pass.)

Impending Change

KP Northern California is to take over the domain, and will implement a new web presence for the HMORN based on its Alfresco system.
This raises the questions:
  1. Will KPNC be amenable to maintaining an FTP server as GHRI has lo these many years so that we can keep the status quo?
  2. Do we want to keep the status quo, or is it time to either move to http(s), or to another model of disseminating this code altogether?
  3. If http(s) do we want to keep the code on an domain or do we want to move things over to say, github?
The main candidate for "another model of disseminating this code altogether" is I think, the CESR convention of making static copies of macros which are included in the .zip file packages that represent CESR VDW application jobs.  Even in that case though I think we'll want to have some 'canonical' repository of the code, to which updates should be directed & from which new copies should be had.  I don't know how CESR is dealing with that currently.

If the answer to question 1 is 'no' and 3 is '', then I think either KPNC staff will have to take charge of the macros, or else provide Roy with a means of pushing updates to the file (preferably easily automated--something git or ssh-based would be my preference).

The case for github

I for one, would like to see us move the whole schmeer over to a public github repository, for the following reasons.
  1. It makes our macros capital-P Public, and I think many SAS programmers could benefit from them.
  2. It will raise the profile of the HMORN on what has become the de-facto standard for open source projects.  Note that we already have an 'organization' on github for the HMORN, which Christine Bredfeldt and I put together in order to share the PHI macros we wrote, and which Christine has published a paper on.
  3. It makes not just the macros as they exist now available, but the complete history of edits back to 2009 available.  All versions will be easily browsable.
  4. Github has a lovely collaboration model, and should make it easy for any minimally competent git user to update a macro and share it back with the group, either via pull request, or by direct commit (for people given commit permission).
  5. It is incredibly secure.  In particular:
All private data exchanged with GitHub is always transmitted over SSL (which is why your dashboard is served over HTTPS, for instance). All pushing and pulling of private data is done over SSH authenticated with keys, or over HTTPS using your GitHub username and password. The SSH login credentials used to push and pull can not be used to access a shell or the filesystem. All users are virtual (meaning they have no user account on our machines) and are access controlled through the peer reviewed, open source git-shell.
The one down-side is that would-be collaborators would  need to get comfortable with git before they could easily contribute.  But that's not much of a problem relative to the status quo--people would still be able to e-mail me (or Christine, or any other person who creates an account & is given commit permission) changes that we could commit and push.

Saturday, February 23, 2013

Public Key Cryptography: Encryption, Signatures and Certificates

Public Key Cryptography is a technology that forms the basis of a strong form of data encryption, digital signatures, and digital certificates. This post attempts to explain—in broad strokes—how these technologies work.

Key Pairs

The engine that drives Public Key Cryptography is the "key pair". A key pair is basically a set of two numbers, each of which can be used to encrypt data. These keys are used in the same way that passwords are used by the more common type of encryption—secret key encryption.
Key pairs have 2 properties that make them very useful.
  • Things encrypted with one key of the pair can only be decrypted by using the other key in the pair.
  • It is exceedingly difficult to derive one key from the other. Doing so requires solving a problem widely believed to be computationally hard (like for instance factoring the product of two very large prime numbers).
One naive encryption scenario involves splitting a key pair between two people who want to communicate privately. When Alice wants to send a private message to Bob, she encrypts it with her key, and Bob uses his to decrypt. When Bob wants to reply, he encrypts with his key and Alice uses hers to decrypt.
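To make the two properties concrete, here is a minimal sketch of a key pair built with textbook RSA. The tiny primes (61 and 53), the exponent 17, and the helper name apply_key are illustrative assumptions, not anything a real tool would use; real key pairs come from primes hundreds of digits long, and real software adds padding that this toy omits.

```python
# Toy RSA key pair built from two tiny primes. Illustration only:
# numbers this small can be factored instantly.
p, q = 61, 53
n = p * q                      # 3233, the modulus shared by both keys
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # one key of the pair
d = pow(e, -1, phi)            # the other key (modular inverse, Python 3.8+)

def apply_key(number, key):
    """Encrypt or decrypt: the same arithmetic, just with a different key."""
    return pow(number, key, n)

message = 65
locked_with_e = apply_key(message, e)
locked_with_d = apply_key(message, d)

# Whatever one key locks, the other key unlocks.
assert apply_key(locked_with_e, d) == message
assert apply_key(locked_with_d, e) == message
# Applying the same key a second time does not recover the message.
assert apply_key(locked_with_e, e) != message
```

In the naive scenario, Alice would hold e and Bob would hold d (or the other way around). Deriving d from e is only easy here because n is small enough to factor by hand.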

Figure 1: Naive Encryption Scenario

Public Key Encryption

Public Key Encryption takes this scenario one step further. In Public Key Encryption, the idea is that every participant has their own pair of keys. But rather than treat both keys as secret (as in the naive scenario above), only one of the keys is secret. The other is considered to be Public—its owner disseminates it far and wide. This is safe to do because of the second property of key pairs—just because someone knows your public key does not give them an advantage in guessing what your secret key is.

Figure 2: Key Generation & Exchange

There are special kinds of internet servers that do nothing but serve as open clearinghouses of public keys.

If someone wants to send you a private message, they get your public key (either from a keyserver, or directly from you in an e-mail) and use it to encrypt the message. You then use your secret key to decrypt the message. For a message back to them, you grab their public key and encrypt with that. So all the encrypting is done with the public keys, and all the decrypting is done with the corresponding private keys.

Figure 3: Secure Data Transfer with PKC

There's one major benefit to doing encryption this way—you never need to send anything secret (like a password) over an insecure channel. Your public key goes out to the world—it's not secret and it doesn't need to be. Your private key can stay snug and cozy on your personal computer, where you generated it—it never has to be e-mailed anywhere, or read over the phone, snailed on a floppy, etc.

Compare old-style, Secret Key encryption, where both people communicating need to know the same password—in that case you've got to come up with some way of getting that password to your friend in such a way that it can't be intercepted. That is often problematic.

But wait—there's more. Since things encrypted with one key of a pair can only be decrypted by the other key, the owner of a key pair can also encrypt a message with their private key. Given that in theory at least, everyone in the world has your public key, and that any message encrypted with your private key can be decrypted by your public key, this is not a recipe for private communication—anyone at all can decrypt the message. So why would you do that?

Digital Signatures

You would do that to authenticate the message—to prove that it came from you. Any message decryptable with a given public key is sure to have come from the person holding the corresponding private key. So to the extent you can trust that a given public key really belongs to a particular person, you can be sure that such a message really did come from that person. This is the basis of Digital Signatures.

Typically, since you're not going for privacy when you digitally sign a message, there will be a plain text version of the message, followed immediately by an encrypted version. To verify the signature the encrypted version is decrypted, and that decrypted text is compared to the plain text version. If they match, you can be sure the message is authentic, otherwise the message is probably bogus—it either came from some other person entirely, or it was altered after being signed.
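Here is a rough sketch of that sign-and-verify round trip, assuming the same kind of toy textbook-RSA key pair (modulus 3233, keys 17 and 2753) and the common shortcut of signing a hash of the message rather than the whole text. The function names are made up for illustration, and squeezing a SHA-256 hash into such a tiny modulus makes collisions easy, so treat this as a picture of the idea rather than usable cryptography.

```python
import hashlib

# Toy key pair (assumed values from a textbook RSA example).
n, public_key, private_key = 3233, 17, 2753

def digest(text):
    """Hash the message and squeeze it into the toy key's range."""
    full_hash = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(full_hash, "big") % n

def sign(text):
    """Signer: encrypt the hash with the private key."""
    return pow(digest(text), private_key, n)

def verify(text, signature):
    """Anyone: decrypt with the public key and compare to a fresh hash."""
    return pow(signature, public_key, n) == digest(text)

message = "Meet me at noon."
signature = sign(message)

print(verify(message, signature))                 # the genuine message
print(verify("Meet me at midnight.", signature))  # an altered message
print(verify(message, (signature + 1) % n))       # a tampered signature
```

The message itself travels in plain text; only the short signature is produced with the private key, which is why anyone holding the public key can check it.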

This is a huge benefit. All of a sudden, the formerly anonymous Internet can now support verifiable digital identities—so people and organizations who 'meet' on the internet can have a basis for trusting one another—they at least know who the other party is.

But consider that phrase above "to the extent you can trust that a given public key belongs to a particular person" . That's potentially a big problem for digital signatures. Key pairs are very easy to generate—free software for doing so is available on the internet. Just because a given public key has my name on it does not necessarily mean that I generated it. Furthermore, public keys are usually transmitted over insecure channels—like the internet—which excepting maybe public cell phone yelling is probably the least secure channel known to humankind. Thus, they are subject to hacking.

So how can you be assured that a given public key really truly belongs to Bob Friend, and not Emily Enemy?

There are 2 methods.
  1. Give Bob a call and ask him to read his public key to you over the phone. This is tedious and annoying. More importantly, it assumes that you know Bob personally, know his phone number, and will recognize his voice. In many situations (described below) this is unrealistic.
  2. If you have a friend in common with Bob (let's call him Tommy Trustworthy), whose public key you have and trust, and Tommy has a copy of Bob's public key, then you could get Tommy to digitally sign his copy of Bob's public key and send the signed copy to you. When it arrives, you verify Tommy's signature, and if it checks out, you're all set. In a sense, Tommy is vouching for Bob's identity.

The Web of Trust

This second method is the basis of something called "the web of trust". The idea amounts to a network of people vouching for each other. If everybody endorses the public keys of everybody they know personally (by digitally signing those keys & registering their signature with an internet keyserver) then eventually we will build up a web of trust associations, from which I should be able to trust your key. I may not know you directly, but I know Fred, who knows Mary, who knows Chris, who knows you. The more people you can get to sign your key, the larger the group of people who will be able to use this system to verify your public key.
It's sort of like a digital security version of Six Degrees of Kevin Bacon.
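Finding a chain of signatures is just a path search through a graph. Below is a small sketch that uses a breadth-first search; the signatures table is made-up data that follows the Fred/Mary/Chris example, and trust_path is a hypothetical helper, not part of any keyserver or gpg.

```python
from collections import deque

# Who has signed whose key: signer -> people whose keys they vouch for.
# Made-up data following the example in the text.
signatures = {
    "me":    ["Fred"],
    "Fred":  ["Mary"],
    "Mary":  ["Chris"],
    "Chris": ["you"],
    "Emily": ["Emily"],   # Emily Enemy has only signed her own key
}

def trust_path(start, target):
    """Breadth-first search for the shortest chain of signatures."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for signed in signatures.get(path[-1], []):
            if signed not in seen:
                seen.add(signed)
                queue.append(path + [signed])
    return None  # nobody I trust, directly or indirectly, vouches for target

print(trust_path("me", "you"))    # ['me', 'Fred', 'Mary', 'Chris', 'you']
print(trust_path("me", "Emily"))  # None
```

Every extra signature on your key adds an edge to this graph, which is why collecting signatures makes you reachable by more people.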

Figure 4: Kevin's worked with everyone

This is also tedious, but it's usually less tedious than the first option—especially for people you don't know personally. Technical people understand this, and will go to the trouble. At computer conferences you will find geeks congregating in "key signing parties" where everybody gets together, proves their identities to one another (driver's licenses, passports, etc.) and signs one another’s keys.

But non-technical people generally won't do this. Being able to digitally sign things is just not that important to them. Nobody expects them to be able to do this, and nobody they know has a public key to sign.

Enter the Certificate Authority.

Imagine that, instead of a network of lots of little trusting geeks, we have one, big, super Tommy Trustworthy, who is willing to verify the identities of anyone and everyone, and after doing so, sign their keys for them.

Imagine that this Super Tommy has incredibly rigorous methods of identification. Super Tommy flies out to your location, talks to your friends and associates, and only if everything smells right does he sign your key. It's worse than applying for a mortgage. But Super Tommy and his identification methods are so well known that no matter who you are—if Tommy trusts that you are who you say you are, then the rest of the world will too.

Now, Super Tommy doesn't do all this out of the goodness of his heart—you've got to pay him to go to all this trouble. But it's worth it in order to get the enormous numbers of people who trust Super Tommy to trust you too (and to avoid having to hang out with all those geeks at the key signing party).

Super Tommy is trading on his reputation.

Digital Certificates

The role Super Tommy is playing is called a Certificate Authority—an entity that is in the business of manually authenticating and certifying people's identities, and providing them with the digital credentials to signify it. Your signed public key is now a Certificate—it's a public key that's been endorsed by a Certificate Authority. You can present this certificate to anyone, and they can verify Super Tommy's signature on it, and trust that you are who the cert says you are. "If you're good with Super Tommy, you're good with me."

Why would anyone go to this trouble?

If you are a company wanting to do business on the internet, you can take your company's certificate and install it on a web server, and thereby identify the web server as belonging to you. That way when someone goes to your web site to buy something, they can be sure that the site they are interacting with is actually yours—not some hackers' in Belarus.

That can go a long way toward making people comfortable entering their credit card numbers into your web site. And furthermore, it's a key requirement (pun intended) of the Secure Sockets Layer (SSL) technology. When you visit a website that lives on a server that has a Certificate installed (and that's using secured HTTPS) you get that reassuring little padlock icon at the bottom of your browser screen.

Figure 5: SSL-signifying padlock icon

That padlock tells you 3 things.
  1. The site has presented your browser with a certificate.
  2. The certificate has been signed by someone (some Certificate Authority) your machine is set up to trust.
  3. The internet traffic in between your browser and the web server is being encrypted. (Both to and from the server).
To see the evidence of this, double-click the little padlock icon. You'll see a dialog box that lists the person or entity to whom the certificate was issued. That's who you sue if they do you wrong. The Certificate Authority who issued the cert can tell you where that person or entity can be found.

Figure 6: The Identity established by the Certificate

Figure 7: Certificate Authorities Short-circuit the Web of Trust; VeriSign = Super Tommy.

Notice how one-sided this is. You know which site you're really dealing with, but how does the site know that it is really dealing with you? You don't have one of these fancy certificates to present. It doesn't! And in fact, it doesn't care. If you can produce a valid credit card number, expiration date, and that new code from the signature bar on the back of the card, they can bill the account. From there, any identity disputes are between you and Visa.

But what if a site did care about the identity of the people who used it? If you went to the trouble of obtaining a certificate of your own (these are generally referred to as "client-side" certificates) then there is a process for installing that cert on a particular machine, and associating it with a particular user account. If you do that, your browser would present the certificate to the web server on request, and the web site would 'know' who it was dealing with.

For instance, CRN's Secure File Transfer website requires that users have a client-side certificate issued by Kaiser Northwest, so they can be sure that users aren't just sharing their usernames and passwords. You can share your username & password with someone else fairly easily, but it is much more difficult to share a certificate.


Public Key Cryptography is an important technology that forms the basis of several other important technologies. The engine that powers PKC is the "key pair"—a mathematically derived pair of numbers, each of which can decrypt messages encrypted with the other. Private communication can be ensured by using public keys to encrypt messages. Messages can also be digitally signed by including a private-key-encrypted version of the message along with the plain text version.

By linking public keys to real-world trust associations we can establish a "web of trust" through which digital messages can be tied to particular legal entities. Because of their good reputations for carefully authenticating PKC users, Certificate Authorities serve as large portions of the web of trust.

The Gnu Privacy Guard (GPG)

The Gnu Privacy Guard is free, open source software that implements the OpenPGP standard for Public Key Cryptography (among many others).  Note that PGP here is not meant to signify the proprietary, license-limited software package called PGP (published by the PGP corporation) but instead the open message format that grew out of “Pretty Good Privacy”, which anyone may implement.  GPG is available for Windows, Linux and many flavors of Unix.  You can download the latest version from the GnuPG web site.

GPG provides a command-line interface, meaning that on Windows, you use it at the c:\> prompt.  Here is a mini-tutorial on how to use gpg.  User-supplied information is printed in bold red text.

Creating a key pair

To create a key pair you call gpg with the “--gen-key” option, and respond to the prompts.

C:\> gpg --gen-key

gpg (GnuPG) 1.2.1; Copyright (C) 2002 Free Software Foundation, Inc.

Please select what kind of key you want:

(1) DSA and ElGamal (default)

(2) DSA (sign only)

(5) RSA (sign only)

Your selection? 1

DSA keypair will have 1024 bits.

What keysize do you want? (1024) 1024


Requested keysize is 1024 bits


You need a Passphrase to protect your secret key.


Enter Passphrase: Who put the bang in shebang shebang shebang?

Reenter Passphrase: Who put the bang in shebang shebang shebang?


You need a User-ID to identify your key; the software constructs the user id from Real Name, Comment and Email Address in this form:

"Heinrich Heine (Der Dichter)<>"


Real name: Roy Pardee

Email address:

Comment: Home

You selected this USER-ID:

"Roy Pardee (Home) <>"


Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O


We need to generate a lot of random bytes. It is a good idea to perform some other action (type on the keyboard, move the mouse, utilize the disks) during the prime generation; this gives the random number generator a better chance to gain enough entropy.






public and secret key created and signed.

key marked as ultimately trusted.


pub  1024D/22F2385B 2004-09-27 Roy Pardee (Home) <>

Key fingerprint = 6F13 46E8 4B5F FE96 F59D  A609 7CD9 F063 22F2 385B

sub  1024g/C887F092 2004-09-27

Your key pair is now stored in gpg’s keyring file. [5] Any time you need to access your secret key, you will need to supply the passphrase you’ve chosen—that’s what protects you against someone hacking into your computer, or otherwise gaining access to your keyring file, and using your secret key to impersonate you.

Exporting Your Public Key

To export your public key, type:

C:\> gpg --armor --export [e-mail address]

Where [e-mail address] is the address you specified when you generated the key.  Gpg will print your public key as text to the screen:


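-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.2.1 (MingW32)

[several lines of base64-encoded key data, omitted here]
-----END PGP PUBLIC KEY BLOCK-----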







You can either copy that off your console window, or else tell gpg to write it to a file with a command like:

C:\> gpg --armor --export [e-mail address] > mykey.txt

You can then attach mykey.txt to an e-mail, or copy/paste its contents into an e-mail message or otherwise send it around.

Importing Someone Else’s Public Key

The command to import a public key is just:

C:\> gpg --import [file name]

gpg: key 705D1FB9: "Amanda Hugankiss (Pretend person) <>" not changed

gpg: Total number processed: 1

gpg:              unchanged: 1

Where [file name] is the name of the file to which you’ve saved the key.

Encrypting A File

To encrypt a file using a public key you type:

C:\> gpg --encrypt-files -r [recipient’s e-mail address] [file name]

So for instance:

C:\> gpg --encrypt-files -r [Amanda’s e-mail address] StudyData.txt

Will result in the creation of a new file called StudyData.gpg. [6] You can then safely send this new file to your friend Amanda and she can decrypt it with her private key.

Decrypting A File

To decrypt a file you type:

C:\> gpg --decrypt-files [file name]

This operation requires the use of your secret key, and so gpg will prompt you for the passphrase you entered when you created it:

You need a passphrase to unlock the secret key for

user: "pardre1 <>"

1792-bit ELG-E key, ID E9FD6753, created 2004-05-26 (main key ID 3468AA7D)

Who put the bang in shebang shebang shebang?

gpg: encrypted with 1792-bit ELG-E key, ID E9FD6753, created 2004-05-26

"Roy Pardee (Home) <>"

Digitally Signing A File

To digitally sign a file you type:

C:\> gpg --sign [file name]

Here again you get prompted for your passphrase, and the output is written to a new file called [file name minus extension].gpg.

Verifying a Digital Signature

To verify a signed file, type

C:\> gpg --verify [file name]


gpg: Signature made 10/11/04 16:39:11  using DSA key ID 3468AA7D

gpg: Good signature from "Roy Pardee (Home) <>"

This will of course only work if you have imported the public key of the person who signed the file.

Visit the Gnu Privacy Guard web page for more details on using gpg.

[1] For example, the type of encryption you use when you password-protect a .zip file is ‘secret key’ encryption.

[2] This is actually a slight simplification of the process: as an expedient, most digital signature software will compute a hash of the message and then encrypt that.  The American Bar Association has a nice discussion of digital signatures.

[3] See the appendix, below.

[4] You might ask yourself how the traffic is encrypted in both directions—you haven’t sent anybody a public key for encrypting stuff sent to you, so how is the web server able to encrypt the pages it sends you in such a way that your browser can decrypt them for display?

The answer is that the HTTPS protocol doesn’t use public key cryptography for these exchanges.  It actually uses secret key cryptography, the kind where both sides of the transaction need to know a secret password.  The reason for this is that PKC is fairly computationally intensive.  If both browser client and web server had to PK-encrypt every bit of data that ran back and forth, the wait would be next to intolerable.  So to save on time, your browser randomly generates a secret password, PK-encrypts that with the public key from the web server’s certificate, and sends it to the server, which decrypts it with its private key.  From that point forward your browser and the web server use secret key encryption.
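A rough sketch of that hand-off, assuming the classic arrangement in which the browser picks the secret and wraps it with the public key from the server's certificate. Everything here is a stand-in: the toy RSA numbers (3233, 17, 2753), the keystream_xor helper, and the SHA-256-based keystream are illustrative assumptions and not what HTTPS actually uses on the wire.

```python
import hashlib
import secrets

# Server's toy RSA key pair; the public half would arrive in its certificate.
n, server_public, server_private = 3233, 17, 2753

def keystream_xor(data, secret):
    """Toy secret-key cipher: XOR the data with a SHA-256-derived keystream.
    Running it twice with the same secret gives back the original data."""
    out = bytearray()
    for i, byte in enumerate(data):
        if i % 32 == 0:
            # Derive a fresh 32-byte keystream block for every 32 bytes of data.
            block = hashlib.sha256(f"{secret}:{i // 32}".encode()).digest()
        out.append(byte ^ block[i % 32])
    return bytes(out)

# 1. Browser picks a random secret and PK-encrypts it for the server.
session_secret = secrets.randbelow(n - 2) + 2
wrapped = pow(session_secret, server_public, n)

# 2. Server recovers the secret with its private key.
unwrapped = pow(wrapped, server_private, n)

# 3. Both sides now use fast secret-key encryption in either direction.
page = b"<html>your order is confirmed</html>"
ciphertext = keystream_xor(page, unwrapped)            # server encrypts
recovered = keystream_xor(ciphertext, session_secret)  # browser decrypts
```

The expensive public key operation happens once, on a few bytes; everything after that uses the cheap shared-secret cipher.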

[5] You don’t normally need to directly manipulate this file, but on a Windows system, you will find it at c:\documents and settings\<<username>>\Application Data\GnuPG\secring.gpg

[6] Note that it’s advisable to rename this file to StudyData.txt.gpg.  The extra file extension gives a cue to the type of the file.  As it is, when Amanda goes to decrypt StudyData.gpg, she will wind up with a file called simply “StudyData”.  If you rename the file she will wind up with “StudyData.txt”