DocGraph: Open social doctor data

Fred Trotter

November 19, 2012

At Strata RX in October I announced the availability of DocGraph. This is the first project of NotOnly Development, which is a Not Only For Profit Health IT micro-incubator. [Editor’s note: as of 2016, the company is known as CareSet.]

The DocGraph dataset shows how doctors, hospitals, laboratories and other health care providers team together to treat Medicare patients. This data details how the health care system in the U.S. delivers care.

You can read about the basics of this data release, and you can read about my motivations for making the release. Most importantly, you can still participate in our efforts to crowdfund improvements to this dataset. We have already far surpassed our original $15,000 goal, but you can still get early and exclusive access to the data for a few more days. Once the crowdfunding has ended, the price will go up substantially.

This article will focus on this data from a technical perspective.

In a few days, the crowdfunding (hosted by Medstartr) will be over, and I will be delivering this social graph to all of the participants. We are offering a ransom license that we are calling “Open Source Eventually,” so participants in the crowdfunding will get exclusive access to the data for a full six months before the license to this dataset automatically converts to a Creative Commons license. The same data is available under a proprietary-friendly license for more money. For all of these “releases,” this article will be the go-to source for technical details about the specific contents of the file.

DocGraph is very likely the largest open and real-named Social Graph of any kind. There are almost 1 million entities that appear in the 2011 data. Each of these entities is either a specific person or organization that provides health care services to Medicare patients. Of course, the graphs found in the Facebook, Zynga, Twitter and LinkedIn datasets are far more expansive, but they are also the closely held property of those companies. When other organizations interact with those graphs, they are given access to only slivers of the whole dataset. If you are not an employee of one of the previously named companies, this is probably the largest named graph that you will have access to. (Let me know if you know of a real-named graph dataset that is bigger …)

This data is keyed using the National Provider Identifier (NPI). This is a universal identifier for doctors and hospitals and was mandated for the purpose of medical billing by HIPAA as a replacement for the UPIN system. (HIPAA did a lot of administrative things beyond patient privacy regulations.) Anyone who bills Medicare, or prescribes medication, must have an NPI number. The release and adoption of the NPI was an important component of health care reform in its own right, since the NPI was intended to ensure that doctors had only one identifier rather than maintaining a separate identifier for each insurance company that they billed. This number is basically the equivalent of a social security number for doctors. It would be fairly difficult for a health care provider to provide care without one, and as a result they are fairly ubiquitous. I have been working on NPI data for years, at the prompting of the folks from NPIdentify. It is a rich and messy dataset all on its own!

The core NPI database is already an open dataset. You can use the government’s NPI search tool, but it sucks. So I built a better one. If you are a health care provider, you can update your NPI record here. This is a good time to remind doctors that it is a bad idea to list your home address in the NPI data, because the entire NPI database is public information. You can download the core NPI release file here. For years the NPI database was updated on a monthly basis, but now it is updated weekly.

The DocGraph dataset is fairly large, with exactly 49,685,586 pairs of referring parties. Of course, even with this many links, the actual dataset that I will be providing is relatively small. The 2011 file is 1.3 GB, which includes about 1 million providers participating in the graph at least once. To provide context, there are about 3.7 million entries in the core NPI file.

I sometimes call the DocGraph dataset the “referral” dataset, and interactions that are traditionally understood as referral relationships make up the bulk of the data. But strictly speaking the data should be considered a “teaming” dataset, which shows when providers work on the same group of patients within the same time frame. I frequently refer to these teaming relationships as “referrals” because A. they usually are referrals, and B. this is easier to say than “participating in the same teaming coupling instance.”

Specifically, the data represents the number of times that two providers billed Medicare for the same patient within a sliding 30-day window, where at least 11 patients were involved in the transaction. If Provider A sees a patient on January 15, and Provider B sees the same patient on February 15, then that counts as “+1.”

In order to ensure that this dataset did not provide any data about patients, we enforce a minimum number of patients involved in a given teaming relationship. So for every entry in this dataset, at least 11 patients (which is a standard for CMS somewhere) are involved. This is intended to address the Elvis problem. Everyone knows who Elvis’ doctor is and everyone knows that Elvis is that doctor’s only patient. Therefore, if Elvis’ doctor is “referring” to a cardiologist in this dataset, then everyone would know that Elvis has heart problems. This problem goes away once you include a minimum number of patients in a transaction, so 11 patients is the floor. So we know that at least 11 patients took part in any given “referral count.” Because the patient data is both de-identified and aggregated, there should be no patient privacy concerns in this dataset. (Do let me know if you find evidence that I am wrong about this.)
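The patient floor can be sketched in a few lines of Python. This is only an illustration of the idea, not the process actually used to build the release: the record layout (first NPI, second NPI, patient identifier), the function name and the NPIs are all hypothetical.

```python
from collections import defaultdict

MIN_PATIENTS = 11  # the floor described in the article


def suppress_small_pairs(shared_visits, min_patients=MIN_PATIENTS):
    """Return the provider pairs that share at least `min_patients` patients.

    `shared_visits` is an iterable of (first_npi, second_npi, patient_id)
    tuples. This layout is an assumption made for the sketch.
    """
    patients_by_pair = defaultdict(set)
    for first_npi, second_npi, patient_id in shared_visits:
        patients_by_pair[(first_npi, second_npi)].add(patient_id)
    return {
        pair
        for pair, patients in patients_by_pair.items()
        if len(patients) >= min_patients
    }


# Elvis' doctor shares exactly one patient with the cardiologist: suppressed.
elvis = [("1000000001", "1000000002", "elvis")]
# A busier pair shares 11 distinct patients: released.
busy = [("1000000003", "1000000004", f"patient-{i}") for i in range(11)]

released = suppress_small_pairs(elvis + busy)
```

Only the busier pair survives, so nothing in the output points back to a single identifiable patient.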

To further protect patient privacy we cannot tell anything about the number of patients involved beyond the fact that there were at least 11 involved in a given provider-provider relationship. If the same patient sees Doctor A on January 15, Doctor B on February 15 and then again on June 15 and July 15, then that counts as “2” referrals in this dataset. When a referral relationship has a score of 1,100 we cannot know if this was 11 patients with 100 referral instances each, or 1,100 patients with 1 referral instance each, or 10 patients with 10 referral instances each and 1 patient with 1,000 referral instances. The whole point here is that we have a score that approximates the strength of the relationship between two entities in the NPI database, and for that purpose it does not really matter what kind of patient flow is being indicated.

Entries in the DocGraph dataset take the form


Here the first number is the NPI of the entity that saw the patient first in time, the second number is the NPI of the entity that saw the patient second in time, and the score is the number of times this happened within a 30-day period during the given year (2011).
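A minimal sketch of reading such entries in Python follows. It assumes three comma-separated fields per row and no header row. Both are assumptions, so check the delivered file before relying on them. The second NPI in the sample is made up.

```python
import csv
import io


def read_edges(lines):
    """Yield (first_npi, second_npi, score) from comma-separated rows.

    NPIs are kept as strings so that they are never treated as numbers.
    """
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        first_npi, second_npi, score = (field.strip() for field in row)
        yield first_npi, second_npi, int(score)


# Two made-up rows standing in for the real file.
sample = io.StringIO(
    "1112223334,5556667778,120\n"
    "5556667778,1112223334,30\n"
)
edges = list(read_edges(sample))
```

For the real file, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) instead of the `StringIO` object.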

I have uploaded the entire referral graph for the Methodist Hospital in Houston, Texas to Pastebin as an example of what you can find in the larger file.

Usually, patients go see a primary care doctor and then get referred to specialists. Usually this translates to the primary care doctor being seen first between the two doctors. But frequently a specialist will be seen and then a patient will return to a primary care provider. In fact, it might be possible to use the relative “directionality” of the graph to automatically guess which provider was the primary care provider and which was the specialist. For instance, given:



It might be reasonable to assume that 1112223334 was a primary care provider.
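One way to put a number on this directionality is to compare the scores in each direction. The sketch below is an illustrative heuristic rather than a validated classifier, and the scores in it are invented for the example.

```python
def outbound_share(scores, npi_a, npi_b):
    """Fraction of the teaming between two providers where `npi_a` came first.

    `scores` maps (first_npi, second_npi) to the score from the dataset.
    Returns None when the two providers never appear together.
    """
    forward = scores.get((npi_a, npi_b), 0)
    backward = scores.get((npi_b, npi_a), 0)
    total = forward + backward
    return forward / total if total else None


# Invented scores: 1112223334 usually sees the shared patients first.
scores = {
    ("1112223334", "5556667778"): 120,
    ("5556667778", "1112223334"): 30,
}
share = outbound_share(scores, "1112223334", "5556667778")
```

A share well above 0.5 suggests that the first provider is usually the starting point, which fits the primary care pattern. Where to draw the line is a judgment call.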

Doctor and organization types

Beyond just being a “wide” graph with lots of nodes with real-named entities, this dataset is incredibly deep. It is deep because the core NPI public release file contains a tremendous amount of detailed information, which is usually even right!

The first thing that the NPI file contains is at least one and possibly many different types of provider-type taxonomy. These provider types are coded in a provider ontology maintained by the American Medical Association’s National Uniform Claim Committee. (I would like to thank the members for performing this usually thankless task. Committee participation gives me a headache.) You can download this Health Care Provider Taxonomy, or you can browse it online.

The good news here is that the NPI database uses a provider-type taxonomy. However, there is little justification that this should be a “tree” style taxonomy. The assumption in a hierarchical taxonomy is that leaves can only have one “parent.” This means that a given “doctor type” can be listed under either the “cardiology” group or the “pediatrics” group, but not both. As a result there are lots of very arbitrary groupings for doctor types here. Doctors cannot find a sensible way to navigate this “tree” style taxonomy when they sign up. Since they have to choose something, they usually get at least one “type” correct, but often this database does not correctly represent the breadth of a given doctor’s actual specializations.

Still, it is a good assumption that the provider taxonomy field in the core NPI file is usually correct, and as a result, it is possible to distinguish effectively between hospitals, primary care doctors, specialist types, and laboratories in the referral dataset. In fact, one of the most frequent “referrals” in the data is the referral to get lab work done at LabCorp, Quest or one of the local lab providers. Referrals to hospital emergency departments (which are not referrals at all, of course) and treatment facilities like DaVita are also very common. I have uploaded the top 100 organizations by the number of entries that they have in the dataset to Pastebin so that you can clearly see the types of relationships that will be most common in the data.
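A ranking like that one can be approximated by counting how often each NPI appears on either side of an entry and then attaching a provider type. In the sketch below the NPIs, the edges and the NPI-to-type lookup are hypothetical. In practice the lookup would be built from the taxonomy codes in the core NPI file.

```python
from collections import Counter


def top_providers(edges, provider_type, n=100):
    """Rank NPIs by the number of entries they appear in, on either side.

    `edges` is an iterable of (first_npi, second_npi, score) tuples and
    `provider_type` maps an NPI to a label derived from its taxonomy code.
    """
    entry_counts = Counter()
    for first_npi, second_npi, _score in edges:
        entry_counts[first_npi] += 1
        entry_counts[second_npi] += 1
    return [
        (npi, provider_type.get(npi, "unknown"), count)
        for npi, count in entry_counts.most_common(n)
    ]


# Hypothetical lookup built from the provider taxonomy field.
provider_type = {
    "1000000001": "primary care",
    "1000000009": "laboratory",
}
# Hypothetical edges: three doctors send patients to the same lab.
edges = [
    ("1000000001", "1000000009", 50),
    ("1000000002", "1000000009", 40),
    ("1000000003", "1000000009", 25),
    ("1000000001", "1000000008", 30),
]
ranking = top_providers(edges, provider_type, n=2)
```

Here the lab ranks first because it appears in three entries, which mirrors how labs and hospitals dominate the top of the real list.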


The core NPI database also contains two addresses for each NPI record: the practice location address and a mailing address. I have run queries against the Open Street Map database, and about 80% of the addresses are already coded to latitude and longitude. For the other 20%, the zip codes can be used to detect a general location.

This means that it is going to be possible to run all kinds of geo-data queries against this dataset. There are all kinds of other geo databases that can be overlaid against this referral database in order to reach interesting conclusions. You could easily, for instance, study referrals to allergy doctors in relationship to geo-recorded air quality scores. Let me know if you make some pretty maps and I will try to give you a shout out on Twitter or on my blog.

Hospital data

Quality and performance data for individual doctors is pretty hard to come by. However, there has been an explosion in the availability of hospital data that details how hospitals perform on critical issues like readmission rates and central line infection rates. Frequently this quality data is coded natively using NPIs, and when the NPI is not directly available it is usually pretty simple to map the data to NPIs.

This hospital data will be part of what we are trying to merge together for our improved DocGraph project.

There are lots of interesting questions that you can ask regarding this dataset. For instance, using the DocGraph, you can determine which cardiologists are referring to hospitals with poor central line infection rates.
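A question like this comes down to a join between the graph and two lookups. In the sketch below, the set of cardiologist NPIs (from the taxonomy field), the set of flagged hospital NPIs (from a quality dataset), and the edges themselves are all hypothetical inputs. What counts as a “poor” infection rate is left to whoever builds the flagged set.

```python
def cardiologists_to_flagged_hospitals(edges, cardiologists, flagged_hospitals):
    """Total score from each cardiologist to hospitals in the flagged set.

    `edges` is an iterable of (first_npi, second_npi, score) tuples.
    Only entries where the cardiologist saw the patient first are counted.
    """
    totals = {}
    for first_npi, second_npi, score in edges:
        if first_npi in cardiologists and second_npi in flagged_hospitals:
            totals[first_npi] = totals.get(first_npi, 0) + score
    return totals


# Hypothetical inputs for illustration only.
cardiologists = {"1000000011", "1000000012"}
flagged_hospitals = {"1000000021"}
edges = [
    ("1000000011", "1000000021", 40),  # cardiologist -> flagged hospital
    ("1000000011", "1000000022", 90),  # cardiologist -> other hospital
    ("1000000013", "1000000021", 15),  # not a cardiologist
    ("1000000012", "1000000022", 60),  # cardiologist -> other hospital
]
totals = cardiologists_to_flagged_hospitals(edges, cardiologists, flagged_hospitals)
```

Because the data is billing data, a result like this is a starting point for questions and not a finding about any individual doctor.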

State level credentialing data

Every state has a state level medical board that releases data on individual doctors. This data usually includes what medical school a doctor attended, what board certifications they maintain, and any disciplinary actions that the state board has undertaken against this doctor. Unfortunately this data is expensive (between $50 and $500 per state) and rarely coded using an NPI.

This is the largest single disconnected data source from the current NPI database. Buying and normalizing this data is the first goal of our Next Level crowdfunding effort.

Using this data it would be simple to determine if attending the same medical school was an important part of how doctors refer. It will also be possible to determine if board certification has an impact on referral patterns.

Non-profit data

I have just discovered through BoingBoing that will be performing data extraction on its enormous cache of non-profit tax filings. Once it is possible to import this as a data source, it should be possible to figure out what the NPI for different non-profit hospital systems are. Once this is done, it would be possible to see how executive compensation works with the graph. This is not something we will be doing with our initial improvement project, but this is obviously where we would like to take this!


As with all rich datasets, this data is convoluted and can be confusing. Already we have seen patterns that do not make sense unless you understand how the data was built. This is based on administrative data and not clinical data. So this shows not how patients were “treated” together, but how they were “billed” together. There are several important artifacts to consider as the result of this.

First, individual providers frequently have two NPIs in the database. One for them as an individual and another for an organization that exists only for that individual. Seeing an NPI for “Dr. Smith” and another for “Dr. Smith, LLC” is not uncommon. As a result of this, frequently a given provider will not even appear in the DocGraph at all. Some digging is often required to determine how a given doctor is actually interacting with Medicare. Frequently, when an individual NPI and organizational NPI share an address, this means they are working as one unit.
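The shared-address heuristic can be sketched as a simple grouping. The NPI file does distinguish individuals from organizations with an entity type code (1 for individuals, 2 for organizations), but the record layout, the NPIs and the crude address normalization below are assumptions for illustration. Real addresses need much more careful cleaning.

```python
from collections import defaultdict


def normalize_address(address):
    """Crude normalization: uppercase, drop commas and periods, squeeze spaces."""
    cleaned = address.upper().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def pair_individuals_with_organizations(records):
    """Pair individual NPIs with organizational NPIs at the same address.

    `records` is an iterable of (npi, entity_type, address) tuples, where
    entity_type is "1" for an individual and "2" for an organization.
    """
    by_address = defaultdict(dict)
    for npi, entity_type, address in records:
        group = by_address[normalize_address(address)]
        group.setdefault(entity_type, []).append(npi)
    pairs = []
    for group in by_address.values():
        for individual in group.get("1", []):
            for organization in group.get("2", []):
                pairs.append((individual, organization))
    return pairs


# Hypothetical records: "Dr. Smith" and "Dr. Smith, LLC" share an office.
records = [
    ("1000000001", "1", "123 Main St., Suite 4"),
    ("1000000002", "2", "123 MAIN ST SUITE 4"),
    ("1000000003", "1", "9 Elm Rd"),
]
pairs = pair_individuals_with_organizations(records)
```

A large medical building will produce many false pairs, so treat the output as a list of candidates to check by hand.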

Sometimes, it is the organization that is hidden in the billing. We have already seen cases where a given primary care provider is referring to more than 20 different cardiologists. A little further digging shows that these cardiologists are all part of a cardiology service. Obviously the primary care doctor is referring to the cardiology service, and does not actually have relationships with the individual cardiologists. However, there is no way, using just the DocGraph, to tell the difference between a “service” where doctors bill as individuals and a group of unaffiliated doctors.

There are cases where cardiologists list themselves as cardiac surgeons in the NPI database and vice versa. Doctors frequently rely on administrative staff members to fill out the data in the NPI database, and they often get it wrong. The same problem affects the quality of addresses and countless other fields. The current NPI registration form attempts to normalize addresses, but the older forms did not. Keep in mind that this data is just as subject to user entry errors as anything else.

In many cities, a given hospital will win the contract for the main emergency room. This often makes it seem like every provider in the city has a “referral” relationship with that hospital. Of course, this is not strictly true. A good sign to look for: if an organization has a strong “referral” relationship with the local fire department, then it runs the local emergency room and therefore “teams” with every doctor in the city.

Because Medicare typically covers people over the age of 65, there is not very much information about doctors who exclusively treat children, or women who are having children. This means that there is a lack of data about pediatrics or ob/gyn doctors. There is also no information on doctors who do not take Medicare patients. Also there is no data in this dataset for providers who do not bill Medicare in a transactional fashion. Fully capitated plans like Kaiser Permanente will not have data in this dataset. If you put this data on a map, there should be a big hole over southern California as a result of this.

This dataset, like any messy data, should not be considered the “truth” but it can be used to help generate and reject hypotheses about how a given community delivers health care. We have lots of ideas about how to make the data more reliable and more accessible, but in order to do that, we need your help paying for improvements to the data. Even if you do not yourself want to have access to richer data, consider supporting us as we provide data for those who wish to have access to high quality versions of open doctor data. This dataset should deliver on the overall promise of open data: transparency improves performance.

There is now a DocGraph Google Group that you can join if you would like to ask specific questions about this dataset.

Fred Trotter

Fred shapes our software development and data gathering strategies, which doesn't stop him from getting elbow-deep in the code on a regular basis. He is co-author of the first Health IT O’Reilly book Hacking Healthcare, and co-creator of the DIRECT protocol mandated in Meaningful Use. Fred’s technical commentary and data journalism work has been featured in several online and print journals including Wired, Forbes, U.S. News, NPR, Government Health IT, and Modern Healthcare.
