DocGraph Hop Teaming is a dataset that shows how healthcare providers in the United States work together.

April 26, 2018

Because of improvements in the algorithm used to calculate the graph, as well as improvements in the underlying data that the graph is run on, the DocGraph Hop Teaming dataset is probably the most detailed and current publicly available picture of the patient-sharing relationships between providers in the US healthcare system. There are better maps available (we maintain one as a proprietary product inside CareSet Systems), but this is certainly the best data that anyone can download.

It is also the largest social graph dataset of any kind that is available with real person and business names included (again, the largest that we are aware of). That makes it good for doing graph analysis and for understanding how naturally occurring, preferentially attaching graph datasets work. (A graph dataset is data that is entirely encoded as nodes and the edges between the nodes.) Unlike the much bigger Twitter/Facebook/Google social graph datasets, anyone can see and work with all of the data, either for free or for a nominal cost (for the newest data). So it is a useful dataset for students and other academics who are interested in computational graph theory.

This tutorial is for the HOP Teaming dataset, the third generation of the datasets we have produced that detail how patients flow through the healthcare system. To read about the first two generations, please take a moment to read “RootGraph opening up referral data”.

The Problem with Previous Data

If you want to just understand what the structure of the data set is, you can skip this section.

In the article mentioned above, we learn that there are two different methods for tracking “doctor collaborations”. One is the “explicit referral” mechanism, in which doctor A lists on a given medical claim that doctor B referred the patient to him or her. For some claims, listing a referring provider is always required. For other medical claims, the referring provider is frequently missing. As a result, explicit referral data does not produce an accurate picture of provider collaborations.

The other approach is a “shared patients” approach, where we simply use the fact that doctor A and doctor B both submitted claims for the same patient Y as an indication that they are connected. This type of dataset is built by folding a bipartite graph, which includes both physicians and patients, into a unipartite graph of just providers. In a picture:

Note that when you create a dataset of provider connectivity like the one shown above, you do not have any notion of which provider was actually seen first. So you could see that Doctor A was sharing patients with Doctor B, but you had no evidence of whether A was referring to B or B was referring to A. However, the dataset is clean and simple, both to use and to implement, and this is why we created the Root NPI Graph dataset.
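The folding step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to build the Root NPI Graph: the function name `fold_shared_patients`, the provider labels and the patient identifiers are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def fold_shared_patients(claims):
    """Fold (provider, patient) pairs into undirected provider-provider edges.

    Returns a dict that maps a sorted provider pair to the number of
    distinct patients for whom both providers submitted claims.
    """
    providers_by_patient = defaultdict(set)
    for provider, patient in claims:
        providers_by_patient[patient].add(provider)

    shared = defaultdict(int)
    for providers in providers_by_patient.values():
        # Every pair of providers who saw this patient gains one shared patient.
        for a, b in combinations(sorted(providers), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Made-up claims, one (provider, patient) tuple per claim.
claims = [
    ("A", "p1"), ("B", "p1"),
    ("A", "p2"), ("B", "p2"), ("C", "p2"),
    ("C", "p3"),
]
edges = fold_shared_patients(claims)
print(edges)
```

Because the pair is stored in sorted order, the result carries no direction at all, which is exactly the limitation discussed in this section.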

The first, FOIA-based attempt to create a shared patient dataset tried to add a notion of directionality, so that you could see that doctor A typically came before doctor B in the referral data. It used a sliding window algorithm to accomplish this. There are problems with the sliding window approach:

  • The transaction data is impossible to interpret, especially when multiple providers are working with the same patient at approximately the same time (for example during hospital or SNF stays, or during long-term therapy).
  • The performance of the algorithm is poor and becomes exponentially worse as the number of NPIs and event dates increases.
  • The sequencing of the sliding window dataset requires multiple different versions of the dataset (30-day window, 60-day window, 120-day window, etc.) to interpret correctly.

Most of these problems are described carefully in the background of Root NPI Graph so I will not belabor them here.

The New Approach

Our goal with our “teaming” implied referral data has always had three parts:

  • Accurately reflect referrals and collaborations between physicians and other healthcare providers
  • Reflect the number of patients that they share in common
  • Accurately reflect the directionality of the referral relationships (i.e. who is typically seen first)

Our initial sliding window algorithm met all of these goals except the last one.

Basically, we needed a new algorithm that builds a graph of provider-to-provider transactions from a sequence of events and produces a more reasonable number for “shared transactions between providers”. We call the algorithm we came up with “HOP” because it only counts the last transactions. Comparing it with the original sliding window is much simpler with a diagram.

For the purposes of this diagram, please assume that you are seeing the 11th patient (of 11 patients) shared between providers A, B and C. Also assume that all of the other 10 patients saw these providers in precisely the same sequence and time period.

On the top of this diagram, you can see why the sliding window algorithm inflates connection counts when considering a sequence of providers. On the bottom, by limiting the counted “hops” to those immediately before (on a pair basis), we can better reflect the relative movement of patients between different providers.

Note that by adding an extra ‘B’ provider event at the end of the series, the sliding window for A->B would jump to 15, while the HOP A->B would simply move to ‘3’. It is easy to see how the new algorithm ensures that transaction counts do not artificially blossom.
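To make the difference concrete, here is a rough Python sketch of the two counting styles for a single patient. It is our simplified reading of the description above, not the production algorithm: the function names, the 30-day window and the event sequence are assumptions chosen so the counts are easy to check by hand, and the sequence is not taken from the diagram. The sketch assumes the events are already sorted by day, and its HOP variant ignores any time limit.

```python
from collections import Counter


def sliding_window_counts(events, window_days):
    """Count X->Y once for every earlier X event inside the window of each Y event."""
    counts = Counter()
    for j, (day_j, prov_j) in enumerate(events):
        for day_i, prov_i in events[:j]:
            if prov_i != prov_j and day_j - day_i <= window_days:
                counts[(prov_i, prov_j)] += 1
    return counts


def hop_counts(events):
    """Count X->Y at most once per Y event (only the most recent earlier X matters)."""
    counts = Counter()
    seen = set()  # providers this patient has already visited
    for _, prov in events:
        for earlier in seen:
            if earlier != prov:
                counts[(earlier, prov)] += 1
        seen.add(prov)
    return counts


# Hypothetical (day, provider) events for one patient: five A visits, then two B visits.
events = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "A"), (6, "B"), (7, "B")]
extended = events + [(8, "B")]  # one extra B event at the end

print(sliding_window_counts(events, 30)[("A", "B")])    # 10
print(hop_counts(events)[("A", "B")])                   # 2
print(sliding_window_counts(extended, 30)[("A", "B")])  # 15
print(hop_counts(extended)[("A", "B")])                 # 3
```

In this toy sequence, one extra B event adds five A->B transactions under the sliding window (one per earlier A event) but only one under the HOP-style count.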

The resulting graph is a directed dataset, which means that A->B is not automatically the same thing as B->A. This makes the data structure more like Twitter (where you can follow Lady Gaga without being followed back) than Facebook, where “friendship” must be two-way to exist at all (and where neither one of us is Lady Gaga’s friend). As always, the Wikipedia page on graph data is a good place to start learning about the implications of this data structure.
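As a small illustration of what “directed” means in practice, the sketch below stores A->B and B->A as separate entries and compares them. The helper name `direction_share`, the NPI values and the counts are all hypothetical.

```python
def direction_share(edges, a, b):
    """Return the fraction of a<->b transactions that flow in the a->b direction.

    `edges` maps (from_npi, to_npi) to a transaction count. Returns None
    when neither direction is present in the data.
    """
    forward = edges.get((a, b), 0)
    backward = edges.get((b, a), 0)
    total = forward + backward
    return forward / total if total else None


# Made-up directed edges: the two directions are stored separately.
edges = {
    ("1000000001", "1000000002"): 90,
    ("1000000002", "1000000001"): 30,
}
print(direction_share(edges, "1000000001", "1000000002"))  # 0.75
```

A missing direction does not prove that no patients moved that way, since a row can also be absent simply because it was left out of the release.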

If you are super-interested in a technical exposition of these types of issues, consider reading Properties of healthcare teaming networks as a function of network construction algorithms by Zand and crew. We found that analysis very helpful as we designed our improved graph-generation algorithm.


Most of the caveats below remain the same as they were in the very first release of this dataset. Some of these paragraphs were copied wholesale from the original ReadMe. Typically these caveats have to do with details of the way billing for third-party health insurance works in the United States, or with real quirks in how the healthcare system itself is organized.

Remember that this data does not reflect explicit referrals. If provider A and provider B are linked, there is no indication that they have any kind of formal partnership. To be specific, just because a given patient received care from Provider B after seeing Provider A, there is no indication that this occurred because Provider A told the patient to see Provider B (which is the definition of an explicit referral). Still, where partnerships do exist, they are frequently represented in this dataset.

The data does have directionality, but this directionality does not necessarily correspond directly to patient flow direction. Random claims patterns might mean that a “referee” looks like they are “sending” patients to a “referrer”. Considering provider type (also called taxonomy) is also a good way to infer the type of clinical relationship. No matter what the data appears to look like, you can assume that a laboratory is not “referring” to a doctor, for instance. Beware, however, that the taxonomies available from NPPES are not fully reliable.

This data was mined from CMS Medicare datasets. The Medicare program covers mostly people who are over the age of 65 in the United States. There are also a few people who have coverage because of a severe disability, or because they have end stage renal disease. Medicaid, another large program financed at the Federal level, is not considered in this dataset. There are several caveats that are important to remember given that the data is Medicare-only data.

Because most populations in Medicare are in the program because of their advanced age, many physicians who do not typically treat people over the age of 65 are not included. These include Obstetricians, Gynecologists and Pediatricians.

There are some providers in the United States (physician owned hospitals, for instance) that are not allowed to bill Medicare. Still others choose not to accept Medicare patients. Both of these issues limit how well the graph can be used to infer the structure of some parts of the healthcare delivery system.

Because Part D prescribing data is made available on a different schedule than Part A and B data, this dataset only includes data from Part A and B. This means that both pharmacy fill events and prescription events are excluded from the analysis.

Individual providers frequently have two NPIs in the database: one for them as an individual and another for an organization that exists only for that individual. Seeing an NPI for “Dr. Smith” and another for “Dr. Smith, LLC” is not uncommon. As a result, a given provider will frequently not appear in the DocGraph at all. Some digging is often required to determine how a given doctor is actually interacting with Medicare. Frequently, when an individual NPI and an organizational NPI share an address, this means they are working as a single unit.

Sometimes, it is the organization that is hidden in the billing. We have already seen cases where a given primary care provider is referring to more than 20 different cardiologists. A little further digging shows that these cardiologists are all part of a cardiology service. Obviously the primary care doctor is referring to the cardiology service, and does not actually have relationships with all of the individual cardiologists. However, there is no way, using just the DocGraph, to tell the difference between a “service” where doctors bill as individuals and a group of unaffiliated doctors.

There are cases where cardiologists list themselves as cardiac surgeons in the NPI database and vice versa. Doctors frequently rely on administrative staff members to fill out the data in the NPI database, and they often get it wrong. This issue affects the quality of addresses and countless other fields. The current NPI registration form attempts to normalize addresses, but the older forms did not. Keep in mind that this data is just as subject to user entry errors as anything else.

In many cities, especially smaller ones, a given hospital will host the main or only emergency room. This often makes it seem like every provider in the city has a “referral” relationship with that hospital. Of course, this is not strictly true. A good sign is a strong “referral” relationship with the local fire department: an organization that has one probably runs the local emergency room and therefore “teams” with every doctor in the city.

For many of these problems, CareSet Systems provides other services, or other datasets that can serve to improve the usability of this dataset. We also have lots of experience with interpreting the data correctly. Visit to find out more about our services and to find out more about additional datasets.

The Data Structure

The data has the following columns:

  • from_npi – The provider seen first in sequence, coded by NPI
  • to_npi – The provider seen second in sequence, coded by NPI
  • patient_count – The total number of patients shared between the two providers over the entire time period (the time period is typically one year)
  • transaction_count – The count of times that a patient switched between the two providers, in the from-to direction.
  • average_day_wait – The average number of days it took for a “HOP” to occur, that is, the time it took, in days, for a patient to switch to the second provider after having seen the first provider.
  • std_day_wait – The standard deviation of days it took for a HOP to occur.
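The sketch below shows one way to read rows with these columns into typed records using only the Python standard library. It assumes the file is comma-separated with a header row that uses the column names listed above, which you should verify against the file you actually download. The class name `HopEdge`, the function name `read_hop_edges` and the two sample rows are invented for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class HopEdge:
    from_npi: str  # NPIs are identifiers, so keep them as strings
    to_npi: str
    patient_count: int
    transaction_count: int
    average_day_wait: float
    std_day_wait: float


def read_hop_edges(handle):
    """Parse rows from a file-like object that has a header row."""
    reader = csv.DictReader(handle)
    return [
        HopEdge(
            from_npi=row["from_npi"],
            to_npi=row["to_npi"],
            patient_count=int(row["patient_count"]),
            transaction_count=int(row["transaction_count"]),
            average_day_wait=float(row["average_day_wait"]),
            std_day_wait=float(row["std_day_wait"]),
        )
        for row in reader
    ]


# Synthetic sample rows, not real data.
SAMPLE = """from_npi,to_npi,patient_count,transaction_count,average_day_wait,std_day_wait
1000000001,1000000002,25,40,12.5,8.25
1000000002,1000000001,14,14,30.0,21.0
"""

hop_edges = read_hop_edges(io.StringIO(SAMPLE))
print(hop_edges[0])
```

For a real file you would pass an open file handle instead of the `io.StringIO` wrapper.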

“NPI” stands for National Provider Identifier, which is a unique identifier assigned to a specific person or institution that bills for services in the United States. CMS maintains these identifiers and regularly releases information about which provider has which NPI through NPPES.

As per CMS privacy policies, provider pairs who saw fewer than 11 distinct patients together, in total, in the given time period are not included in the release. There will always be at least 11 transactions between the two providers, at least one for each patient.
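The two rules stated above can serve as a quick sanity check on any row you load. The following is a minimal sketch: the constant `MIN_PATIENTS` and the function `passes_release_rules` are hypothetical names, and the check encodes only what this paragraph states.

```python
MIN_PATIENTS = 11  # pairs with fewer distinct shared patients are withheld


def passes_release_rules(patient_count, transaction_count):
    """Return True if a row is consistent with the release rules described above."""
    has_enough_patients = patient_count >= MIN_PATIENTS
    # At least one transaction is expected for each shared patient.
    has_enough_transactions = transaction_count >= patient_count
    return has_enough_patients and has_enough_transactions


print(passes_release_rules(11, 11))  # True
print(passes_release_rules(10, 25))  # False
print(passes_release_rules(20, 15))  # False
```

A row that fails this check more likely points to a parsing problem (for example, swapped columns) than to a real provider pair.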


If you use the HOP teaming dataset, please reference the following

Ashish is co-founder of CareSet Systems. An entrepreneur and healthcare data transparency advocate, Ashish also founded the DocGraph Journal, bringing together the Healthcare Data Science community along with Politico and ProPublica to publish data sets for scientific advancement. Ashish is currently working to decode Medicare Claims data for Pharmaceutical companies, helping analyze provider teaming, and building robust netw
