# Getting Averages from the CMS Provider Utilization and Charges Data

I’ve mentioned that one of my goals is to find time to work on a side project that utilizes CMS’s open data to create a data visualization application.  I’m moving a bit slowly but, as they say, better late than never!

This post describes a method for getting accurate average values from CMS’s provider utilization and payment data. The key to calculating an accurate average is to use the weighted average formula to weight each provider’s observed values by the provider’s “line service count” for the procedure.

I’ve organized my project into six requirements.  Essentially, the requirements describe different charts and visualizations that a user might want to see when deciding on which hospital or physician to visit for a particular procedure.  Because of the structure of the CMS data, many of these requirements need to use the weighted average formula. In this post, I’ll work on the “Average Charges by Provider Type” requirement.

Here is how I’ve phrased that requirement:

“As a patient, I want to look up procedures and view charts of the potential costs of the procedures, as well as how much costs vary within certain segments. I want to filter the data by HCPCS code, state, city, zip code, payer type, and/or place of service. I also want to be able to group the results by the above attributes. I want to be able to select which measures to visualize.”

Here is a rough wireframe that displays a potential design for the user interface:

A wireframe for the Average Charges by Provider view / requirement

My initial thought was that I could just use an avg() aggregate query to get average submitted charges, average allowed amount, and average cost to patient grouped by provider type, state, zip code, and other attributes. For example, I thought that maybe something like the below query could work:

SELECT	PROVIDER_TYPE,
AVG(AVERAGE_ALLOWED_AMOUNT),
AVG(AVERAGE_SUBMITTED_CHARGE),
AVG(AVERAGE_PAYMENT_AMOUNT)
FROM UTILIZATION_AND_CHARGES
WHERE HCPCS_CODE = 93454
GROUP BY PROVIDER_TYPE;

This would work just fine if we knew that every provider represented the same proportion of total procedure performed for HCPCS_CODE 93454 (a cardiac catheterization procedure).  But this is not the case…  Some providers represent a greater proportion of the total number of cardiac catheterization procedures while other providers represent a smaller proportion.  We are taking an average of averages, and each average value is taken from a set of observations whose quantity is not consistent across all providers.

Since the number of observations for each provider within a provider type subset might vary we need to take a weighted average. The formula for a weighted average is simple and is shown below:

$w_i$ is the number of observations in the group (or the weight) and $a_i$ is the average value of the group.

So in our case, we need to find the weighted average of the allowed amount, submitted charge, and payment amount by provider type (or whatever GROUP BY variable we are using). We need to use each provider/procedure’s line service count as the weight. The actual value is the average for each provider and procedure (e.g. average allowed charges). Fortunately these values are already provided to us in the summarized data that has been provided by CMS.

To illustrate the structure of the CMS data, here are the first 10 records returned for HCPCS_CODE 93454:

FIRST_NAME LAST_NAME PROVIDER_TYPE LINE_SERVICE_COUNT AVERAGE_ALLOWED_AMOUNT
MARK STELLINGWORTH Cardiology 32.0 233.9396875000
JOHN FRANK Cardiology 42.0 219.4111904800
BRIAN CAMBI Cardiology 29.0 201.0451724100
LONNIE HARRISON Cardiology 20.0 155.7900000000
DANIEL DONOHUE Interventional Pain Management 14.0 263.5814285700
RICARDO CORDIDO Cardiology 22.0 168.6618181800
JAIME BENREY Cardiology 19.0 127.4600000000
JONATHAN BOLLES Internal Medicine 11.0 254.3900000000
HENOCK ZABHER Emergency Medicine 13.0 204.8707692300
RANDALL BREAM Cardiology 15.0 208.4993333300

So in this case, if we were to find a weighted average allowed amount among all cardiologists in just these first 10 records the equation would be:

So how do we write the query for this? It’s easy to get sidetracked and to start playing around with analytic functions. I myself went down this route at first using the excellent tutorials written by Quassnoi on Explain Extended. The exercise was interesting but the query turned out to be 73 lines long.

Sometimes all it takes is stepping back and rethinking the query. The following query yields the same results and performs just as well as my 73 line long query:

SELECT	PROVIDER_TYPE,
SUM(LINE_SERVICE_COUNT),
SUM(LINE_SERVICE_COUNT * AVERAGE_ALLOWED_AMOUNT) / SUM(LINE_SERVICE_COUNT) AS AVERAGE_ALLOWED_AMOUNT,
SUM(LINE_SERVICE_COUNT * AVERAGE_SUBMITTED_CHARGE) / SUM(LINE_SERVICE_COUNT) AS AVERAGE_SUBMITTED_CHARGE,
SUM(LINE_SERVICE_COUNT * AVERAGE_PAYMENT_AMOUNT) / SUM(LINE_SERVICE_COUNT) AS AVERAGE_PAYMENT_AMOUNT
FROM UTILIZATION_AND_CHARGES
WHERE HCPCS_CODE = 93454
GROUP BY PROVIDER_TYPE


Now that we have this query we can put it into a web service that will relay the results back to a calling object so that the results can be displayed in a bar chart. This query takes about 60 seconds to execute, however, so it might be worthwhile spending some upfront time pre-calculating these values for all of the HCPCS codes in the dataset so that the end user experience is better. We could even consider using a “big data” technology to pre-calculate these values. For example, a Hadoop scripting language like Pig could be useful for this purpose.

Until next time…