Category Archives: Data Visualization

The Mechanics of a Decision Tree

It has long been a goal of mine to learn more about machine learning techniques. Machine learning encapsulates a host of interesting solutions to common problems. If you work in tech and are interested in developing a deeper knowledge of computer science topics, statistics/probability, and general programming skills, then a machine learning side-project can be a fun way to achieve your goals.

One of the machine learning methods used in predictive statistics is the “learning decision tree”. A decision tree is used to predict an outcome from a set of observations based on the set of values present in each observation. A decision tree is much like the little pinball-style machine you may have seen on “The Price is Right”; however, while the path of the disk on that show is entirely random, the path formed by a decision tree is not random: it is formed through an application of statistical probability. A decision tree “learns” the path for each outcome variable from the set of data that it is built from.

In order to understand how this works, one needs to understand two concepts from information theory: entropy and information gain.

Entropy is essentially what it sounds like… randomness. When we apply the concept of entropy to information, or data, we are talking about the organization of the data. More specifically, entropy is a way of quantifying how uncertain we are about the value of a random variable; in other words, it measures how hard it is to correctly predict a given outcome in a set of data. A set of data has low entropy if the chance of correctly predicting the value of a random variable drawn from that set is particularly high.

For example, imagine that we have a set of 100 students and we are interested in selecting a student who majors in political science. Imagine that 50% of the students major in science and 50% of the students major in the humanities. Now imagine that 50% of the students who major in the humanities major in political science. The percentage of students who major in political science is therefore 25%. The chance that we will select a student who majors in political science from the total student body is pretty low (25%). The entropy of the entire set of students, therefore, is fairly high. However, if we limit the set of students to just the students who study the humanities, we can increase our chance of selecting a student who majors in political science significantly (to 50%). We thereby reduce the entropy of the overall set by limiting it to just the students who major in the humanities. If you were to repeat this process over and over again, you might begin to see how a decision tree eventually leads us to a prediction, or classification, of an observation, based on the different “splits” that occurred in order to reduce entropy.

The second concept to understand is information gain. If we think of entropy as a measure of how hard it is to predict the correct value of a random variable, then information gain is the increase in our ability to make that prediction following some type of action taken on the data. You can think of information gain as a quantification of the decrease in entropy caused by taking some action on the data. For example, when we reduced the set of students to just the students who majored in the humanities, we increased the odds that we would select a student who majored in political science. We therefore “gained information” by reducing the entropy of the set of data. We use the concept of information gain in a decision tree to select the “node”, or the attribute, that we use to make each split in the tree.
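To put rough numbers on this, we can treat “majors in political science” as a yes/no outcome and apply the binary entropy formula introduced later in this post (the figures below follow directly from the percentages in the example above):

Entropy(all students) = -(0.25)\log_2(0.25) - (0.75)\log_2(0.75) \approx 0.811
Entropy(science students) = 0 (from the percentages above, no science student majors in political science)
Entropy(humanities students) = -(0.5)\log_2(0.5) - (0.5)\log_2(0.5) = 1.0
Expected entropy after splitting on science vs. humanities = (0.5)(0) + (0.5)(1.0) = 0.5
Information gain from the split \approx 0.811 - 0.5 = 0.311

In other words, asking “science or humanities?” lowers the expected entropy of the set from roughly 0.811 to 0.5, and that drop of about 0.311 is exactly the kind of information gain a decision tree uses to choose its splits.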

To form a decision tree, these two concepts are recursively applied to the set of data. A decision tree recursively reduces the entropy in any given set of observations until it has eventually decreased the entropy to the point where one can be guaranteed to select the proper classification based on the values in the observation. This guarantee (i.e. perfectly “fitting” the data) leads to one of the key problems of decision trees, over-fitting, which can be addressed by reducing the specificity of the tree through a practice called pruning.

For more information on decision trees, particularly the algorithm that we will explore here (called ID3), you can refer to the following link:

http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture11.html

Written by Simon Colton of the University of London, this page does a great job of explaining the ID3 algorithm.

There are several existing implementations of decision tree algorithms; a quick search will turn up packages in R and in Python’s scikit-learn. While these implementations are of very high quality, using them does not necessarily teach us how a decision tree works internally. For example, the calculations used to determine entropy and information gain, as well as the selection of each node in the tree, are all hidden behind the scenes. To learn more about the algorithm, I wrote a Python class that allows me to iteratively build a decision tree. The class does not implement the recursive element of the algorithm (I hope to update it at a later point to include this functionality), however, it does allow us to think through the process of building a decision tree, including entropy and information gain.

I’ve transcribed the algorithm from Simon Colton’s website below:

1) Choose the attribute of the full set which has the highest information gain, and set this as the root node
2) For each value that the attribute can take, draw a branch from the root node
3) For each branch, calculate the set of observations for that branch (i.e. limiting the set to just the observations that satisfy the criterion of the branch)
3A) If the set is empty, choose the default category, which will be the majority of observations from the total set
3B) If the set only contains observations from a single category, then set a leaf node and make the value of that leaf node the category
3C) Otherwise, remove the attribute from the set of attributes that are candidates to be a new node. Then determine which attribute scores the highest on information gain with respect to the set calculated in step 3. The new node will then start the cycle again from step 2.

While a full-blown implementation of this algorithm would use recursion to fully build out the tree, I’ve opted to take a simpler approach and implement several methods that allow me to step through the algorithm in “interactive mode”, so to speak.
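Before turning to the interactive class, here is a minimal sketch of what the fully recursive version of the steps above might look like in Python. To be clear, this is not the code behind my class; the function names are my own, and the observations are assumed to be a list of dictionaries keyed by attribute name.

import math
from collections import Counter

def entropy(rows, target):
    # Entropy of the outcome variable over a list of observation dictionaries.
    total = float(len(rows))
    counts = Counter(row[target] for row in rows)
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())

def information_gain(rows, attribute, target):
    # Gain(S, A): entropy of the set minus the weighted entropy of each subset S_v.
    total = float(len(rows))
    weighted = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row for row in rows if row[attribute] == value]
        weighted += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - weighted

def build_tree(rows, attributes, target, default):
    # Recursively build an ID3-style tree as nested dicts: {attribute: {value: subtree-or-leaf}}.
    if not rows:                     # step 3A: empty set -> the default (majority) category
        return default
    outcomes = set(row[target] for row in rows)
    if len(outcomes) == 1:           # step 3B: the set is pure -> leaf node with that category
        return outcomes.pop()
    majority = Counter(row[target] for row in rows).most_common(1)[0][0]
    if not attributes:               # no candidate attributes left -> majority category
        return majority
    # Steps 1 and 3C: choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Step 2: draw one branch per observed value of the chosen attribute.
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target, default=majority)
    return tree

# Example top-level call, where rows is the full list of observations:
# tree = build_tree(rows, attributes, target='Outcome', default='MajorityCategory')

The nested-dictionary representation is just one convenient way to hold the tree; a real implementation would also need the pruning step mentioned earlier to avoid over-fitting.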

I’ve decided to use a dataset from UCI’s machine learning repository. The dataset contains data on 470 surgeries performed on patients suffering from lung cancer. The outcome variable is whether or not the patient died within one year of the surgery (‘Risk1Y’), and the predictor variables are various attributes of the patient and their illness, including age, diagnosis, tumor size, the outcomes of various pulmonary function tests, whether the patient smokes, whether they experienced a cough before surgery, and more. The full dataset can be found here:

https://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data

The first step in the algorithm is to calculate the entropy of the entire set of data with respect to the outcome variable (‘Risk1Y’). Since the outcome is a binary classification (the patient either died or lived), we can use the binary form of the entropy formula:

Entropy(S) = -p_{+}\log_2(p_{+}) - p_{-}\log_2(p_{-})

Where p_{+} is the proportion of outcomes for which the value is positive (i.e. the classification is true) and p_{-} is the proportion of outcomes for which the value is negative (i.e. the classification is false).

To do this using the python class:

learningTree = InteractiveDecisionTree("/Users/kelder86/decision_tree/thoracic_surgery.csv", "Risk1Y")
learningTree.setFieldNames(['Diagnosis', 'FVC', 'FEV1', 'PerformanceStatus', 'Pain', 'Haemoptyisis', 'Dyspnoea', 'Cough', 'Weakness', 'SizeOfTumor', 'Diabetes', 'MI', 'PAD', 'Smokes', 'Asthma', 'Age', 'Risk1Y'])
learningTree.loadFile()
learningTree.printExcludedFields()
learningTree.printSetConditions()
learningTree.setCurrentSet()
learningTree.calculateSetEntropy()
learningTree.printEntropy()

This gives us the following output:

Positive outcomes in given set: 70.0
Negative outcomes in given set: 400.0
Entropy of given set is: 0.607171654871

You can see that the formula for entropy in this case would be:

Entropy(S) = -(70/470)\log_2(70/470) - (400/470)\log_2(400/470) = 0.607171654871
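As a quick sanity check, the same number falls out of a few lines of plain Python (ordinary arithmetic, not a call into the class):

import math

p_pos, p_neg = 70 / 470.0, 400 / 470.0
print(-p_pos * math.log(p_pos, 2) - p_neg * math.log(p_neg, 2))  # roughly 0.607171654871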

Now we need to cycle through each one of the attributes, except for the outcome variable, and determine which one of the attributes will reduce entropy by the greatest amount. In other words, which one of the attributes gives us the greatest information gain.

To calculate information gain for a binary classification problem, we use the following calculation:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} (|S_{v}|/|S|)Entropy(S_{v})

To do this with the class, we do the following:

learningTree.getHighestGain()
learningTree.printMaxGain()

This produces the following output:

Attribute with maximum gain is: {'name': 'Diagnosis', 'gain': 0.024284175235364205}

Remember, at this point we are processing the entire set of data (all 470 records) and we are not filtering the data in any way. The output above suggests that the diagnosis field will reduce the entropy of the entire set of data by the greatest amount out of all of the available attributes in the file.

Let’s get a little more output to see how this works. Below is the additional output from the getHighestGain() method for the first attribute processed:

Total records in set: 470.0
Field is: Diagnosis
Total number with value DGN1: 1
Total number with value DGN1 and outcome F: 1
Entropy of the set with value: DGN1: 0.0
Total number with value DGN2: 52
Total number with value DGN2 and outcome F: 40
Total number with value DGN2 and outcome T: 12
Entropy of the set with value: DGN2: 0.779349837292
Total number with value DGN3: 349
Total number with value DGN3 and outcome F: 306
Total number with value DGN3 and outcome T: 43
Entropy of the set with value: DGN3: 0.538515706679
Total number with value DGN4: 47
Total number with value DGN4 and outcome F: 40
Total number with value DGN4 and outcome T: 7
Entropy of the set with value: DGN4: 0.607171654871
Total number with value DGN5: 15
Total number with value DGN5 and outcome F: 8
Total number with value DGN5 and outcome T: 7
Entropy of the set with value: DGN5: 0.996791631982
Total number with value DGN6: 4
Total number with value DGN6 and outcome F: 4
Entropy of the set with value: DGN6: 0.0
Total number with value DGN8: 2
Total number with value DGN8 and outcome F: 1
Total number with value DGN8 and outcome T: 1
Entropy of the set with value: DGN8: 1.0
Gain for this attribute: 0.0242841752354

We can see from this output (given the formula for gain above) that the information gain for this attribute works out to:

Gain(S, Diagnosis) = 0.607171654871 - (1/470)(0) - (52/470)(0.779349837292) - (349/470)(0.538515706679) - (47/470)(0.607171654871) - (15/470)(0.996791631982) - (4/470)(0) - (2/470)(1.0) = 0.0242841752354
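We can sanity-check this arithmetic as well, plugging the subset sizes and entropies from the output above into a few lines of plain Python:

# (count, entropy) pairs for DGN1 through DGN8, taken from the output above
subsets = [(1, 0.0), (52, 0.779349837292), (349, 0.538515706679),
           (47, 0.607171654871), (15, 0.996791631982), (4, 0.0), (2, 1.0)]
set_entropy, total = 0.607171654871, 470.0
gain = set_entropy - sum((count / total) * subset_entropy for count, subset_entropy in subsets)
print(gain)  # roughly 0.0242841752354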

Now we find the attribute with the maximum gain:

Attribute is: Diagnosis
Gain for this attribute: 0.0242841752354
Attribute is: FVC
Gain for this attribute: 0.0203096548951
Attribute is: FEV1
Gain for this attribute: 0.00305944453271
Attribute is: PerformanceStatus
Gain for this attribute: 0.00652051279898
Attribute is: Pain
Gain for this attribute: 0.00212629826488
Attribute is: Haemoptyisis
Gain for this attribute: 0.00289481096538
Attribute is: Dyspnoea
Gain for this attribute: 0.0067109933434
Attribute is: Cough
Gain for this attribute: 0.00603793299662
Attribute is: Weakness
Gain for this attribute: 0.00495240811225
Attribute is: SizeOfTumor
Gain for this attribute: 0.0209567260041
Attribute is: Diabetes
Gain for this attribute: 0.00720382164832
Attribute is: MI
Gain for this attribute: 0.000992338695742
Attribute is: PAD
Gain for this attribute: 0.000868693043524
Attribute is: Smokes
Gain for this attribute: 0.00600290579821
Attribute is: Asthma
Gain for this attribute: 0.000992338695742
Attribute is: Age
Gain for this attribute: 0.0036530835234
Attribute with maximum gain is: {'name': 'Diagnosis', 'gain': 0.024284175235364205}

Out of all of the available attributes, the maximum gain is for the Diagnosis variable. We therefore set this attribute as the root node and we get all of the possible values for this attribute.

learningTree.setNodeBranches('Diagnosis')
learningTree.printBranches()

['DGN8', 'DGN1', 'DGN3', 'DGN2', 'DGN5', 'DGN4', 'DGN6']

These are the possible values of diagnosis. Now in a fully recursive implementation, the process would then cycle through each one of these branches, setting a node for each branch, and performing all of the calculations until the tree is exhausted. In our case we are going to manually traverse just one more level down for explanation.

We first need to calculate the new set for the branch. We do this by filtering the total set down to just the observations that satisfy the branch we traversed, and by excluding the attribute we have already used from the set of candidate nodes. For example, to calculate S_{v} for the DGN3 branch:

learningTree.setSetConditions([{ 'name' : 'Diagnosis', 'value' : 'DGN3' }])
learningTree.setExcludedFields(['Risk1Y', 'Diagnosis'])
learningTree.processBranch('DGN3')
learningTree.printEntropy()
learningTree.printMaxGain()

The conditions will filter the set to only contain observations where ‘Diagnosis’ is equal to ‘DGN3’. As we traverse down the tree we continue to add these conditions to represent the path that we took down to the leaf node. In the case below, the next node following the ‘Diagnosis’ node down branch ‘DGN3’ is the ‘SizeOfTumor’ attribute.

Calculating nodes for branch: DGN3
Fields currently excluded as potential nodes: ['Risk1Y', 'Diagnosis']
Conditions placed on current set:
[{'name': 'Diagnosis', 'value': 'DGN3'}]
Attribute is: FVC
Gain for this attribute: 0.0178882581378
Attribute is: FEV1
Gain for this attribute: 0.00440246814813
Attribute is: PerformanceStatus
Gain for this attribute: 0.00889217478759
Attribute is: Pain
Gain for this attribute: 0.0119114583755
Attribute is: Haemoptyisis
Gain for this attribute: 0.00515530395553
Attribute is: Dyspnoea
Gain for this attribute: 0.00720521042675
Attribute is: Cough
Gain for this attribute: 0.00429029313854
Attribute is: Weakness
Gain for this attribute: 0.0032716018855
Attribute is: SizeOfTumor
Gain for this attribute: 0.0258570293306
Attribute is: Diabetes
Gain for this attribute: 0.00582417248134
Attribute is: MI
Gain for this attribute: 0.00109042213585
Attribute is: PAD
Gain for this attribute: 0.00273871968586
Attribute is: Smokes
Gain for this attribute: 0.00443986103654
Attribute is: Asthma
Gain for this attribute: 0.00109042213585
Attribute is: Age
Gain for this attribute: 0.0032465590173
Positive outcomes in given set: 43.0
Negative outcomes in given set: 306.0
Entropy of given set is: 0.538515706679
Attribute with maximum gain is: {'name': 'SizeOfTumor', 'gain': 0.025857029330573714}

I haven’t yet implemented the recursive elements, but to illustrate I’ve manually followed this process down two branches of the tree and visualized the result using D3. See the chart below for an example of how this tree might eventually start to construct itself:

I’ll fill the tree out once I can get the recursion working properly!

Hope you enjoyed this post! Until next time…

Visualizing Energy Consumption in Philadelphia

Today is the deadline for Azavea’s Open Data Philly Visualization Contest, which is near to my heart because it calls on developers to design a data visualization application for my home city of Philadelphia!

My coworker Weilin Meng and I have been working for the past month on our submission and we are excited and honored to reveal our project, an interactive data visualization that explores energy consumption in the City of Philadelphia.

About the Open Data Philly Visualization Contest

OpenDataPhilly is a portal that provides access to free datasets related to the Philadelphia region. It contains datasets on Philadelphia’s education system, police department, maps of the Philadelphia region, property values, and more.

Azavea is a company in Philadelphia that creates GIS-based decision-support software. Azavea has a socially-conscious focus and its mission is to empower and improve communities through its work. To further this mission, Azavea has re-designed OpenDataPhilly. To celebrate their unveiling of the new design, Azavea has called on developers and data enthusiasts around the world to submit data visualizations to their Open Data Philly Visualization Contest.

About our submission

For this competition, we produced an interactive data visualization that lets users explore energy consumption in the city of Philadelphia. The visualization consists of a map and a scatterplot, which display the energy consumption patterns of large commercial buildings in the Philadelphia region. The map allows users to view each building’s natural gas consumption, electricity consumption, energy use intensity (EUI), Energy Star score, and greenhouse gas emissions using visualizations developed in D3.js and the Google Maps API. The visualization can be used to detect poor and outstanding performers against energy benchmarks, or to understand energy consumption patterns in the City of Philadelphia.

What data sources does the visualization use?

The visualization uses the following two sources of data, which are available for public use through OpenDataPhilly.org:

Buildings
http://www.opendataphilly.org/opendata/resource/6/buildings/.
This dataset, produced by the GIS Services Group of the City of Philadelphia, encodes the shapes of buildings and building footprints.

Building Energy Benchmarking
http://www.opendataphilly.org/opendata/resource/258/building-energy-benchmarking/
Compiled by the City of Philadelphia Mayor’s Office of Sustainability, the energy benchmarking data is self-reported data on energy consumption metrics pertaining to large commercial properties in Philadelphia.

How did we do it?

We produced our submission by first programmatically encoding all of the addresses in the energy benchmarking dataset to longitude and latitude points using the Google Maps geocoding service. Once we had these points, we encoded the energy data as a GeoJSON file containing a collection of points with the energy data stored in each point’s properties. We then converted OpenDataPhilly’s “buildings” shapefile into GeoJSON and linked this file with the collection of points by looking up each point in the buildings GeoJSON using a “point in polygon” approach. This allowed us to associate the energy benchmarking data with the polygons encoded in the buildings GeoJSON file. Once we produced our final GeoJSON file, we were able to use the Google Maps Javascript API and D3.js to draw the resulting polygons, heatmap, and scatterplot, each of which made use of the energy data in coloring, sizing, and charting. All of the preliminary data processing work was performed in Node.js and Python, while the final web-based visualization was produced primarily using D3.js and the Google Maps Javascript API.
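Our pipeline used Node.js and the GeoJSON-JS-Utils “point in polygon” routine for the linking step, but the same idea is easy to sketch in Python with the shapely library. The file names and property handling below are illustrative placeholders rather than our actual code:

import json
from shapely.geometry import shape, Point

# Load the building footprints and the geocoded benchmarking points
# (hypothetical file names).
with open('buildings.geojson') as f:
    building_features = json.load(f)['features']
with open('energy_points.geojson') as f:
    point_features = json.load(f)['features']

# Pre-build shapely geometries so each footprint is only parsed once.
footprints = [(feature, shape(feature['geometry'])) for feature in building_features]

# For each energy point, find the footprint that contains it and copy the
# energy properties onto that building. This is a naive O(n * m) scan; a
# spatial index would speed it up considerably.
for point_feature in point_features:
    lon, lat = point_feature['geometry']['coordinates'][:2]
    point = Point(lon, lat)
    for building, polygon in footprints:
        if polygon.contains(point):
            building['properties'].update(point_feature['properties'])
            break

with open('buildings_with_energy.geojson', 'w') as f:
    json.dump({'type': 'FeatureCollection', 'features': building_features}, f)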

How can I view the visualization first-hand?

You can view the visualization here: http://www.kennethelder.com/energy-benchmarking/

Please be aware that the visualization will likely not function properly on older browser versions and it has not been tested on Internet Explorer.

What are some of the features of the visualization?

The heatmap is initialized with “Total Greenhouse Gas Emissions” selected upon first visiting the page:

map_initial_view

The map supports zooming and panning:

map_zoom_and_pan

Clicking the “Scatter Plot” button slides out a scatter plot pane from the right. Points in the scatterplot are sized by energy star rating (100 being the best score, 50 being the mean score, and 0 being the worst score) and are colored by property type:

map_w_scatter

Users can change the measure used to generate the heatmap and the scatterplot y-axis values, and the scatterplot supports tooltips on hover:

map_w_scatter_hover

Users can toggle the heatmap on and off and can continue to pan and zoom in the map overlay while the scatterplot is showing:

map_w_heat_map_off

What technology and tools does the project make use of?

To show our appreciation for those developing open source tools and technology, we’ve included brief descriptions and links to some of the tools we used below:

GDAL: A translator library for working with vector and raster geospatial data formats, which we used for its ogr2ogr command-line utility
Geojson-Merge: A Node.js package / utility developed by Mapbox, which provides a command-line interface for merging multiple GeoJSON files together
GeoJSON-JS-Utils: A Javascript module that provides some useful functions for manipulating and working with GeoJSON, which we used primarily for its “point in polygon” implementation
D3.js: A powerful data visualization library for generating data visualizations using Javascript, which we used to produce the scatterplot, render the polygons on the Google Maps overlay, and color the polygons by EUI
D3-tip: A great tooltip implementation for D3.js written by Justin Palmer (if you haven’t seen his amazing data visualizations of Portland, Oregon, then check them out!)
JQuery: The Javascript library that we all know and love…
ColorBrewer: A set of color schemes by Cynthia Brewer that can be used to color shapes by their values (our project used the CSS implementation)

Visualizing Provider Charges

CMS reimbursement forms a nexus between political, corporate, and economic influences. This intersection ultimately determines the prices of healthcare procedures for senior citizens, and, indirectly, for consumers in the private insurance market. There has been considerable attention paid to the notion of variability in pricing for procedures in CMS reimbursement. Recent news stories have highlighted differences in costs for identical procedures performed in both proximate and disparate geographic areas. In an effort to introduce transparency into CMS pricing and reimbursement strategies, and to shed light on variability in pricing and fraud, CMS recently released provider utilization and charges data and made an earnest call to developers to build data visualization applications that use the dataset.

In my free time I’ve been working on one such data visualization application. The application is built using basic front-end technology (JQuery and D3) and a MySQL database to store the provider charges and utilization data. My goal was to make it possible to dynamically query the underlying dataset and produce interesting, meaningful charts without having knowledge of SQL and databases. Users can select variables to filter the dataset by and can calculate a weighted average value for charges submitted to CMS, payments made by CMS, and the CMS allowed amount. These weighted average charges can be grouped by various values in order to produce bar charts for visualizing the CMS data.
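To give a flavor of what happens behind the scenes, here is a rough sketch of how the filtered, grouped, weighted-average query might be assembled. The table and column names are placeholders rather than my actual schema, and weighting by each row’s service count is one natural choice rather than necessarily the exact weighting the application uses:

# Sketch only: table/column names are placeholders, and in a real service the
# column names would be validated against a whitelist before being interpolated.
def build_chart_query(group_by, filters):
    # Return a parameterized SQL string plus its parameter list.
    where_clauses = ['1 = 1']
    params = []
    for column, value in filters:
        where_clauses.append('{0} = %s'.format(column))
        params.append(value)
    sql = (
        'SELECT {group_by}, '
        '  SUM(avg_submitted_charge * service_count) / SUM(service_count) AS wavg_submitted, '
        '  SUM(avg_payment_amount * service_count) / SUM(service_count) AS wavg_payment, '
        '  SUM(avg_allowed_amount * service_count) / SUM(service_count) AS wavg_allowed '
        'FROM provider_charges '
        'WHERE {where} '
        'GROUP BY {group_by}'
    ).format(group_by=group_by, where=' AND '.join(where_clauses))
    return sql, params

# For example: weighted average charges by provider type, limited to one state.
sql, params = build_chart_query('provider_type', [('provider_state', 'CA')])

The real application does this work in a PHP service queried over Ajax; the sketch above is only meant to show the general shape such a query might take.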

homescreen

Users can add filters through a modal window that allows them to select from the available underlying variables. Selected filters are highlighted and can be removed by unselecting them. The list is regenerated as variables are selected so that the application is responsive to the user’s actions.

addfilter

As filters are selected they are added to the filters control section to the right of the chart.

Users can remove filters either from the filters modal or by removing them from the filter control section to the right of the chart. Once the user has selected an appropriate set of filters that satisfies their interests, they can regenerate the chart and the chart will be dynamically built by querying a PHP service behind the scenes using Ajax.

For example, let’s say the user wants to view the average charges billed by the provider, average payment amounts, and average CMS allowed amounts for all procedures performed in California. Let’s also assume that they want to group the results by provider type. This will produce a fairly large chart, with one bar for each provider type.

Average charges by provider type in California.

This provides some fairly immediate observations about the underlying data. For example, the highest submitted charges come from ambulatory surgical centers. This is hardly surprising, as surgeries would undoubtedly be more expensive than routine procedures (such as evaluation and management visits or laboratory tests).

The user can also change the group by variable, allowing them to view the averages calculated over different buckets or subsets of the data. For example, the user might want to remove all filters and change the grouping variable to “state” in order to simply view the average charges, average payment, and average CMS allowed amounts by state.

changegroupby

state

Clicking “Regenerate Chart” fetches the data using the new criteria and rebuilds the chart. In this case we will have one bar per state and will be able to easily view the differences in total average charges and reimbursement across all CPT codes and procedures performed within each state.

bystate

If you look closely you may see some funny states (what the heck is ZZ??) but it turns out that these are just special classifications for additional procedures performed in areas that are covered but are not one of the standard 50 states. For example, the documentation for the provider utilization and charges file says the following:

‘XX’ = ‘Unknown’
‘AA’ = ‘Armed Forces Central/South America’
‘AE’ = ‘Armed Forces Europe’
‘AP’ = ‘Armed Forces Pacific’
‘AS’ = ‘American Samoa’
‘GU’ = ‘Guam’
‘MP’ = ‘North Mariana Islands’
‘PR’ = ‘Puerto Rico’
‘VI’ = ‘Virgin Islands’
‘ZZ’ = ‘Foreign Country’

While this chart doesn’t tell us much on its own, there are a few interesting nuggets that could be explored further. For example, the average payment amount is considerably higher in the ‘XX’ bucket, which is the ‘Unknown’ state. Why would this be? Could this potentially be a signal that there is some fraud occurring in charges submitted from ‘Unknown’ states?

It’s also interesting to note that the average submitted charges are so much higher for the ‘Armed Forces Europe’, ‘Armed Forces Pacific’, and ‘Foreign Country’ categories. Finally, this chart displays regional differences in CMS reimbursement, which is based on a variety of factors that are geographically dependent, including wages.

Finally, to illustrate another use, let’s filter the dataset down to include only chiropractors. Let’s not apply any other filters to the dataset and instead let’s group by HCPCS code.

chriopracthcpcs

This chart makes one thing very obvious. Chiropractors are charging and being paid much more for the procedure with HCPCS code 99205. Let’s list out what some of these procedures are:

99203: New patient visit with detailed history and exam, low degree of medical decision making, presenting with a moderately severe problem
99205: New patient visit with comprehensive history and exam, high degree of medical decision making, presenting with a moderate to highly severe problem

Notice the conspicuous absence of 99202 and 99204 codes. While 99202 is similar to 99203, it indicates less effort on the part of the physician, and therefore likely translates to lower reimbursement. The case is similar for 99204.

99202: New patient visit with an expanded, problem-focused history and exam, straightforward degree of medical decision making, presenting with a low-to-moderate problem.
99204: New patient visit with a comprehensive history and exam, moderate degree of medical decision making, presenting with a moderate to high severity problem

While this may be perfectly plausible, it could also be a sign of “upcoding,” which occurs when billing specialists assign a higher-value code to a procedure even though the work performed is more accurately captured by a lower-value code. While this chart is in no way proof or evidence of upcoding, it could possibly be an indicator that it is occurring, and the pattern could be explored further. If the chart were to indicate the presence of upcoding, this would be an example of how data visualization can help CMS accomplish its goal of identifying fraud and abuse.