An Electronic Medical Record (EMR) contains a mountain of information for each patient. An EMR contains at least the following:
In statistical terms, the data in an EMR is very high dimensional. An EMR user will require a lot of time to familiarize themselves with each patient. Presenting the information in a visual dashboard can aid in this process, but only so far. A graph is quicker to interpret than a number, but there are still many data points for each patient.
To quicken familiarization, it would be preferable to reduce the number of data points to display for each patient. Dimensionality reduction compresses the very high dimensional data into something more manageable. EMR data is ripe for compression; take for example the following pieces of information about John Doe:
It is not necessary to have all three of those data points to paint a quick picture of his overall health. Dimensionality reduction would recognize these three data points as highly redundant. However, all three data points may be needed when an encounter is focused on his cardiovascular health. Therefore, dimensionality reduction should be focused on quickly categorizing a patient's overall health.
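One way to make "highly redundant" concrete is to look at pairwise correlations. The following is a minimal sketch, not the method used for this analysis: the feature names, the values, and the 0.9 threshold are all invented for illustration and do not come from the PracticeFusion data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements for five patients (invented for illustration).
measurements = {
    "systolic_bp": [118, 125, 140, 155, 162],
    "diastolic_bp": [76, 80, 90, 98, 104],
    "bmi": [22.0, 24.5, 29.0, 31.5, 35.0],
}

def redundant_pairs(data, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

for a, b, r in redundant_pairs(measurements):
    print(f"{a} and {b} are highly correlated (r = {r})")
```

When several features move together like this, a single summary score can stand in for all of them in an overview, while the individual values remain available for a focused encounter.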
Clustering is a common way to reduce dimensionality, but it can be clumsy to discretize health status into a finite number of artificial buckets. An alternative focuses on encoding a health status into a few continuous scores, often called principal components (PCs). The PCs are placed in order, with the earlier PCs explaining more of the original data than the later ones. Choosing only the two most informative PCs allows for provocative visualizations. The following graph was made from the PracticeFusion EMR data using Multiple Correspondence Analysis (MCA).
Each point in the faint cloud of points represents a patient in the PracticeFusion EMR data. The X and Y axes represent the first two PCs from the MCA. The locations of the floating words represent the direction each patient is moved by each feature. All directions are relative, but comparing the location of a given patient to the nearby features is informative. PatientGuid 986C4DCC-… is highlighted as an example. Digging deeper into this patient's EMR reveals:
PatientGuid 986C4DCC-… is an extreme example, but he highlights the relationship between the words and the points.
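The relationship between the words and the points can be sketched with a small MCA on an indicator (one-hot) matrix. This is an illustrative Python sketch using only the standard library; the reference list suggests the original analysis was done in R, and the MCA variant used there may differ (for example, a Burt-matrix or adjusted analysis). The patient records, the function names, and the power-iteration eigen-solver are assumptions made for this example.

```python
from math import sqrt

def indicator_matrix(records):
    """One-hot encode categorical records into a 0/1 indicator matrix."""
    labels = sorted({f"{k}={v}" for rec in records for k, v in rec.items()})
    rows = []
    for rec in records:
        present = {f"{k}={v}" for k, v in rec.items()}
        rows.append([1.0 if label in present else 0.0 for label in labels])
    return labels, rows

def top_eigenpairs(matrix, k=2, iterations=500):
    """Leading k eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation (adequate for a toy example)."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary fixed start vector
        for _ in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return pairs

def mca(records, k=2):
    """Indicator-matrix MCA: returns principal inertias, patient (row)
    coordinates, and feature (column) coordinates on the first k axes."""
    labels, z = indicator_matrix(records)
    total = sum(map(sum, z))
    row_mass = [sum(row) / total for row in z]
    col_mass = [sum(col) / total for col in zip(*z)]
    # Standardized residuals from the independence model.
    s = [[(x / total - r * c) / sqrt(r * c) for x, c in zip(row, col_mass)]
         for row, r in zip(z, row_mass)]
    s_t = list(zip(*s))
    cross = [[sum(p * q for p, q in zip(ci, cj)) for cj in s_t] for ci in s_t]
    pairs = top_eigenpairs(cross, k)
    inertias = [lam for lam, _ in pairs]
    row_coords = [[sum(x * y for x, y in zip(row, v)) / sqrt(r)
                   for _, v in pairs] for row, r in zip(s, row_mass)]
    col_coords = {label: [v[j] * sqrt(max(lam, 0.0)) / sqrt(col_mass[j])
                          for lam, v in pairs]
                  for j, label in enumerate(labels)}
    return inertias, row_coords, col_coords

# Hypothetical patients (invented for illustration, not PracticeFusion data).
patients = [
    {"gender": "F", "smoker": "no", "age": "young"},
    {"gender": "F", "smoker": "no", "age": "old"},
    {"gender": "F", "smoker": "no", "age": "young"},
    {"gender": "M", "smoker": "yes", "age": "old"},
    {"gender": "M", "smoker": "yes", "age": "young"},
    {"gender": "M", "smoker": "no", "age": "old"},
    {"gender": "F", "smoker": "yes", "age": "old"},
    {"gender": "M", "smoker": "yes", "age": "old"},
]

inertias, row_coords, col_coords = mca(patients)
print("principal inertias:", [round(x, 3) for x in inertias])
for label, (pc1, pc2) in col_coords.items():
    print(f"{label:12s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

In this formulation, a patient's coordinate on each axis equals the average of the coordinates of that patient's features, divided by the square root of the axis's principal inertia. That is the sense in which each word pulls a patient in its direction, and why a patient lands near the features that describe them.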
Focusing on just the words highlights relationships between the features in an EMR. Words that appear near each other in the tornado represent features commonly seen together in an EMR. Continuous features, such as age, have been expanded into a series of overlapping basis functions and then linked back together with a faint line. Age is displayed as a gentle curve with the tips curled towards being female. This is likely due to young men interacting less with the health care system and to higher mortality among elderly males than among elderly females.
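The expansion of a continuous feature into overlapping basis functions can be sketched as follows. The source does not say which basis was used, so this example assumes triangular ("hat") functions centred on a set of knots; the knot locations and the function name are hypothetical.

```python
def hat_basis(value, knots):
    """Weights of overlapping triangular basis functions centred on each knot.

    Each value gets non-zero weight on at most two neighbouring knots, and
    the weights always sum to 1. Values outside the knot range are clamped
    to the nearest end knot.
    """
    weights = [0.0] * len(knots)
    if value <= knots[0]:
        weights[0] = 1.0
    elif value >= knots[-1]:
        weights[-1] = 1.0
    else:
        for i in range(len(knots) - 1):
            lo, hi = knots[i], knots[i + 1]
            if lo <= value <= hi:
                frac = (value - lo) / (hi - lo)
                weights[i] = 1.0 - frac
                weights[i + 1] = frac
                break
    return weights

# Hypothetical knots in years (an assumption, not taken from the analysis).
AGE_KNOTS = [0, 20, 40, 60, 80]

for age in (5, 30, 47, 95):
    print(age, [round(w, 2) for w in hat_basis(age, AGE_KNOTS)])
```

Under this assumption, each knot becomes one column in the analysis, so a 30-year-old contributes half a count to the "age 20" column and half to the "age 40" column. Plotting the knot columns and joining them in order would give a curve like the one described for age.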
Pregnancy is a very telling feature that has obvious female connotations. More interesting is the association of healthy vital signs (lower BMI and Blood Pressure) with pregnancy. The causality is unknown, but there could be many links:
Younger people are associated with acute diagnoses like infections and injuries. Elderly people are associated with chronic diagnoses like cancer and circulatory disease. Elderly people are also associated with a Non-Smoking lifestyle.
Even high dimensional EMRs cannot contain all there is to know about a patient. An important source of information they are often missing is a patient's socioeconomic status. The Data.gov website provides a nice link to the IRS Tax Statistics. Average income values were extracted and linked to the MCA at the state level. The following graph shows the addition of income to the word tornado from above:
The demographics and smoking status features are retained to aid interpretation. The states are added and their positions were determined by the PCs of their patients. The background color represents the average income level of the states placed in the vicinity. Higher income levels occur in the upper half of the graph. Prominent research suggests that among Americans, smoking decreases as income increases. The graph above seems to support this conclusion; however, there was no test of causality. In a more general interpretation, it could be evidence of the divide between richer coastal states and poorer inland states.
However, it can be inappropriate to use this state-based relationship to infer individual patient relationships. Andrew Gelman performed research that highlighted discontinuities between state and individual political relationships in his papers here and here.
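The state-level linkage described above can be sketched as a simple aggregation. This sketch assumes each state is positioned at the mean of its patients' first two PC scores, which is one reading of "determined by the PCs of their patients". The state codes, scores, and income figures are placeholders invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def state_centroids(patient_pcs):
    """Average the first two PC scores of the patients in each state."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for state, pc1, pc2 in patient_pcs:
        entry = sums[state]
        entry[0] += pc1
        entry[1] += pc2
        entry[2] += 1
    return {state: (sx / n, sy / n) for state, (sx, sy, n) in sums.items()}

def attach_income(centroids, income_by_state):
    """Join average income onto each state position, skipping unmatched states."""
    return {state: (*xy, income_by_state[state])
            for state, xy in centroids.items() if state in income_by_state}

# Placeholder data: (state code, PC1, PC2) per patient, invented values.
patient_pcs = [("AA", 0.5, 1.0), ("AA", 1.5, 2.0), ("BB", -1.0, -0.5)]
income_by_state = {"AA": 60000, "BB": 45000}

positions = attach_income(state_centroids(patient_pcs), income_by_state)
for state, (pc1, pc2, income) in positions.items():
    print(f"{state}: PC1={pc1:+.2f} PC2={pc2:+.2f} average income={income}")
```

Because every patient in a state receives the same income value, any pattern in the result describes states rather than individuals, which is the caution raised above.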
The first graph used high level categories of diagnosis to aid in interpretation. This supplemental graph uses specific diagnoses and medications to allow for extra insights:
At this more detailed level it is possible to see some male-specific diagnoses such as prostate disease. It is also possible to see more granularity between age and specific diagnoses.
These views have been made using all of the sample patients provided. A particular physician's practice would not treat patients in all parts of the cloud. Practice 7AFFC5D8-… patients have been highlighted with orange squares in this graph. This practice treats a disproportionate number of women, so their points are biased towards the female features. If the target audience were physicians in this practice, it would make sense to perform the MCA on just their patients to estimate more enlightening dimensions. These original dimensions would still be appropriate for users at a health plan level.
R Core Team (2012).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
ISBN 3-900051-07-0, http://www.R-project.org/.
Wickham H (2009).
ggplot2: elegant graphics for data analysis.
Springer New York.
ISBN 978-0-387-98140-6, http://had.co.nz/ggplot2/book.
Xie Y (2012).
knitr: A general-purpose package for dynamic report generation in R.
R package version 0.7, http://CRAN.R-project.org/package=knitr.
Venables WN and Ripley BD (2002).
Modern Applied Statistics with S, Fourth edition.
Springer, New York.
ISBN 0-387-95457-0, http://www.stats.ox.ac.uk/pub/MASS4.
Wickham H (2007).
“Reshaping Data with the reshape Package.”
Journal of Statistical Software, 21(12), pp. 1–20.
http://www.jstatsoft.org/v21/i12/.
Nenadic O and Greenacre M (2007).
“Correspondence Analysis in R, with two- and three-dimensional graphics: The ca package.”
Journal of Statistical Software, 20(3), pp. 1–13.
http://www.jstatsoft.org.
Bates D, Maechler M and Bolker B (2012).
lme4: Linear mixed-effects models using S4 classes.
R package version 0.999999-0, http://CRAN.R-project.org/package=lme4.
Ridgeway G (2012).
gbm: Generalized Boosted Regression Models.
R package version 1.6-3.2, http://CRAN.R-project.org/package=gbm.
Harrell FE Jr, with contributions from many other users (2012).
Hmisc: Harrell Miscellaneous.
R package version 3.9-3, http://CRAN.R-project.org/package=Hmisc.