Exploratory analysis of Medicare drug cost data 2011-2015
Health care systems world-wide are under pressure due to the high costs associated with disease. Now more than ever, particularly in developed countries, we have access to the latest advancements in medicine. This contrasts with the challenge of making those treatments available to as many patients as possible. It is imperative to find ways to maximize the positive impact on patients' quality of life while maintaining a sustainable health care system. For this purpose I performed an analysis of Medicare data in the USA. Furthermore, I used an open drug-disease database to cluster the costs by disease. I identified the most expensive diseases (mostly chronic diseases such as diabetes) and the most expensive medicines. A drug for the treatment of HCV infections (Harvoni) stands out with the highest total costs in 2015. After this first exploration, I propose an in-depth analysis of further data to enable more targeted conclusions and recommendations to improve health care, such as linking price databases to compare drug costs for similar indications, or analyzing population data registers that document lifestyle characteristics of healthy and sick individuals to identify those at risk of developing high-cost diseases.
Relevance
Health care costs account for a considerable part of national budgets all over the world. In 2015, $3.2 trillion was spent on health care in the USA (17.8% of its GDP). In Germany, health care spending reached 11.3% of GDP in 2014. On the one hand, these high health care costs can be explained by population growth, particularly in the elderly segment, which requires higher investments to secure quality of life. On the other hand, new medicines are continuously being discovered, enabling the treatment of diseases that were once a death sentence. As a consequence, many once-fatal diseases have become chronic conditions that place a high burden on health care costs.
But how can governments and insurers make sure that patients receive the care they need, including the latest technological advances, without bankrupting the system? A first step is the identification of high-cost diseases and drugs. These insights can then be used to identify population segments at high risk of developing a disease, which can then be the focus of prevention measures.
Governments, insurers, patient organizations, and pharmaceutical and biotech companies all need to leverage their available data if we are to improve the health of patients now and in future generations.
Methods
Data sources
- Medicare Drug Spending Data 2011-2015: drug spending and utilization data. In this analysis only Medicare Part D drugs were considered (drugs patients generally administer themselves)
- Therapeutic Targets Database: Drug-to-disease mapping with ICD identifiers.
Tools
- pandas for data crunching
- fuzzywuzzy for fuzzy string matching
- git for version control
Data preprocessing
First, I cleaned up and processed the drug spending data available from Medicare for the years 2011-2015. This data includes, among other fields, the total spending, claim count, and beneficiary count for each drug, identified by its brand and generic names.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Paired')
sns.set_style('whitegrid')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('data/medicare_data_disease.csv')
data.head()
I also processed the data from the Therapeutic Targets Database, which presents the indications (diseases) associated with a drug generic name.
diseases = pd.read_csv('data/drug-disease_keys.csv')
diseases.head()
Then, I used fuzzy string matching to pair each generic drug name in the Medicare data with the closest entry in the Therapeutic Targets Database. Once I had the list of matches, I assigned the first indication associated with each match to the corresponding Medicare drug. For details on how I did this, please check my GitHub repository.
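The matching step described above can be illustrated with a minimal sketch. It is not the code from the repository: `difflib` from the Python standard library stands in for fuzzywuzzy so that the example has no third-party dependency, and the function name `closest_generic_name`, the reference names, and the cutoff value are all assumptions made for illustration.

```python
import difflib

def closest_generic_name(name, candidates, cutoff=0.8):
    """Return the candidate most similar to `name`, or None if nothing reaches `cutoff`."""
    # Normalize case and outer whitespace so trivial differences do not lower the score.
    normalized = {c.strip().lower(): c for c in candidates}
    hits = difflib.get_close_matches(name.strip().lower(), list(normalized),
                                     n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None

# Hypothetical reference names and indications (illustrative, not taken from the database).
reference = {
    "insulin glargine": "diabetes",
    "esomeprazole": "peptic ulcer disease",
    "rosuvastatin": "hypercholesterolemia",
}

for medicare_name in ["Insulin Glargine", "Esomeprazole Magnesium", "Unknown Compound"]:
    match = closest_generic_name(medicare_name, reference, cutoff=0.6)
    # Names without a sufficiently close match get no indication assigned.
    print(medicare_name, "->", match, "->", reference.get(match))
```

Using a cutoff means that names without a reasonable counterpart stay unassigned instead of being forced onto the nearest, possibly unrelated, entry.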
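# numeric_only=True keeps string columns out of the sum (needed on recent pandas versions)
spending = data.groupby('Indication').sum(numeric_only=True).sort_values(by='Total Spending', ascending=False)
spending.head()
spending_drug = data.groupby('Brand Name').sum(numeric_only=True).sort_values(by='Total Spending', ascending=False)
spending_drug.head()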
n_top = 40
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,8))
g = sns.barplot(x='Total Spending', y='Indication', data=spending.reset_index()[:n_top], estimator=np.sum, ax=ax1,
color=sns.xkcd_rgb['dodger blue'])
g.set(yticklabels=[i[:27] for i in spending[:n_top].index])
g.set_xlabel('Total Spending $')
g2 = sns.barplot(x='Total Spending', y='Brand Name', data=spending_drug.reset_index()[:n_top], estimator=np.sum, ax=ax2,
color='lightblue')
g2.set(yticklabels=[i[:20] for i in spending_drug[:n_top].index])
g2.set_xlabel('Total Spending $')
#plt.title('Top 50 indications by Beneficiary Count Sum from 2011 to 2015')
fig.suptitle('Top %s indications and drugs for 5-year total spending 2011-2015' %n_top, size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.94)
plt.savefig('Top_%s_disease_drug.png' %n_top, dpi=300, bbox_inches='tight')
Indications (left part)
A look at the total spending for the 5-year period 2011-2015 reveals that the bulk of drug spending is concentrated in a small set of diseases/indications (left graph). The total spending per indication decreases rapidly going down the list of indications.
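One way to put a number on this concentration is the cumulative share of spending covered by the top-ranked indications. The following is a minimal sketch on made-up totals; the column names follow the Medicare table used above, but the values and the 80% threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical spending totals per indication (illustrative numbers, not Medicare data).
toy = pd.DataFrame({
    "Indication": ["A", "B", "C", "D", "E"],
    "Total Spending": [500.0, 250.0, 150.0, 60.0, 40.0],
})

totals = (toy.groupby("Indication")["Total Spending"]
             .sum()
             .sort_values(ascending=False))

# Cumulative share of overall spending when walking down the ranked list.
cumulative_share = totals.cumsum() / totals.sum()

# Number of top indications needed to cover at least 80% of spending.
n_for_80 = int((cumulative_share < 0.8).sum()) + 1

print(cumulative_share.round(2))
print("Indications needed for 80% of spending:", n_for_80)
```

Applied to the `spending` table, the same two lines (`cumsum` divided by `sum`) would show how few indications account for most of the 5-year total.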
Diabetes occupies first place in this list, with a total 5-year spending exceeding $50 billion. Next in the list are other chronic diseases such as schizophrenia, chronic obstructive pulmonary disease, hypertension (high blood pressure), hypercholesterolemia (high cholesterol), depression, HIV infections, multiple sclerosis, peptic ulcer disease, and chronic HCV infection (hepatitis C). Interestingly, pain medications are also among the top 4 indications by total spending.
It makes sense that treatment of chronic diseases receives the highest investment in drug spending, as patients with these diseases can live long lives when medicated.
Interestingly, the first cancer indication (chronic myelogenous leukemia) only appears in 19th place on this list. However, it must be noted that cancer is actually a collection of different diseases with different genetics, origins, and treatment options. These different cancers were not clustered in this analysis.
Drugs (right part)
When we look at the drugs with the highest total 5-year spending, we find at the top of the list Lantus (insulin), Nexium (peptic ulcer), and Crestor (cholesterol-lowering). This makes sense, as these are medicines to treat chronic diseases.
However, we cannot learn much on a high level from looking at the total spending only. Therefore, a closer look is needed.
spend_2015_ind = data[data['Year']==2015].groupby('Indication').sum(numeric_only=True).sort_values(by='Total Spending', ascending=False)
#spend_2015_drug = data[data['Year']==2015].groupby('Brand Name').sum().sort_values(by='Total Spending',
# ascending=False)
spend_2015_ind.head()
top_10_spend = data[data['Year']==2015].sort_values(by='Total Spending', ascending=False)[['Brand Name',
'Total Spending',
'Year']][:10]
top_10_spend
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,5))
# pointplot is the axes-level equivalent of the deprecated factorplot and accepts ax=
sns.pointplot(x='Year', y='Total Spending', hue='Brand Name', palette='coolwarm',
              hue_order=top_10_spend['Brand Name'].tolist(),
              data=data[data['Brand Name'].isin(top_10_spend['Brand Name'])], ax=ax1)
ax1.set_title('Annual spending for top 10 drugs')
ax1.set_ylabel('Total Spending $')
ax2.scatter(x=spend_2015_ind['Beneficiary Count'][:100],
y=spend_2015_ind['Total Spending'][:100],
s=spend_2015_ind['Claim Count'][:100]/100000,
#c=spend_2015_ind.reset_index()['Indication'][:100])
color=sns.xkcd_rgb['dodger blue'], alpha=0.75)
ax2.set_title('Top 100 indications in 2015')
ax2.set_xlabel('Beneficiary Count')
ax2.set_ylabel('Total Spending $')
ax2.set_xlim(left=0)
ax2.set_ylim(bottom=0)
# Label the ten indications with the highest 2015 spending
for label, x, y in zip(spend_2015_ind.index,
                       spend_2015_ind['Beneficiary Count'][:10],
                       spend_2015_ind['Total Spending'][:10]):
    ax2.annotate(label, xy=(x, y), color='red', alpha=0.75)
fig.suptitle('Annual drug spending development and overview of highest-cost indications', size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.85)
plt.savefig('Top_bubble_disease_drug.png', dpi=300, bbox_inches='tight')
Annual spending development for top 10 drugs (left)
The drug landscape is not static over time. Therefore, I analyzed the annual spending since 2011 for the top 10 drugs of 2015. Eight of these ten drugs received higher spending every year, reflecting the general upward trend in health care spending. However, the rate of growth differs dramatically from drug to drug. Particularly striking is the case of Harvoni, which exhibited a >7-fold growth in total spending between 2014 and 2015.
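A fold change like the one quoted above can be computed by reshaping the long-format table so that each year becomes a column. This is a minimal sketch on invented numbers; the drug names and spending values are hypothetical, and only the column names are taken from the Medicare table.

```python
import pandas as pd

# Hypothetical rows in the same long format as the Medicare table (illustrative values).
toy = pd.DataFrame({
    "Brand Name": ["Drug X", "Drug X", "Drug Y", "Drug Y"],
    "Year": [2014, 2015, 2014, 2015],
    "Total Spending": [100.0, 750.0, 200.0, 220.0],
})

# One row per drug, one column per year.
by_year = toy.pivot_table(index="Brand Name", columns="Year",
                          values="Total Spending", aggfunc="sum")

# Fold change from 2014 to 2015; a value of 7.5 means spending grew 7.5-fold.
fold_change = (by_year[2015] / by_year[2014]).sort_values(ascending=False)
print(fold_change)
```

Drugs without spending in the earlier year would come out as missing or infinite values and need separate handling.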
Harvoni is a medicine for the treatment of hepatitis C (HCV infection) that was launched in 2014. It is the first drug with cure rates close to 100%. Harvoni practically cures a chronic disease, and this is reflected in its price of over $90k for a 12-week treatment.
The remaining drugs in the figure are mostly used for the treatment of chronic diseases.
But how can we more extensively evaluate the burden posed by the different diseases/indications?
Top 100 indications in 2015 (right)
To find out more about the distribution of the most expensive indications, I plotted drug spending grouped by indication for the year 2015 in a scatter plot. This way, we can look not only at the total spending but also at the number of beneficiaries for a particular indication. The size of the bubbles represents the relative number of claims.
From this graph we can assess the magnitude of how the most expensive diseases affect society. Diabetes is not only the most expensive single indication by total spending but also affects a very large number of people.
The indications with the most beneficiaries are hypertension, pain, and high cholesterol. They also have some of the highest claim counts (bubble size). Relative to their total spending, this suggests that the average cost per claim is low, as these are generally medications with expired patents that are priced very low.
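The average cost per claim discussed above is a simple ratio of two columns. The sketch below uses two invented indications to contrast a high-volume, low-price case with a low-volume, high-price one; the numbers are hypothetical, and only the column names match the Medicare table.

```python
import pandas as pd

# Hypothetical per-indication totals for one year (illustrative values only).
toy = pd.DataFrame({
    "Indication": ["common, generic-treated", "rare, high-cost"],
    "Total Spending": [1_000_000.0, 5_000_000.0],
    "Claim Count": [100_000, 500],
    "Beneficiary Count": [20_000, 100],
}).set_index("Indication")

# Average spending per claim and per beneficiary.
toy["Spending per Claim"] = toy["Total Spending"] / toy["Claim Count"]
toy["Spending per Beneficiary"] = toy["Total Spending"] / toy["Beneficiary Count"]

print(toy[["Spending per Claim", "Spending per Beneficiary"]])
```

One caveat: when beneficiary counts are summed across several drugs for the same indication, a patient taking more than one of those drugs may be counted more than once, so the per-beneficiary figure is best read as a rough indicator.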
It is again interesting to look at chronic HCV infection, the indication for the drug Harvoni. Both the number of beneficiaries and the number of claims are extremely low compared with other diseases. Nevertheless, chronic HCV infection ranked second in total drug spending in 2015.
Next steps
I have shown in this analysis that interesting insights can be gained from analyzing a small set of publicly available data. It follows that a more detailed analysis could enable more targeted conclusions and recommendations for improving the health care system and the quality of life of patients suffering from disease. Access to non-public data would make even deeper analysis possible.
Additional analysis could include:
- Clustering of diseases/indications to higher-level categories (cancer, metabolic disease, circulatory disease, nervous system disease, etc.)
- Linking of price databases to compare drug costs for the same indication on a population level
- Analysis of population data registers that document lifestyle characteristics of healthy and ill individuals to identify those at risk of developing high-cost diseases (e.g. Medical Expenditure Panel Survey, Behavioral Risk Factor Surveillance System data)
Limitations
One limitation of this analysis is that only Part D drugs were considered. A further analysis could also include Part B drugs.
Moreover, it was assumed that the fuzzy string matching was successful in most cases. A more detailed test is required to assess match success more stringently.
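A possible starting point for such a test is to score every assigned match and flag the weakest ones for manual review. The sketch below is an assumption about how this could look, not something done in this analysis: `difflib.SequenceMatcher` from the standard library stands in for fuzzywuzzy, and the name pairs and the review threshold are hypothetical.

```python
import difflib

def match_score(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer whitespace."""
    return difflib.SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

# Hypothetical (Medicare name, matched reference name) pairs, for illustration only.
pairs = [
    ("Insulin Glargine", "insulin glargine"),
    ("Esomeprazole Magnesium", "esomeprazole"),
    ("Unknown Compound", "insulin glargine"),
]

threshold = 0.6  # assumed review cutoff; would need tuning against a hand-checked sample

scored = [(a, b, match_score(a, b)) for a, b in pairs]
needs_review = [(a, b, round(s, 2)) for a, b, s in scored if s < threshold]
share_flagged = len(needs_review) / len(scored)

print("Matches to review by hand:", needs_review)
print("Share of matches flagged:", round(share_flagged, 2))
```

Checking the flagged pairs, plus a random sample of the unflagged ones, against the true drug identity would give an estimate of how often the matching assigns the wrong indication.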
All conclusions are only valid for the 2011-2015 interval. No data for 2016 was analyzed.