Health care systems world-wide are under pressure due to the high costs associated with disease. Now more than ever, particularly in developed countries, we have access to the latest advancements in medicine. This contrasts with the challenge of making those treatments available to as many patients as possible. It is imperative to find ways maximize the positive impact on the quality of life of patients, while maintaining a sustainable health care system. For this purpose I performed an analysis of Medicare data in the USA. Furthermore I used a drug-disease open database to cluster the costs by disease. I identified the most expensive diseases (mostly chronic diseases such as Diabetes) and the most expensive medicines. A drug for the treatment of HCV infections (Harvoni) stands out with the highest total costs in 2015. After this first exploration, I propose the in-depth analysis of further data to enable more targeted conclusions and recommendations to improve health care, such as linking of price databases to compare drug costs for the similar indications or the analysis of population data registers that document life style characteristics of healthy and sick individuals to identify those at risk of developing high-cost diseases.

Relevance

Health care costs amount to a considerable part of the national budgets all over the world. In 2015, $3.2 trillion were spent for health care in the USA (17.8% of its GDP). In Germany, the health care spending reached 11.3% of GDP in 2014. On the one hand, this high health care costs can be explained by the population growth, particularly the elderly proportion, requiring higher investments to secure quality of life. On the other hand, new medicines are continously being discovered enabling the treatment of diseases that were once a sentence of death. This has as a consequence that many once fatal diseases have now become chronic with a high burden on the health care costs.

But how can governments and insurers make sure that patients receive the care they need, including latest technology advances, without bankrupting the system? One first step is the identification of high-cost diseases and drugs. This insights can then be used to identify population segments at high-risk of developing a disease, who can then be the focus of prevention measures.

Governments, insurers, patient organizations, pharmaceutical and biotech companies need all to leverage their available data, if we are to improve the health of patients now and in future generations.

Methods

Data sources

Tools

  • pandas for data crunching
  • fuzzywuzzy for fuzzy logic matching
  • git for version control

Data preprocessing

First, I cleaned up and processed the drug spending data available from Medicare for the years 2011-2015. This data includes the total spending, claim number, and beneficiary number --among others-- for each drug identified by its brand and generic names.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Paired')
sns.set_style('whitegrid')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('data/medicare_data_disease.csv')
data.head()
Unnamed: 0 Brand Name Generic Name Claim Count Total Spending Beneficiary Count Total Annual Spending Per User Unit Count Average Cost Per Unit (Weighted) Beneficiary Count No LIS Average Beneficiary Cost Share No LIS Beneficiary Count LIS Average Beneficiary Cost Share LIS Year Matched Drug Name Indication
0 0 10 wash sulfacetamide sodium 24.0 1569.19 16.0 98.074375 5170.0 0.303518 NaN NaN NaN NaN 2011 sulfacetamide Acne vulgaris
1 1 1st tier unifine pentips pen needle, diabetic 2472.0 57666.73 893.0 64.576405 293160.0 0.196766 422.0 42.347204 471.0 7.54586 2011 NaN NaN
2 2 1st tier unifine pentips plus pen needle, diabetic NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011 NaN NaN
3 3 60pse-400gfn-20dm guaifenesin/dm/pseudoephedrine 12.0 350.10 11.0 31.827273 497.0 0.704427 NaN NaN NaN NaN 2011 pseudoephedrine Nasal congestion
4 4 8-mop methoxsalen 11.0 9003.26 NaN NaN 298.0 30.212282 NaN NaN NaN NaN 2011 methoxsalen Cutaneous T-cell lymphoma

I also processed the data from the Therapeutic Targets Database, which presents the indications (diseases) associated with a drug generic name.

diseases = pd.read_csv('data/drug-disease_keys.csv')
diseases.head()
Unnamed: 0 TTDDRUGID LNM Indication ICD9 ICD10
0 0 DAP000001 quetiapine Schizophrenia 295, 710.0 F20, M32
1 1 DAP000002 theophylline Chronic obstructive pulmonary disease 490-492, 494-496 J40-J44, J47
2 2 DAP000003 risperidone Schizophrenia 295, 710.0 F20, M32
3 3 DAP000004 dasatinib Chronic myelogenous leukemia 205.1, 208.9 C91-C95, C92.1
4 4 DAP000004 dasatinib Solid tumours; Multiple myeloma 140-199, 203.0, 210-229 C00-C75, C7A, C7B, C90.0, D10-D36, D3A

Then, I used a fuzzy logic algorithm to match each drug generic name of the Medicare data with the closest element from the Therapeutic Targets Database. After having a list of exact matches, I assigned the first associated indication to each Medicare drug. For details on how I did this, please check my github repository.

Results

Figure 1: Most expensive drugs and indications by total spending in a 5-year interval

spending = data.groupby('Indication').sum().sort_values(by='Total Spending', ascending=False)
spending.head()
Unnamed: 0 Claim Count Total Spending Beneficiary Count Total Annual Spending Per User Unit Count Average Cost Per Unit (Weighted) Beneficiary Count No LIS Average Beneficiary Cost Share No LIS Beneficiary Count LIS Average Beneficiary Cost Share LIS Year
Indication
Diabetes mellitus 1619475 367700954.0 5.360758e+10 73949096.0 562815.013217 2.494315e+10 7277.339052 40704222.0 96730.900344 33243845.0 6840.136642 1449360
Schizophrenia 589890 120108011.0 3.029475e+10 16787537.0 665302.614794 5.232045e+09 23737.078668 4132330.0 75431.535074 12655011.0 2733.240495 503250
Chronic obstructive pulmonary disease 260320 85571788.0 2.668149e+10 18010399.0 117832.051227 5.169355e+09 1525.772351 8891428.0 16115.553801 9118941.0 953.965398 201300
Pain 2076820 449297282.0 2.237135e+10 125509481.0 635933.153421 3.582438e+10 8484.565888 68195577.0 86753.860239 57310277.0 5810.508368 1660725
Hypertension 1241330 659834372.0 2.140793e+10 127524840.0 453862.885999 3.924869e+10 2875.275230 84758076.0 99376.738594 42766156.0 8392.438238 1338645
spending_drug = data.groupby('Brand Name').sum().sort_values(by='Total Spending', ascending=False)
spending_drug.head()
Unnamed: 0 Claim Count Total Spending Beneficiary Count Total Annual Spending Per User Unit Count Average Cost Per Unit (Weighted) Beneficiary Count No LIS Average Beneficiary Cost Share No LIS Beneficiary Count LIS Average Beneficiary Cost Share LIS Year
Brand Name
lantus/lantus solostar 10560 40959410.0 1.419734e+10 7627126.0 9059.358978 7.905816e+08 87.058010 3685935.0 1757.877943 3941191.0 119.054297 10065
nexium 13330 37338541.0 1.129409e+10 6968266.0 8159.874433 1.624007e+09 35.029401 2808631.0 1282.120804 4159635.0 108.193170 10065
crestor 4670 43304032.0 1.084924e+10 8312848.0 6460.499385 1.917524e+09 27.902605 5275631.0 1477.330224 3037217.0 126.576936 10065
advair diskus 385 30806126.0 1.036056e+10 7096159.0 7313.316443 2.273054e+09 22.805665 3613170.0 1284.084879 3482989.0 98.586905 10065
abilify 45 12506518.0 9.434570e+09 1861785.0 25165.999223 3.818410e+08 127.685877 333884.0 2541.346071 1527901.0 101.829169 10065
n_top = 40
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,8))
g = sns.barplot(x='Total Spending', y='Indication', data=spending.reset_index()[:n_top], estimator=np.sum, ax=ax1, 
                color=sns.xkcd_rgb['dodger blue'])
g.set(yticklabels=[i[:27] for i in spending[:n_top].index])
g.set_xlabel('Total Spending $')
g2 = sns.barplot(x='Total Spending', y='Brand Name', data=spending_drug.reset_index()[:n_top], estimator=np.sum, ax=ax2,
                 color='lightblue')
g2.set(yticklabels=[i[:20] for i in spending_drug[:n_top].index])
g2.set_xlabel('Total Spending $')
#plt.title('Top 50 indications by Beneficiary Count Sum from 2011 to 2015')
fig.suptitle('Top %s indications and drugs for 5-year total spending 2011-2015' %n_top, size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.94)
plt.savefig('Top_%s_disease_drug.png' %n_top, dpi=300, bbox_inches='tight')

Indications (left part)

A look at the total spending for the 5-year period 2011-2015 reveals that the bulk of drug spending is covered by a small set of diseases/indications (left graph). The total spending per indication decreases rapidly by going down the list of drugs.

Diabetes occupies the first place in this list with a total 5-year spending exceding $50 billion. Following in the list, we find other chronic diseases such as schizophrenia, chronic obstructive pulmonary disease, hypertension (high blood pressure), hypercholesterolemia (high cholesterol), depression, hiv infections, multiple sclerosis, peptic ulcer disease, and chronic HCV infection (hepatitis C). Interestingly, pain medications are also in the top 4 indications by total spending.

It makes sense that treatment of chronic diseases receives the highest investment in drug spending, as patients with these diseases can live long lives when medicated.

Interestingly, the first cancer reaches only the 19th place of this list (chronic myelogenous leukemia). However, it must be noted that cancer is actually a collection of different diseases with different genetics, origin, and treatment options. These different cancers were not clustered in this analysis.

Drugs (right part)

When we look at the most expensive drugs for the total 5-year spending, we find on the top of the list: Lantus (insulin), nexium (peptic ulcer), and crestor(anti cholesterol). It makes sense as these are medicines to treat chronic diseases.

However, we cannot learn much on a high level from looking at the total spending only. Therefore, a closer look is needed.

Figure 2: Drug spending is growing but at very heterogeneous rates

spend_2015_ind = data[data['Year']==2015].groupby('Indication').sum().sort_values(by='Total Spending', ascending=False)
#spend_2015_drug = data[data['Year']==2015].groupby('Brand Name').sum().sort_values(by='Total Spending', 
# ascending=False)
spend_2015_ind.head()
Unnamed: 0 Claim Count Total Spending Beneficiary Count Total Annual Spending Per User Unit Count Average Cost Per Unit (Weighted) Beneficiary Count No LIS Average Beneficiary Cost Share No LIS Beneficiary Count LIS Average Beneficiary Cost Share LIS Year
Indication
Diabetes mellitus 323895 80808515.0 1.538882e+10 16756712.0 179199.546237 5.861327e+09 2112.287818 9688833.0 24716.663299 7067683.0 1459.860044 290160
Chronic HCV infection 5421 272915.0 8.349020e+09 90487.0 182144.098903 7.546096e+06 2134.680173 30454.0 10701.886971 60033.0 156.351475 4030
Chronic obstructive pulmonary disease 52064 17764181.0 6.756824e+09 3803356.0 30714.837787 1.067257e+09 561.945076 1931374.0 4177.371552 1871982.0 232.217985 40300
Schizophrenia 117978 25030047.0 5.468897e+09 3493417.0 192134.968549 1.084651e+09 8292.122415 938911.0 19320.941152 2554333.0 573.729378 100750
Pain 415364 94109025.0 4.956161e+09 27047833.0 164077.757502 7.597111e+09 4357.883695 15366894.0 19810.346154 11680708.0 1105.913592 332475
top_10_spend = data[data['Year']==2015].sort_values(by='Total Spending', ascending=False)[['Brand Name', 
                                                                                           'Total Spending', 
                                                                                           'Year']][:10]
top_10_spend
Brand Name Total Spending Year
19770 harvoni 7.030633e+09 2015
20104 lantus/lantus solostar 4.359504e+09 2015
18926 crestor 2.883122e+09 2015
18069 advair diskus 2.270016e+09 2015
21640 spiriva 2.191466e+09 2015
19988 januvia 2.131952e+09 2015
21404 revlimid 2.077425e+09 2015
20658 nexium 2.012921e+09 2015
20291 lyrica 1.766474e+09 2015
19818 humira/humira pen 1.662292e+09 2015
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,5))

g=sns.factorplot(x='Year', y='Total Spending', hue='Brand Name', palette='coolwarm', 
                 hue_order=top_10_spend['Brand Name'],
                 data=data[data['Brand Name'].isin(top_10_spend['Brand Name'])], ax=ax1)
ax1.set_title('Annual spending for top 10 drugs')
ax1.set_ylabel('Total Spending $')
plt.close(g.fig)

ax2.scatter(x=spend_2015_ind['Beneficiary Count'][:100], 
            y=spend_2015_ind['Total Spending'][:100],
            s=spend_2015_ind['Claim Count'][:100]/100000,
            #c=spend_2015_ind.reset_index()['Indication'][:100])
            color=sns.xkcd_rgb['dodger blue'], alpha=0.75)
ax2.set_title('Top 100 indications in 2015')
plt.xlabel('Beneficiary Count')
plt.ylabel('Total Spending $')
plt.axis([0, None, 0, None])
for label, x, y in zip(spend_2015_ind.index, 
                       spend_2015_ind['Beneficiary Count'][:10], 
                       spend_2015_ind['Total Spending'][:10]):
    plt.annotate(label, xy=(x, y), color='red', alpha=0.75)
fig.suptitle('Annual drug spending development and overview of highest-cost indications', size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.85)
plt.savefig('Top_bubble_disease_drug.png', dpi=300, bbox_inches='tight')

Annual spending development for top 10 drugs (left)

The drug landscape is not temporally static. Therefore, I analyzed the annual spending since 2011 for the 10 top drugs in 2015. Eight out of these ten drugs consistently received higher spending every year, a reflection of the general health care spending panorama. However, the rate of growth for each drug is dramatically different. Particularly striking is the case of the drug Harvoni, which exhibited a >7-fold growth in total spending between 2014 and 2015.

Harvoni is a medicine for the treatment of hepatitis C (HCV infection) that was launched in 2014. It is the first drug with cure rates close to 100%. Harvoni practically cures a chronic disease and this is reflected in its pricing at over $90k for a 12 week treatment.

The remaining drugs in the figure are mostly used for the treatment of chronic diseases.

But how can we more extensively evaluate the burden posed by the different diseases/indications?

Top 100 indications in 2015 (right)

In order to find out more about the distribution of the most expensive indications, I plotted the drug spendings grouped by indication for the year 2015 in a scatter plot. This way, we can not only look at the total spending but also at the number of beneficiaries for a particular indication. The size of the bubbles represents the relative number of claims.

From this graph we can assess the magnitude of how the most expensive diseases affect society. Diabetes is not only the most expensive single indication by total spending but also affects a very large number of people.

The indications with the most beneficiaries are hypertension, pain and high cholesterol. They also represent some of the highest number of claims (bubble size). This indicates that the average cost associated with each claim is low, as these are generally medications with expired patents that are priced very low.

Again it is interesting to take a look at chronic HCV infection. This is the indication for the drug Harvoni. Both the number of beneficiaries and claims are extremely low compared with other diseases. However, chronic HCV infection reached the second place in the highes total drug spending in 2015.

Next steps

I have shown in this analysis that very interesting insights can be gained from analyzing a smaller set of publicly available data. It follows that a more detailed and deeper analysis could enable more targeted conclusions and recommendations for improving the health care system and the quality of life of patients suffering from disease.
Access to non-public owned data would make even deeper analysis possible.

Additional analysis could include:

  • Clustering of diseases/indications to higher-level categories (cancer, metabolic disease, circulatory disease, nervous system disease, etc.)
  • Linking of price databases to compare drug costs for the same indication on a population level
  • Analysis of population data registers that document life style characteristics of healthy and ill individuals to identify those at risk of developing high-cost diseases (e.g. Medical Expenditure Panel Survey, Behavioral Risk Factor Surveillance System data)

Limitations

One limitation from this analysis is that only Part D drugs were considered. A further analysis could include Part B drugs too.

Moreover it was assumed that the fuzzy logic matching was successful in most cases. A more detailed test is required to assess match success more stringently.

All conclusions are only valid for the 2011-2015 interval. No data for 2016 was analyzed.