Health care systems world-wide are under pressure due to the high costs associated with disease. Now more than ever, particularly in developed countries, we have access to the latest advancements in medicine. This contrasts with the challenge of making those treatments available to as many patients as possible. It is imperative to find ways maximize the positive impact on the quality of life of patients, while maintaining a sustainable health care system. For this purpose I performed an analysis of Medicare data in the USA. Furthermore I used a drug-disease open database to cluster the costs by disease. I identified the most expensive diseases (mostly chronic diseases such as Diabetes) and the most expensive medicines. A drug for the treatment of HCV infections (Harvoni) stands out with the highest total costs in 2015. After this first exploration, I propose the in-depth analysis of further data to enable more targeted conclusions and recommendations to improve health care, such as linking of price databases to compare drug costs for the similar indications or the analysis of population data registers that document life style characteristics of healthy and sick individuals to identify those at risk of developing high-cost diseases.

Relevance

Health care costs amount to a considerable part of the national budgets all over the world. In 2015, $3.2 trillion were spent for health care in the USA (17.8% of its GDP). In Germany, the health care spending reached 11.3% of GDP in 2014. On the one hand, this high health care costs can be explained by the population growth, particularly the elderly proportion, requiring higher investments to secure quality of life. On the other hand, new medicines are continously being discovered enabling the treatment of diseases that were once a sentence of death. This has as a consequence that many once fatal diseases have now become chronic with a high burden on the health care costs.

But how can governments and insurers make sure that patients receive the care they need, including latest technology advances, without bankrupting the system? One first step is the identification of high-cost diseases and drugs. This insights can then be used to identify population segments at high-risk of developing a disease, who can then be the focus of prevention measures.

Governments, insurers, patient organizations, pharmaceutical and biotech companies need all to leverage their available data, if we are to improve the health of patients now and in future generations.

Methods

Data sources

Medicare Drug Spending Data 2011-2015: drug spending and utilization data. In this analysis only Medicare Part D drugs were considered (drugs patients generally administer themselves)
Therapeutic Targets Database: Drug-to-disease mapping with ICD identifiers.

Tools

pandas for data crunching
fuzzywuzzy for fuzzy logic matching
git for version control

Data preprocessing

First, I cleaned up and processed the drug spending data available from Medicare for the years 2011-2015. This data includes the total spending, claim number, and beneficiary number --among others-- for each drug identified by its brand and generic names.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Paired')
sns.set_style('whitegrid')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('data/medicare_data_disease.csv')
data.head()

I also processed the data from the Therapeutic Targets Database, which presents the indications (diseases) associated with a drug generic name.

diseases = pd.read_csv('data/drug-disease_keys.csv')
diseases.head()

Then, I used a fuzzy logic algorithm to match each drug generic name of the Medicare data with the closest element from the Therapeutic Targets Database. After having a list of exact matches, I assigned the first associated indication to each Medicare drug. For details on how I did this, please check my github repository.

Results

Figure 1: Most expensive drugs and indications by total spending in a 5-year interval

spending = data.groupby('Indication').sum().sort_values(by='Total Spending', ascending=False)
spending.head()

spending_drug = data.groupby('Brand Name').sum().sort_values(by='Total Spending', ascending=False)
spending_drug.head()

n_top = 40
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,8))
g = sns.barplot(x='Total Spending', y='Indication', data=spending.reset_index()[:n_top], estimator=np.sum, ax=ax1, 
                color=sns.xkcd_rgb['dodger blue'])
g.set(yticklabels=[i[:27] for i in spending[:n_top].index])
g.set_xlabel('Total Spending $')
g2 = sns.barplot(x='Total Spending', y='Brand Name', data=spending_drug.reset_index()[:n_top], estimator=np.sum, ax=ax2,
                 color='lightblue')
g2.set(yticklabels=[i[:20] for i in spending_drug[:n_top].index])
g2.set_xlabel('Total Spending $')
#plt.title('Top 50 indications by Beneficiary Count Sum from 2011 to 2015')
fig.suptitle('Top %s indications and drugs for 5-year total spending 2011-2015' %n_top, size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.94)
plt.savefig('Top_%s_disease_drug.png' %n_top, dpi=300, bbox_inches='tight')

Indications (left part)

A look at the total spending for the 5-year period 2011-2015 reveals that the bulk of drug spending is covered by a small set of diseases/indications (left graph). The total spending per indication decreases rapidly by going down the list of drugs.

Diabetes occupies the first place in this list with a total 5-year spending exceding $50 billion. Following in the list, we find other chronic diseases such as schizophrenia, chronic obstructive pulmonary disease, hypertension (high blood pressure), hypercholesterolemia (high cholesterol), depression, hiv infections, multiple sclerosis, peptic ulcer disease, and chronic HCV infection (hepatitis C). Interestingly, pain medications are also in the top 4 indications by total spending.

It makes sense that treatment of chronic diseases receives the highest investment in drug spending, as patients with these diseases can live long lives when medicated.

Interestingly, the first cancer reaches only the 19th place of this list (chronic myelogenous leukemia). However, it must be noted that cancer is actually a collection of different diseases with different genetics, origin, and treatment options. These different cancers were not clustered in this analysis.

Drugs (right part)

When we look at the most expensive drugs for the total 5-year spending, we find on the top of the list: Lantus (insulin), nexium (peptic ulcer), and crestor(anti cholesterol). It makes sense as these are medicines to treat chronic diseases.

However, we cannot learn much on a high level from looking at the total spending only. Therefore, a closer look is needed.

Figure 2: Drug spending is growing but at very heterogeneous rates

spend_2015_ind = data[data['Year']==2015].groupby('Indication').sum().sort_values(by='Total Spending', ascending=False)
#spend_2015_drug = data[data['Year']==2015].groupby('Brand Name').sum().sort_values(by='Total Spending', 
# ascending=False)
spend_2015_ind.head()

top_10_spend = data[data['Year']==2015].sort_values(by='Total Spending', ascending=False)[['Brand Name', 
                                                                                           'Total Spending', 
                                                                                           'Year']][:10]
top_10_spend

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,5))

g=sns.factorplot(x='Year', y='Total Spending', hue='Brand Name', palette='coolwarm', 
                 hue_order=top_10_spend['Brand Name'],
                 data=data[data['Brand Name'].isin(top_10_spend['Brand Name'])], ax=ax1)
ax1.set_title('Annual spending for top 10 drugs')
ax1.set_ylabel('Total Spending $')
plt.close(g.fig)

ax2.scatter(x=spend_2015_ind['Beneficiary Count'][:100], 
            y=spend_2015_ind['Total Spending'][:100],
            s=spend_2015_ind['Claim Count'][:100]/100000,
            #c=spend_2015_ind.reset_index()['Indication'][:100])
            color=sns.xkcd_rgb['dodger blue'], alpha=0.75)
ax2.set_title('Top 100 indications in 2015')
plt.xlabel('Beneficiary Count')
plt.ylabel('Total Spending $')
plt.axis([0, None, 0, None])
for label, x, y in zip(spend_2015_ind.index, 
                       spend_2015_ind['Beneficiary Count'][:10], 
                       spend_2015_ind['Total Spending'][:10]):
    plt.annotate(label, xy=(x, y), color='red', alpha=0.75)
fig.suptitle('Annual drug spending development and overview of highest-cost indications', size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.85)
plt.savefig('Top_bubble_disease_drug.png', dpi=300, bbox_inches='tight')

Annual spending development for top 10 drugs (left)

The drug landscape is not temporally static. Therefore, I analyzed the annual spending since 2011 for the 10 top drugs in 2015. Eight out of these ten drugs consistently received higher spending every year, a reflection of the general health care spending panorama. However, the rate of growth for each drug is dramatically different. Particularly striking is the case of the drug Harvoni, which exhibited a >7-fold growth in total spending between 2014 and 2015.

Harvoni is a medicine for the treatment of hepatitis C (HCV infection) that was launched in 2014. It is the first drug with cure rates close to 100%. Harvoni practically cures a chronic disease and this is reflected in its pricing at over $90k for a 12 week treatment.

The remaining drugs in the figure are mostly used for the treatment of chronic diseases.

But how can we more extensively evaluate the burden posed by the different diseases/indications?

Top 100 indications in 2015 (right)

In order to find out more about the distribution of the most expensive indications, I plotted the drug spendings grouped by indication for the year 2015 in a scatter plot. This way, we can not only look at the total spending but also at the number of beneficiaries for a particular indication. The size of the bubbles represents the relative number of claims.

From this graph we can assess the magnitude of how the most expensive diseases affect society. Diabetes is not only the most expensive single indication by total spending but also affects a very large number of people.

The indications with the most beneficiaries are hypertension, pain and high cholesterol. They also represent some of the highest number of claims (bubble size). This indicates that the average cost associated with each claim is low, as these are generally medications with expired patents that are priced very low.

Again it is interesting to take a look at chronic HCV infection. This is the indication for the drug Harvoni. Both the number of beneficiaries and claims are extremely low compared with other diseases. However, chronic HCV infection reached the second place in the highes total drug spending in 2015.

Next steps

I have shown in this analysis that very interesting insights can be gained from analyzing a smaller set of publicly available data. It follows that a more detailed and deeper analysis could enable more targeted conclusions and recommendations for improving the health care system and the quality of life of patients suffering from disease.
Access to non-public owned data would make even deeper analysis possible.

Additional analysis could include:

Clustering of diseases/indications to higher-level categories (cancer, metabolic disease, circulatory disease, nervous system disease, etc.)
Linking of price databases to compare drug costs for the same indication on a population level
Analysis of population data registers that document life style characteristics of healthy and ill individuals to identify those at risk of developing high-cost diseases (e.g. Medical Expenditure Panel Survey, Behavioral Risk Factor Surveillance System data)

Limitations

One limitation from this analysis is that only Part D drugs were considered. A further analysis could include Part B drugs too.

Moreover it was assumed that the fuzzy logic matching was successful in most cases. A more detailed test is required to assess match success more stringently.

All conclusions are only valid for the 2011-2015 interval. No data for 2016 was analyzed.

	Unnamed: 0	Brand Name	Generic Name	Claim Count	Total Spending	Beneficiary Count	Total Annual Spending Per User	Unit Count	Average Cost Per Unit (Weighted)	Beneficiary Count No LIS	Average Beneficiary Cost Share No LIS	Beneficiary Count LIS	Average Beneficiary Cost Share LIS	Year	Matched Drug Name	Indication
0	0	10 wash	sulfacetamide sodium	24.0	1569.19	16.0	98.074375	5170.0	0.303518	NaN	NaN	NaN	NaN	2011	sulfacetamide	Acne vulgaris
1	1	1st tier unifine pentips	pen needle, diabetic	2472.0	57666.73	893.0	64.576405	293160.0	0.196766	422.0	42.347204	471.0	7.54586	2011	NaN	NaN
2	2	1st tier unifine pentips plus	pen needle, diabetic	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2011	NaN	NaN
3	3	60pse-400gfn-20dm	guaifenesin/dm/pseudoephedrine	12.0	350.10	11.0	31.827273	497.0	0.704427	NaN	NaN	NaN	NaN	2011	pseudoephedrine	Nasal congestion
4	4	8-mop	methoxsalen	11.0	9003.26	NaN	NaN	298.0	30.212282	NaN	NaN	NaN	NaN	2011	methoxsalen	Cutaneous T-cell lymphoma

	Unnamed: 0	TTDDRUGID	LNM	Indication	ICD9	ICD10
0	0	DAP000001	quetiapine	Schizophrenia	295, 710.0	F20, M32
1	1	DAP000002	theophylline	Chronic obstructive pulmonary disease	490-492, 494-496	J40-J44, J47
2	2	DAP000003	risperidone	Schizophrenia	295, 710.0	F20, M32
3	3	DAP000004	dasatinib	Chronic myelogenous leukemia	205.1, 208.9	C91-C95, C92.1
4	4	DAP000004	dasatinib	Solid tumours; Multiple myeloma	140-199, 203.0, 210-229	C00-C75, C7A, C7B, C90.0, D10-D36, D3A

	Unnamed: 0	Claim Count	Total Spending	Beneficiary Count	Total Annual Spending Per User	Unit Count	Average Cost Per Unit (Weighted)	Beneficiary Count No LIS	Average Beneficiary Cost Share No LIS	Beneficiary Count LIS	Average Beneficiary Cost Share LIS	Year
Indication
Diabetes mellitus	1619475	367700954.0	5.360758e+10	73949096.0	562815.013217	2.494315e+10	7277.339052	40704222.0	96730.900344	33243845.0	6840.136642	1449360
Schizophrenia	589890	120108011.0	3.029475e+10	16787537.0	665302.614794	5.232045e+09	23737.078668	4132330.0	75431.535074	12655011.0	2733.240495	503250
Chronic obstructive pulmonary disease	260320	85571788.0	2.668149e+10	18010399.0	117832.051227	5.169355e+09	1525.772351	8891428.0	16115.553801	9118941.0	953.965398	201300
Pain	2076820	449297282.0	2.237135e+10	125509481.0	635933.153421	3.582438e+10	8484.565888	68195577.0	86753.860239	57310277.0	5810.508368	1660725
Hypertension	1241330	659834372.0	2.140793e+10	127524840.0	453862.885999	3.924869e+10	2875.275230	84758076.0	99376.738594	42766156.0	8392.438238	1338645

	Unnamed: 0	Claim Count	Total Spending	Beneficiary Count	Total Annual Spending Per User	Unit Count	Average Cost Per Unit (Weighted)	Beneficiary Count No LIS	Average Beneficiary Cost Share No LIS	Beneficiary Count LIS	Average Beneficiary Cost Share LIS	Year
Brand Name
lantus/lantus solostar	10560	40959410.0	1.419734e+10	7627126.0	9059.358978	7.905816e+08	87.058010	3685935.0	1757.877943	3941191.0	119.054297	10065
nexium	13330	37338541.0	1.129409e+10	6968266.0	8159.874433	1.624007e+09	35.029401	2808631.0	1282.120804	4159635.0	108.193170	10065
crestor	4670	43304032.0	1.084924e+10	8312848.0	6460.499385	1.917524e+09	27.902605	5275631.0	1477.330224	3037217.0	126.576936	10065
advair diskus	385	30806126.0	1.036056e+10	7096159.0	7313.316443	2.273054e+09	22.805665	3613170.0	1284.084879	3482989.0	98.586905	10065
abilify	45	12506518.0	9.434570e+09	1861785.0	25165.999223	3.818410e+08	127.685877	333884.0	2541.346071	1527901.0	101.829169	10065

	Unnamed: 0	Claim Count	Total Spending	Beneficiary Count	Total Annual Spending Per User	Unit Count	Average Cost Per Unit (Weighted)	Beneficiary Count No LIS	Average Beneficiary Cost Share No LIS	Beneficiary Count LIS	Average Beneficiary Cost Share LIS	Year
Indication
Diabetes mellitus	323895	80808515.0	1.538882e+10	16756712.0	179199.546237	5.861327e+09	2112.287818	9688833.0	24716.663299	7067683.0	1459.860044	290160
Chronic HCV infection	5421	272915.0	8.349020e+09	90487.0	182144.098903	7.546096e+06	2134.680173	30454.0	10701.886971	60033.0	156.351475	4030
Chronic obstructive pulmonary disease	52064	17764181.0	6.756824e+09	3803356.0	30714.837787	1.067257e+09	561.945076	1931374.0	4177.371552	1871982.0	232.217985	40300
Schizophrenia	117978	25030047.0	5.468897e+09	3493417.0	192134.968549	1.084651e+09	8292.122415	938911.0	19320.941152	2554333.0	573.729378	100750
Pain	415364	94109025.0	4.956161e+09	27047833.0	164077.757502	7.597111e+09	4357.883695	15366894.0	19810.346154	11680708.0	1105.913592	332475

	Brand Name	Total Spending	Year
19770	harvoni	7.030633e+09	2015
20104	lantus/lantus solostar	4.359504e+09	2015
18926	crestor	2.883122e+09	2015
18069	advair diskus	2.270016e+09	2015
21640	spiriva	2.191466e+09	2015
19988	januvia	2.131952e+09	2015
21404	revlimid	2.077425e+09	2015
20658	nexium	2.012921e+09	2015
20291	lyrica	1.766474e+09	2015
19818	humira/humira pen	1.662292e+09	2015