Predicting Insurance Cost with Demographic Data

By Shane Chambers & Guillermo Veron

In [38]:
insurance.head()
Out[38]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

Data used in this project can be found here. This is a Medical Cost Personal Datasets dataset designed for machine learning exploration from Kaggle.

This dataset contains ~1400 observations, each with 7 dimensions.

Goal:

Predict the cost of insurance for an individual (Regression)

Search for some underlying structure in the data (Unsupervised clustering)

Approach:

Regression: Fit LR, KNN, and random forest models using both normalized and unnormalized data

Clustering: K-means clustering

Relationships Between Variables

Now we can explore the relationships between variables using seaborn's pairplot:

In [49]:
import seaborn as sns

sns.pairplot(insurance, plot_kws= {'alpha' : .2})
Out[49]:
<seaborn.axisgrid.PairGrid at 0x1a1f313588>

We begin to see some interesting trends, namely three distinct groups that appear in the charges vs. age plot. We can try to investigate this by coloring the plots by some of our categorical variables:

Pairplot Colored by Smoking Status:

In [50]:
sns.pairplot(insurance, hue = 'smoker', plot_kws= {'alpha' : .2})
Out[50]:
<seaborn.axisgrid.PairGrid at 0x1a1fba65c0>

Here we begin to see the age vs. charges groups separate, and we see a new trend in the relationship between charges and bmi.

Predict Charges With Regression: Methods

  1. Encode data, train-test-split (both normalized and unnormalized)
  2. KNN (hyper-parameter tuning of # of neighbors, K)
  3. Linear Regression (no hyper-parameter tuning)
  4. Random Forest (hyper-parameter tuning of n_estimators and max depth)

Results: KNN Prediction of charges: Normalized

In [56]:
# Visualize and label the point of optimal nearest neighbors
fig = plt.figure(figsize=(15, 10))
sns.lineplot(x = model_opt['num_of_neighbors'],
               y = model_opt['MSE'])

plt.text(knn_min_X_coord + 3, knn_min_Y_coord, 'Optimal Number of Neighbors: ' + str(knn_min_X_coord))
Out[56]:
Text(7.0, 3    7.317143e+07
Name: MSE, dtype: float64, 'Optimal Number of Neighbors: 4.0')
In [58]:
# Get mean error
knn_norm_me = np.sqrt(mean_squared_error(y_test, y_pred_nn))

# Visualize
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_test[..., 0], y_test)
sns.scatterplot(X_test[..., 0], y_pred_nn)
plt.text(.4, 50000, 'ME: ' + str(round(knn_norm_me)))
Out[58]:
Text(0.4, 50000, 'ME: 8554.0')

As we can see, our model does a good job predicting charges in the bottom 'group', however has trouble deciding the y variable in the two groups above it with higher charges. Now, let's try this same approach but with the unnormalized data:

Results: KNN Prediction of charges: Unnormalized

In [60]:
# Write a for loop to iterate through all possible # of neighbors and stor MSE for each iteration
num_of_neighbors_unnorm = []
knn_MSE_unnorm = []

for i in range(1, X_unnorm_test.shape[0]):
    nnr = KNeighborsRegressor(n_neighbors=i)
    nnr.fit(X_unnorm_train, y_unnorm_train)
    knn_preds = nnr.predict(X_unnorm_test)
    MSE = mean_squared_error(y_unnorm_test, knn_preds)
    
    num_of_neighbors_unnorm.append(i)
    knn_MSE_unnorm.append(MSE)
    
# Store the # of neighbors and corresponding MSE as a pandas dataframe
model_opt_unnorm = pd.DataFrame({
    'num_of_neighbors' : num_of_neighbors_unnorm,
    'MSE' : knn_MSE_unnorm
})

# Find the # of neighbors corresponding with the lowest MSE
knn_min_X_coord_unnorm = model_opt_unnorm[model_opt_unnorm['MSE'] == model_opt_unnorm['MSE'].min()].iloc[0]['num_of_neighbors']
# Find the lowest MSE for plotting purposes
knn_min_Y_coord_unnorm = model_opt_unnorm[model_opt_unnorm['MSE'] == model_opt_unnorm['MSE'].min()]['MSE']

# Instantiate, fit, predict
nnr = KNeighborsRegressor(n_neighbors=int(knn_min_X_coord_unnorm))
nnr.fit(X_unnorm_train, y_unnorm_train)
y_pred_nn_unnorm = nnr.predict(X_unnorm_test)

# Get mean error
knn_unnorm_me = np.sqrt(mean_squared_error(y_unnorm_test, y_pred_nn_unnorm))

# visualize
plt.close()
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_unnorm_test['age'], y_unnorm_test)
sns.scatterplot(X_unnorm_test['age'], y_pred_nn_unnorm)
plt.text(20, 50000, 'ME: ' + str(round(knn_unnorm_me)))
Out[60]:
Text(20, 50000, 'ME: 11086.0')

Here we see that normalization helps the model, with a lower ME than using the unnormalized data. The model still however has trouble predicting higher charges, so we will investigate whether multiple linear regression can outperform KNN in terms of predictive power.

Results: Linear Regression Prediction of charges: Normalized

In [61]:
from sklearn.linear_model import LinearRegression

y_model = LinearRegression(fit_intercept=False).fit(X_train, y_train).predict(X_test)

LR_norm_me = np.sqrt(mean_squared_error(y_test, y_model))

fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_test[..., 0], y_test)
sns.scatterplot(X_test[..., 0], y_model)
plt.text(.4, 50000, 'ME: ' + str(round(LR_norm_me)))
Out[61]:
Text(0.4, 50000, 'ME: 7396.0')

So far, this is the best performing model in therms of ME, however the model still struggles to predict those with higher charges. Below we can investigate what happens when we train a linear regression model on unnormalized data:

Results: Linear Regression Prediction of charges: Unormalized

In [62]:
y_unnorm_model = LinearRegression(fit_intercept=False).fit(X_unnorm_train, y_unnorm_train).predict(X_unnorm_test)

x_ax = 'age'

LR_unnorm_me = np.sqrt(mean_squared_error(y_unnorm_test, y_unnorm_model))

fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_unnorm_test[x_ax], y_unnorm_test)
sns.scatterplot(X_unnorm_test[x_ax], y_unnorm_model)
plt.text(20, 50000, 'ME: ' + str(round(LR_unnorm_me)))
Out[62]:
Text(20, 50000, 'ME: 6109.0')

Interestingly, this model is the best performing one thus far with an ME of 6109. Now, let's try a random forrest approach to see if this model can outperform the previous attempts.

Results: Random Forest Prediction of charges: Normalized

In [63]:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

kf = KFold(n_splits=5, shuffle=True, random_state=1000)
# Create dt instance 
rf = RandomForestRegressor()
# Create grid search instance 
gscv = GridSearchCV(
    rf, 
    {"max_depth": range(1, 16), "n_estimators": range(2, 16)}, 
    cv=kf, 
    n_jobs=-1
)
gscv.fit(X_train, y_train)
# Get cross-validation data

cv_df = pd.DataFrame(gscv.cv_results_)
piv_df = cv_df.pivot(index="param_max_depth",
           columns="param_n_estimators",
           values="mean_test_score").round(3)
piv_df.style.background_gradient("nipy_spectral", axis=None)
Out[63]:
param_n_estimators 2 3 4 5 6 7 8 9 10 11 12 13 14 15
param_max_depth
1 0.602 0.601 0.604 0.602 0.602 0.603 0.601 0.603 0.602 0.602 0.602 0.603 0.602 0.603
2 0.704 0.713 0.712 0.712 0.713 0.717 0.722 0.722 0.711 0.716 0.716 0.723 0.72 0.723
3 0.738 0.747 0.757 0.761 0.774 0.769 0.76 0.763 0.769 0.768 0.773 0.773 0.77 0.765
4 0.752 0.781 0.784 0.786 0.783 0.794 0.791 0.793 0.796 0.787 0.799 0.797 0.801 0.801
5 0.766 0.779 0.789 0.798 0.794 0.797 0.798 0.809 0.805 0.799 0.804 0.81 0.805 0.81
6 0.743 0.773 0.793 0.804 0.797 0.801 0.801 0.804 0.803 0.803 0.811 0.805 0.805 0.807
7 0.752 0.781 0.783 0.789 0.785 0.797 0.796 0.805 0.794 0.798 0.806 0.803 0.804 0.802
8 0.763 0.775 0.768 0.793 0.787 0.787 0.792 0.794 0.791 0.798 0.796 0.803 0.797 0.8
9 0.728 0.749 0.777 0.791 0.79 0.787 0.788 0.788 0.791 0.794 0.803 0.799 0.801 0.8
10 0.679 0.768 0.761 0.764 0.77 0.783 0.788 0.788 0.8 0.797 0.796 0.799 0.786 0.797
11 0.735 0.756 0.757 0.769 0.782 0.789 0.789 0.784 0.798 0.794 0.797 0.788 0.792 0.796
12 0.736 0.749 0.763 0.779 0.785 0.774 0.786 0.792 0.797 0.797 0.79 0.798 0.795 0.795
13 0.704 0.756 0.768 0.782 0.773 0.772 0.778 0.784 0.78 0.78 0.791 0.794 0.794 0.795
14 0.688 0.745 0.748 0.765 0.781 0.781 0.787 0.78 0.788 0.788 0.79 0.792 0.794 0.797
15 0.725 0.737 0.759 0.765 0.771 0.768 0.782 0.786 0.786 0.802 0.788 0.794 0.79 0.796

Here we can programatically extract the optimal n_estimators and max_depth from our hyperparameter tuning corresponding with the grid above:

In [65]:
rf = RandomForestRegressor(max_depth = opt_max_depth_norm, n_estimators = opt_n_est_norm)

# Fit 
rf.fit(X_train, y_train)

# Predict
y_predicted = rf.predict(X_test)

# Test/Visualize
RF_norm_me = np.sqrt(mean_squared_error(y_test, y_predicted))

fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_test[..., 0], y_test)
sns.scatterplot(X_test[..., 0], y_predicted)
plt.text(.4, 50000, 'ME: ' + str(round(RF_norm_me)))
Out[65]:
Text(0.4, 50000, 'ME: 4692.0')

Here we have a new lowest ME, ande we can see that the model is able to begin to predict the charges variable as the number gets higher,a task previous models were unable to handle.

Finally, we can attempt to repeat this same workflow with random Forest modeling with the unnormalized data and compare models

Results: Random Forest Prediction of charges: Unnormalized

In [66]:
kf = KFold(n_splits=5, shuffle=True, random_state=1000)
# Create dt instance 
rf = RandomForestRegressor()
# Create grid search instance 
gscv = GridSearchCV(
    rf, 
    {"max_depth": range(1, 16), "n_estimators": range(2, 16)}, 
    cv=kf, 
    n_jobs=-1
)
gscv.fit(X_unnorm_train, y_unnorm_train)
# Get cross-validation data
cv_df = pd.DataFrame(gscv.cv_results_)
piv_df = cv_df.pivot(index="param_max_depth",
           columns="param_n_estimators",
           values="mean_test_score").round(3)
piv_df.style.background_gradient("nipy_spectral", axis=None)
/Users/sac/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Out[66]:
param_n_estimators 2 3 4 5 6 7 8 9 10 11 12 13 14 15
param_max_depth
1 0.603 0.602 0.602 0.602 0.603 0.602 0.601 0.602 0.603 0.603 0.603 0.603 0.602 0.602
2 0.812 0.81 0.811 0.812 0.812 0.812 0.811 0.812 0.815 0.813 0.812 0.813 0.813 0.81
3 0.831 0.836 0.836 0.835 0.834 0.834 0.837 0.837 0.836 0.839 0.839 0.839 0.837 0.837
4 0.82 0.829 0.835 0.828 0.832 0.834 0.835 0.833 0.836 0.834 0.835 0.834 0.837 0.835
5 0.815 0.816 0.829 0.823 0.827 0.834 0.826 0.83 0.829 0.828 0.828 0.827 0.83 0.834
6 0.797 0.79 0.804 0.818 0.811 0.821 0.819 0.82 0.822 0.828 0.828 0.825 0.821 0.826
7 0.793 0.793 0.813 0.802 0.8 0.808 0.813 0.806 0.808 0.808 0.813 0.812 0.817 0.82
8 0.737 0.778 0.785 0.803 0.784 0.801 0.798 0.808 0.814 0.809 0.812 0.815 0.811 0.809
9 0.738 0.773 0.795 0.787 0.79 0.801 0.801 0.799 0.805 0.798 0.805 0.804 0.801 0.809
10 0.757 0.769 0.773 0.773 0.788 0.791 0.793 0.799 0.809 0.797 0.801 0.802 0.794 0.806
11 0.754 0.77 0.782 0.772 0.787 0.791 0.793 0.789 0.803 0.8 0.797 0.801 0.799 0.802
12 0.744 0.756 0.777 0.796 0.773 0.798 0.801 0.797 0.793 0.803 0.801 0.8 0.796 0.801
13 0.76 0.749 0.768 0.776 0.786 0.79 0.799 0.791 0.799 0.803 0.807 0.798 0.799 0.798
14 0.739 0.752 0.774 0.776 0.796 0.782 0.801 0.796 0.805 0.795 0.784 0.804 0.794 0.794
15 0.725 0.759 0.777 0.783 0.777 0.803 0.781 0.792 0.792 0.8 0.799 0.801 0.8 0.806
In [68]:
rf = RandomForestRegressor(max_depth = opt_max_depth_unnorm, n_estimators = opt_n_est_unnorm)

# Fit 
rf.fit(X_unnorm_train, y_unnorm_train)

# Predict
y_predicted_unnorm = rf.predict(X_unnorm_test)

# Test/Visualize
RF_unnorm_me = np.sqrt(mean_squared_error(y_unnorm_test, y_predicted_unnorm))

fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_unnorm_test[x_ax], y_unnorm_test)
sns.scatterplot(X_unnorm_test[x_ax], y_predicted_unnorm)
plt.text(20, 50000, 'ME: ' + str(round(RF_unnorm_me)))
Out[68]:
Text(20, 50000, 'ME: 4573.0')

This model performs slightly better than the model using normalized data, however this iteration is clearly able to predict charges based on those two unidentified 'groups' we have been referring to.

Summary of Models

The table below summarizes the mean error of each model created:

In [69]:
model_summary = pd.DataFrame({'Model' : ['KNN', 'KNN', 'Linear Regression', 'Linear Regression', 'Random Forest', 'Random Forest'], 
                              'Normalized' : ['Y', 'N', 'Y', 'N', 'Y', 'N'],
                             'Score' : [knn_norm_me, knn_unnorm_me, LR_norm_me, LR_unnorm_me, RF_norm_me, RF_unnorm_me]})
model_summary
Out[69]:
Model Normalized Score
0 KNN Y 8554.030009
1 KNN N 11085.879483
2 Linear Regression Y 7396.245246
3 Linear Regression N 6108.831746
4 Random Forest Y 4692.312043
5 Random Forest N 4572.651646

Clustering with unsupervised learning

We've found an interesting relationship between age and charges, which our models have struggled to handle. This may mean that there are groups within these data not listed in the variables given. To explore this, we will use unsupervised learning to determine the optimal number of clusters to see if there is any hidden structure in our data.

In [70]:
from sklearn.cluster import KMeans

insur_clustering = pd.get_dummies(insurance.copy())
insur_norm_clustering = normalize(insur_clustering)

def plot_elbow(dataset, max_clusters):
    """Plot elbow curve for k-means."""
    inertias = []
    for i in range(1, max_clusters + 1):
        kmeans = KMeans(n_clusters=i, random_state=768797)
        kmeans.fit(dataset)
        inertias.append(kmeans.inertia_)

    plt.plot(range(1, max_clusters + 1), inertias)
    plt.title("Elbow Plot")
    plt.xlabel("K")
    plt.ylabel("SSD")

plot_elbow(insur_norm_clustering, 10)
In [71]:
cluster_data = insurance.copy()

# Instantiate
model = KMeans(
    n_clusters=2,
    random_state=1000,
)
# fit and predict
cluster_labels = model.fit_predict(pd.get_dummies(insurance))

#Make the cluster group its own column in the pandas df
cluster_data['group'] = cluster_labels
cluster_data['group'] = cluster_data['group'].replace([0, 1], ['g1', 'g2'])

#Re-make the pairplot, coloring by these clusters
sns.pairplot(cluster_data, hue = 'group', plot_kws= {'alpha' : .2})
Out[71]:
<seaborn.axisgrid.PairGrid at 0x1a1d0d59b0>

We don't see any convincing grouping here, so we can try with k = 3 below:

In [72]:
cluster_data = insurance.copy()

# Instantiate
model = KMeans(
    n_clusters=3,
    random_state=1000,
)
# fit and predict
cluster_labels = model.fit_predict(pd.get_dummies(cluster_data))

#Make the cluster group its own column in the pandas df
cluster_data['group'] = cluster_labels
cluster_data['group'] = cluster_data['group'].replace([0, 1, 2], ['g1', 'g2', 'g3'])

#Re-make the pairplot, coloring by these clusters
sns.pairplot(cluster_data, hue = 'group', plot_kws= {'alpha' : .2})
Out[72]:
<seaborn.axisgrid.PairGrid at 0x1a21c3e710>

It has become clear that the groups are being defined simply based on splitting the charges variable into 3 groups, perhaps because the scale of this variable is so large. Moving forward, we should repeat this clustering with normalized variables.

Conclusion/Discussion

We were successfully able to predict the charges variable. This Would be useful information for insurance companies to tailor their charges based on an individuals demographic data

The K-means clustering seemed to be dominated by the charges variable, possibly because of it's scale. For better results, this could be repeated with scaled data, or possibly another clustering algorithm such as DBSCAN.

For full project, see final-project.ipynb here