By Shane Chambers & Guillermo Veron
insurance.head()
| age | sex | bmi | children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Data used in this project can be found here. This is a Medical Cost Personal Datasets dataset designed for machine learning exploration from Kaggle.
This dataset contains ~1400 observations, each with 7 dimensions.
Predict the cost of insurance for an individual (Regression)
Search for some underlying structure in the data (Unsupervised clustering)
Regression: Fit LR, KNN, and random forest models using both normalized and unnormalized data
Clustering: K-means clustering
Now we can explore the relationships between variables using seaborn's pairplot:
import seaborn as sns
sns.pairplot(insurance, plot_kws= {'alpha' : .2})
<seaborn.axisgrid.PairGrid at 0x1a1f313588>
We begin to see some interesting trends, namely three distinct groups that appear in the charges vs. age plot. We can try to investigate this by coloring the plots by some of our categorical variables:
sns.pairplot(insurance, hue = 'smoker', plot_kws= {'alpha' : .2})
<seaborn.axisgrid.PairGrid at 0x1a1fba65c0>
Here we begin to see the age vs. charges groups separate, and we see a new trend in the relationship between charges and bmi.
charges: Normalized¶# Visualize and label the point of optimal nearest neighbors
fig = plt.figure(figsize=(15, 10))
sns.lineplot(x = model_opt['num_of_neighbors'],
y = model_opt['MSE'])
plt.text(knn_min_X_coord + 3, knn_min_Y_coord, 'Optimal Number of Neighbors: ' + str(knn_min_X_coord))
Text(7.0, 3 7.317143e+07 Name: MSE, dtype: float64, 'Optimal Number of Neighbors: 4.0')
# Get mean error
knn_norm_me = np.sqrt(mean_squared_error(y_test, y_pred_nn))
# Visualize
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_test[..., 0], y_test)
sns.scatterplot(X_test[..., 0], y_pred_nn)
plt.text(.4, 50000, 'ME: ' + str(round(knn_norm_me)))
Text(0.4, 50000, 'ME: 8554.0')
As we can see, our model does a good job predicting charges in the bottom 'group', however has trouble deciding the y variable in the two groups above it with higher charges. Now, let's try this same approach but with the unnormalized data:
charges: Unnormalized¶# Write a for loop to iterate through all possible # of neighbors and stor MSE for each iteration
num_of_neighbors_unnorm = []
knn_MSE_unnorm = []
for i in range(1, X_unnorm_test.shape[0]):
nnr = KNeighborsRegressor(n_neighbors=i)
nnr.fit(X_unnorm_train, y_unnorm_train)
knn_preds = nnr.predict(X_unnorm_test)
MSE = mean_squared_error(y_unnorm_test, knn_preds)
num_of_neighbors_unnorm.append(i)
knn_MSE_unnorm.append(MSE)
# Store the # of neighbors and corresponding MSE as a pandas dataframe
model_opt_unnorm = pd.DataFrame({
'num_of_neighbors' : num_of_neighbors_unnorm,
'MSE' : knn_MSE_unnorm
})
# Find the # of neighbors corresponding with the lowest MSE
knn_min_X_coord_unnorm = model_opt_unnorm[model_opt_unnorm['MSE'] == model_opt_unnorm['MSE'].min()].iloc[0]['num_of_neighbors']
# Find the lowest MSE for plotting purposes
knn_min_Y_coord_unnorm = model_opt_unnorm[model_opt_unnorm['MSE'] == model_opt_unnorm['MSE'].min()]['MSE']
# Instantiate, fit, predict
nnr = KNeighborsRegressor(n_neighbors=int(knn_min_X_coord_unnorm))
nnr.fit(X_unnorm_train, y_unnorm_train)
y_pred_nn_unnorm = nnr.predict(X_unnorm_test)
# Get mean error
knn_unnorm_me = np.sqrt(mean_squared_error(y_unnorm_test, y_pred_nn_unnorm))
# visualize
plt.close()
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_unnorm_test['age'], y_unnorm_test)
sns.scatterplot(X_unnorm_test['age'], y_pred_nn_unnorm)
plt.text(20, 50000, 'ME: ' + str(round(knn_unnorm_me)))
Text(20, 50000, 'ME: 11086.0')
Here we see that normalization helps the model, with a lower ME than using the unnormalized data. The model still however has trouble predicting higher charges, so we will investigate whether multiple linear regression can outperform KNN in terms of predictive power.
charges: Normalized¶from sklearn.linear_model import LinearRegression
y_model = LinearRegression(fit_intercept=False).fit(X_train, y_train).predict(X_test)
LR_norm_me = np.sqrt(mean_squared_error(y_test, y_model))
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_test[..., 0], y_test)
sns.scatterplot(X_test[..., 0], y_model)
plt.text(.4, 50000, 'ME: ' + str(round(LR_norm_me)))
Text(0.4, 50000, 'ME: 7396.0')
So far, this is the best performing model in therms of ME, however the model still struggles to predict those with higher charges. Below we can investigate what happens when we train a linear regression model on unnormalized data:
charges: Unormalized¶y_unnorm_model = LinearRegression(fit_intercept=False).fit(X_unnorm_train, y_unnorm_train).predict(X_unnorm_test)
x_ax = 'age'
LR_unnorm_me = np.sqrt(mean_squared_error(y_unnorm_test, y_unnorm_model))
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_unnorm_test[x_ax], y_unnorm_test)
sns.scatterplot(X_unnorm_test[x_ax], y_unnorm_model)
plt.text(20, 50000, 'ME: ' + str(round(LR_unnorm_me)))
Text(20, 50000, 'ME: 6109.0')
Interestingly, this model is the best performing one thus far with an ME of 6109. Now, let's try a random forrest approach to see if this model can outperform the previous attempts.
charges: Normalized¶from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
kf = KFold(n_splits=5, shuffle=True, random_state=1000)
# Create dt instance
rf = RandomForestRegressor()
# Create grid search instance
gscv = GridSearchCV(
rf,
{"max_depth": range(1, 16), "n_estimators": range(2, 16)},
cv=kf,
n_jobs=-1
)
gscv.fit(X_train, y_train)
# Get cross-validation data
cv_df = pd.DataFrame(gscv.cv_results_)
piv_df = cv_df.pivot(index="param_max_depth",
columns="param_n_estimators",
values="mean_test_score").round(3)
piv_df.style.background_gradient("nipy_spectral", axis=None)
| param_n_estimators | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| param_max_depth | ||||||||||||||
| 1 | 0.602 | 0.601 | 0.604 | 0.602 | 0.602 | 0.603 | 0.601 | 0.603 | 0.602 | 0.602 | 0.602 | 0.603 | 0.602 | 0.603 |
| 2 | 0.704 | 0.713 | 0.712 | 0.712 | 0.713 | 0.717 | 0.722 | 0.722 | 0.711 | 0.716 | 0.716 | 0.723 | 0.72 | 0.723 |
| 3 | 0.738 | 0.747 | 0.757 | 0.761 | 0.774 | 0.769 | 0.76 | 0.763 | 0.769 | 0.768 | 0.773 | 0.773 | 0.77 | 0.765 |
| 4 | 0.752 | 0.781 | 0.784 | 0.786 | 0.783 | 0.794 | 0.791 | 0.793 | 0.796 | 0.787 | 0.799 | 0.797 | 0.801 | 0.801 |
| 5 | 0.766 | 0.779 | 0.789 | 0.798 | 0.794 | 0.797 | 0.798 | 0.809 | 0.805 | 0.799 | 0.804 | 0.81 | 0.805 | 0.81 |
| 6 | 0.743 | 0.773 | 0.793 | 0.804 | 0.797 | 0.801 | 0.801 | 0.804 | 0.803 | 0.803 | 0.811 | 0.805 | 0.805 | 0.807 |
| 7 | 0.752 | 0.781 | 0.783 | 0.789 | 0.785 | 0.797 | 0.796 | 0.805 | 0.794 | 0.798 | 0.806 | 0.803 | 0.804 | 0.802 |
| 8 | 0.763 | 0.775 | 0.768 | 0.793 | 0.787 | 0.787 | 0.792 | 0.794 | 0.791 | 0.798 | 0.796 | 0.803 | 0.797 | 0.8 |
| 9 | 0.728 | 0.749 | 0.777 | 0.791 | 0.79 | 0.787 | 0.788 | 0.788 | 0.791 | 0.794 | 0.803 | 0.799 | 0.801 | 0.8 |
| 10 | 0.679 | 0.768 | 0.761 | 0.764 | 0.77 | 0.783 | 0.788 | 0.788 | 0.8 | 0.797 | 0.796 | 0.799 | 0.786 | 0.797 |
| 11 | 0.735 | 0.756 | 0.757 | 0.769 | 0.782 | 0.789 | 0.789 | 0.784 | 0.798 | 0.794 | 0.797 | 0.788 | 0.792 | 0.796 |
| 12 | 0.736 | 0.749 | 0.763 | 0.779 | 0.785 | 0.774 | 0.786 | 0.792 | 0.797 | 0.797 | 0.79 | 0.798 | 0.795 | 0.795 |
| 13 | 0.704 | 0.756 | 0.768 | 0.782 | 0.773 | 0.772 | 0.778 | 0.784 | 0.78 | 0.78 | 0.791 | 0.794 | 0.794 | 0.795 |
| 14 | 0.688 | 0.745 | 0.748 | 0.765 | 0.781 | 0.781 | 0.787 | 0.78 | 0.788 | 0.788 | 0.79 | 0.792 | 0.794 | 0.797 |
| 15 | 0.725 | 0.737 | 0.759 | 0.765 | 0.771 | 0.768 | 0.782 | 0.786 | 0.786 | 0.802 | 0.788 | 0.794 | 0.79 | 0.796 |
Here we can programatically extract the optimal n_estimators and max_depth from our hyperparameter tuning corresponding with the grid above:
rf = RandomForestRegressor(max_depth = opt_max_depth_norm, n_estimators = opt_n_est_norm)
# Fit
rf.fit(X_train, y_train)
# Predict
y_predicted = rf.predict(X_test)
# Test/Visualize
RF_norm_me = np.sqrt(mean_squared_error(y_test, y_predicted))
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_test[..., 0], y_test)
sns.scatterplot(X_test[..., 0], y_predicted)
plt.text(.4, 50000, 'ME: ' + str(round(RF_norm_me)))
Text(0.4, 50000, 'ME: 4692.0')
Here we have a new lowest ME, ande we can see that the model is able to begin to predict the charges variable as the number gets higher,a task previous models were unable to handle.
Finally, we can attempt to repeat this same workflow with random Forest modeling with the unnormalized data and compare models
charges: Unnormalized¶kf = KFold(n_splits=5, shuffle=True, random_state=1000)
# Create dt instance
rf = RandomForestRegressor()
# Create grid search instance
gscv = GridSearchCV(
rf,
{"max_depth": range(1, 16), "n_estimators": range(2, 16)},
cv=kf,
n_jobs=-1
)
gscv.fit(X_unnorm_train, y_unnorm_train)
# Get cross-validation data
cv_df = pd.DataFrame(gscv.cv_results_)
piv_df = cv_df.pivot(index="param_max_depth",
columns="param_n_estimators",
values="mean_test_score").round(3)
piv_df.style.background_gradient("nipy_spectral", axis=None)
/Users/sac/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal. DeprecationWarning)
| param_n_estimators | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| param_max_depth | ||||||||||||||
| 1 | 0.603 | 0.602 | 0.602 | 0.602 | 0.603 | 0.602 | 0.601 | 0.602 | 0.603 | 0.603 | 0.603 | 0.603 | 0.602 | 0.602 |
| 2 | 0.812 | 0.81 | 0.811 | 0.812 | 0.812 | 0.812 | 0.811 | 0.812 | 0.815 | 0.813 | 0.812 | 0.813 | 0.813 | 0.81 |
| 3 | 0.831 | 0.836 | 0.836 | 0.835 | 0.834 | 0.834 | 0.837 | 0.837 | 0.836 | 0.839 | 0.839 | 0.839 | 0.837 | 0.837 |
| 4 | 0.82 | 0.829 | 0.835 | 0.828 | 0.832 | 0.834 | 0.835 | 0.833 | 0.836 | 0.834 | 0.835 | 0.834 | 0.837 | 0.835 |
| 5 | 0.815 | 0.816 | 0.829 | 0.823 | 0.827 | 0.834 | 0.826 | 0.83 | 0.829 | 0.828 | 0.828 | 0.827 | 0.83 | 0.834 |
| 6 | 0.797 | 0.79 | 0.804 | 0.818 | 0.811 | 0.821 | 0.819 | 0.82 | 0.822 | 0.828 | 0.828 | 0.825 | 0.821 | 0.826 |
| 7 | 0.793 | 0.793 | 0.813 | 0.802 | 0.8 | 0.808 | 0.813 | 0.806 | 0.808 | 0.808 | 0.813 | 0.812 | 0.817 | 0.82 |
| 8 | 0.737 | 0.778 | 0.785 | 0.803 | 0.784 | 0.801 | 0.798 | 0.808 | 0.814 | 0.809 | 0.812 | 0.815 | 0.811 | 0.809 |
| 9 | 0.738 | 0.773 | 0.795 | 0.787 | 0.79 | 0.801 | 0.801 | 0.799 | 0.805 | 0.798 | 0.805 | 0.804 | 0.801 | 0.809 |
| 10 | 0.757 | 0.769 | 0.773 | 0.773 | 0.788 | 0.791 | 0.793 | 0.799 | 0.809 | 0.797 | 0.801 | 0.802 | 0.794 | 0.806 |
| 11 | 0.754 | 0.77 | 0.782 | 0.772 | 0.787 | 0.791 | 0.793 | 0.789 | 0.803 | 0.8 | 0.797 | 0.801 | 0.799 | 0.802 |
| 12 | 0.744 | 0.756 | 0.777 | 0.796 | 0.773 | 0.798 | 0.801 | 0.797 | 0.793 | 0.803 | 0.801 | 0.8 | 0.796 | 0.801 |
| 13 | 0.76 | 0.749 | 0.768 | 0.776 | 0.786 | 0.79 | 0.799 | 0.791 | 0.799 | 0.803 | 0.807 | 0.798 | 0.799 | 0.798 |
| 14 | 0.739 | 0.752 | 0.774 | 0.776 | 0.796 | 0.782 | 0.801 | 0.796 | 0.805 | 0.795 | 0.784 | 0.804 | 0.794 | 0.794 |
| 15 | 0.725 | 0.759 | 0.777 | 0.783 | 0.777 | 0.803 | 0.781 | 0.792 | 0.792 | 0.8 | 0.799 | 0.801 | 0.8 | 0.806 |
rf = RandomForestRegressor(max_depth = opt_max_depth_unnorm, n_estimators = opt_n_est_unnorm)
# Fit
rf.fit(X_unnorm_train, y_unnorm_train)
# Predict
y_predicted_unnorm = rf.predict(X_unnorm_test)
# Test/Visualize
RF_unnorm_me = np.sqrt(mean_squared_error(y_unnorm_test, y_predicted_unnorm))
fig = plt.figure(figsize=(15, 10))
sns.scatterplot(X_unnorm_test[x_ax], y_unnorm_test)
sns.scatterplot(X_unnorm_test[x_ax], y_predicted_unnorm)
plt.text(20, 50000, 'ME: ' + str(round(RF_unnorm_me)))
Text(20, 50000, 'ME: 4573.0')
This model performs slightly better than the model using normalized data, however this iteration is clearly able to predict charges based on those two unidentified 'groups' we have been referring to.
The table below summarizes the mean error of each model created:
model_summary = pd.DataFrame({'Model' : ['KNN', 'KNN', 'Linear Regression', 'Linear Regression', 'Random Forest', 'Random Forest'],
'Normalized' : ['Y', 'N', 'Y', 'N', 'Y', 'N'],
'Score' : [knn_norm_me, knn_unnorm_me, LR_norm_me, LR_unnorm_me, RF_norm_me, RF_unnorm_me]})
model_summary
| Model | Normalized | Score | |
|---|---|---|---|
| 0 | KNN | Y | 8554.030009 |
| 1 | KNN | N | 11085.879483 |
| 2 | Linear Regression | Y | 7396.245246 |
| 3 | Linear Regression | N | 6108.831746 |
| 4 | Random Forest | Y | 4692.312043 |
| 5 | Random Forest | N | 4572.651646 |
We've found an interesting relationship between age and charges, which our models have struggled to handle. This may mean that there are groups within these data not listed in the variables given. To explore this, we will use unsupervised learning to determine the optimal number of clusters to see if there is any hidden structure in our data.
from sklearn.cluster import KMeans
insur_clustering = pd.get_dummies(insurance.copy())
insur_norm_clustering = normalize(insur_clustering)
def plot_elbow(dataset, max_clusters):
"""Plot elbow curve for k-means."""
inertias = []
for i in range(1, max_clusters + 1):
kmeans = KMeans(n_clusters=i, random_state=768797)
kmeans.fit(dataset)
inertias.append(kmeans.inertia_)
plt.plot(range(1, max_clusters + 1), inertias)
plt.title("Elbow Plot")
plt.xlabel("K")
plt.ylabel("SSD")
plot_elbow(insur_norm_clustering, 10)
cluster_data = insurance.copy()
# Instantiate
model = KMeans(
n_clusters=2,
random_state=1000,
)
# fit and predict
cluster_labels = model.fit_predict(pd.get_dummies(insurance))
#Make the cluster group its own column in the pandas df
cluster_data['group'] = cluster_labels
cluster_data['group'] = cluster_data['group'].replace([0, 1], ['g1', 'g2'])
#Re-make the pairplot, coloring by these clusters
sns.pairplot(cluster_data, hue = 'group', plot_kws= {'alpha' : .2})
<seaborn.axisgrid.PairGrid at 0x1a1d0d59b0>
We don't see any convincing grouping here, so we can try with k = 3 below:
cluster_data = insurance.copy()
# Instantiate
model = KMeans(
n_clusters=3,
random_state=1000,
)
# fit and predict
cluster_labels = model.fit_predict(pd.get_dummies(cluster_data))
#Make the cluster group its own column in the pandas df
cluster_data['group'] = cluster_labels
cluster_data['group'] = cluster_data['group'].replace([0, 1, 2], ['g1', 'g2', 'g3'])
#Re-make the pairplot, coloring by these clusters
sns.pairplot(cluster_data, hue = 'group', plot_kws= {'alpha' : .2})
<seaborn.axisgrid.PairGrid at 0x1a21c3e710>
It has become clear that the groups are being defined simply based on splitting the charges variable into 3 groups, perhaps because the scale of this variable is so large. Moving forward, we should repeat this clustering with normalized variables.
We were successfully able to predict the charges variable. This Would be useful information for insurance companies to tailor their charges based on an individuals demographic data
The K-means clustering seemed to be dominated by the charges variable, possibly because of it's scale. For better results, this could be repeated with scaled data, or possibly another clustering algorithm such as DBSCAN.