A complete description of the dataset can be found here. The objective of this notebook is to show how a pipeline can be used to screen the predictive power of out-of-the-box machine learning algorithms.
- Read and examine the data
- Preprocessing
- Set up the pipeline of regressors, train, and predict
- Explanation of weights for best performing model
- Cross validate the model
- Conclusion
#data crunching imports
import pandas as pd
import numpy as np
#vis imports
import matplotlib.pyplot as plt
import seaborn as sns
# import plotly.express as px
#splitting of data
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
#metrics and stats
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import norm
#modeling regressors: trees and linear with Ridge
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import Ridge
import lightgbm as lgb
# ML model interpretation with ELI5 package
import eli5 as eli
from eli5.sklearn import PermutationImportance
# suppress warnings - added to keep the output crisp. Please check the versions of the libraries you use for consistency.
import warnings
warnings.filterwarnings('ignore')
#read in data
concrete = pd.read_csv('concrete.csv')
concrete.columns #read columns
Index(['Cement (component 1)(kg in a m^3 mixture)', 'Blast Furnace Slag (component 2)(kg in a m^3 mixture)', 'Fly Ash (component 3)(kg in a m^3 mixture)', 'Water (component 4)(kg in a m^3 mixture)', 'Superplasticizer (component 5)(kg in a m^3 mixture)', 'Coarse Aggregate (component 6)(kg in a m^3 mixture)', 'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)', 'Concrete compressive strength(MPa. megapascals)'], dtype='object')
Create a copy of the data and rename columns in a more readable format.
data = concrete.copy() #create a copy and rename cols
data.columns = ['Cement', 'BFS','FlyAsh','Water','Superplasticizer','CoarseAggr', 'FineAggr', 'Age','CompressiveStrength']
data
|      | Cement | BFS   | FlyAsh | Water | Superplasticizer | CoarseAggr | FineAggr | Age   | CompressiveStrength |
|------|--------|-------|--------|-------|------------------|------------|----------|-------|---------------------|
| 0    | 540.0  | 0.0   | 0.0    | 162.0 | 2.5              | 1040.0     | 676.0    | 28.0  | 79.99               |
| 1    | 540.0  | 0.0   | 0.0    | 162.0 | 2.5              | 1055.0     | 676.0    | 28.0  | 61.89               |
| 2    | 332.5  | 142.5 | 0.0    | 228.0 | 0.0              | 932.0      | 594.0    | 270.0 | 40.27               |
| 3    | 332.5  | 142.5 | 0.0    | 228.0 | 0.0              | 932.0      | 594.0    | 365.0 | 41.05               |
| 4    | 198.6  | 132.4 | 0.0    | 192.0 | 0.0              | 978.4      | 825.5    | 360.0 | 44.30               |
| ...  | ...    | ...   | ...    | ...   | ...              | ...        | ...      | ...   | ...                 |
| 1025 | 276.4  | 116.0 | 90.3   | 179.6 | 8.9              | 870.1      | 768.3    | 28.0  | 44.28               |
| 1026 | 322.2  | 0.0   | 115.6  | 196.0 | 10.4             | 817.9      | 813.4    | 28.0  | 31.18               |
| 1027 | 148.5  | 139.4 | 108.6  | 192.7 | 6.1              | 892.4      | 780.0    | 28.0  | 23.70               |
| 1028 | 159.1  | 186.7 | 0.0    | 175.6 | 11.3             | 989.6      | 788.9    | 28.0  | 32.77               |
| 1029 | 260.9  | 100.5 | 78.3   | 200.6 | 8.6              | 864.5      | 761.5    | 28.0  | 32.40               |
1030 rows × 9 columns
Examine the feature vs feature relationships visually on a scatter matrix.
pd.plotting.scatter_matrix(data, figsize=(20,12), color='r', diagonal='kde');
The only clear trend visible is between Cement and CompressiveStrength. Let's examine the correlation coefficients explicitly. We are looking at a highly non-linear relationship between age, the ingredients, and compressive strength (more details here). A review of the kernel density estimates in the scatter matrix above suggests that outliers are present in the distributions of the measured parameters. Furthermore, non-monotonic relationships cannot be excluded. Therefore, we will use Kendall's $\tau$ correlation method instead of Pearson's.
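As a quick illustration of how the choice of method matters here, the correlations of each feature with the target can be computed with both methods side by side. This is a minimal sketch (not part of the original analysis) that reuses the `data` frame defined above:

```python
# compare Pearson and Kendall correlations of each feature with the target
# (illustrative side-by-side check; the analysis below relies on Kendall's tau)
pearson_corr = data.corr(method='pearson')['CompressiveStrength'].drop('CompressiveStrength')
kendall_corr = data.corr(method='kendall')['CompressiveStrength'].drop('CompressiveStrength')
print(pd.DataFrame({'pearson': pearson_corr, 'kendall': kendall_corr}).round(3))
```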
cor_coefs = data.corr(method='kendall')
The resulting table and heatmap are shown below. The correlations observed are weak or, at best, of medium strength.
cor_coefs.round(3)
|                     | Cement | BFS    | FlyAsh | Water  | Superplasticizer | CoarseAggr | FineAggr | Age    | CompressiveStrength |
|---------------------|--------|--------|--------|--------|------------------|------------|----------|--------|---------------------|
| Cement              | 1.000  | -0.168 | -0.329 | -0.065 | 0.028            | -0.103     | -0.119   | 0.004  | 0.327               |
| BFS                 | -0.168 | 1.000  | -0.203 | 0.036  | 0.078            | -0.247     | -0.220   | -0.015 | 0.119               |
| FlyAsh              | -0.329 | -0.203 | 1.000  | -0.210 | 0.350            | 0.056      | 0.044    | 0.002  | -0.060              |
| Water               | -0.065 | 0.036  | -0.210 | 1.000  | -0.529           | -0.150     | -0.245   | 0.063  | -0.206              |
| Superplasticizer    | 0.028  | 0.078  | 0.350  | -0.529 | 1.000            | -0.139     | 0.121    | -0.006 | 0.250               |
| CoarseAggr          | -0.103 | -0.247 | 0.056  | -0.150 | -0.139           | 1.000      | -0.054   | -0.031 | -0.124              |
| FineAggr            | -0.119 | -0.220 | 0.044  | -0.245 | 0.121            | -0.054     | 1.000    | -0.042 | -0.122              |
| Age                 | 0.004  | -0.015 | 0.002  | 0.063  | -0.006           | -0.031     | -0.042   | 1.000  | 0.449               |
| CompressiveStrength | 0.327  | 0.119  | -0.060 | -0.206 | 0.250            | -0.124     | -0.122   | 0.449  | 1.000               |
plt.figure(figsize=(12,8))
sns.heatmap(cor_coefs.round(3), cmap='coolwarm', annot=True);
# max correlation value and corresponding feature
# correlations of each feature with the target (drop the target's self-correlation)
corr_with_target = cor_coefs['CompressiveStrength'].drop(index=['CompressiveStrength'])
feat_max_corr = corr_with_target.idxmax()  # feature with the strongest correlation
val_max_corr = corr_with_target.max()      # its correlation value
print(f"Max correlation observed for feature {feat_max_corr} with value {np.round(val_max_corr, 3)}")
Max correlation observed for feature Age with value 0.449
data.isna().sum()
Cement 0 BFS 0 FlyAsh 0 Water 0 Superplasticizer 0 CoarseAggr 0 FineAggr 0 Age 0 CompressiveStrength 0 dtype: int64
There are no missing values; therefore, preprocessing is simple: split the response from the predictors (y and X, respectively) and create train and test sets.
y = data['CompressiveStrength'] #response variable (target)
X = data[data.columns[:-1]] #predictors
When splitting into train and test sets we fix the random seed (random_state) so that performance on the train and test sets can be evaluated and compared consistently across runs.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2,
shuffle = True,
random_state = 2021)
Instead of performing a single regression, we regress the data with multiple tree regressors and a simple linear one: XGBoost, LightGBM, random forest, and extra trees, compared against an optimized Ridge with $\alpha = 5$. All tree regressors are used out-of-the-box (i.e. with default parameter values). The objective here is not to optimize the hyperparameters of the regressors but rather to construct a limited auto-ml type analysis that pinpoints the best one.
regressors = [xgb.XGBRegressor(),
lgb.LGBMRegressor(),
RandomForestRegressor(),
ExtraTreesRegressor(),
Ridge(alpha=5.)]
models = ['xgboost', 'lgbm', 'random_forest', 'x_trees', 'ridge'] # model names
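For context, a value such as $\alpha = 5$ for Ridge could be chosen with a small grid search. The sketch below is illustrative only; the grid, scoring, and fold count are assumptions, and the selected value may well differ from 5:

```python
from sklearn.model_selection import GridSearchCV

# hypothetical grid search for Ridge's regularization strength
ridge_search = GridSearchCV(Ridge(),
                            param_grid={'alpha': [0.1, 1., 5., 10., 50.]},
                            scoring='neg_mean_absolute_error',
                            cv=5)
ridge_search.fit(Xtrain, ytrain)
print(ridge_search.best_params_)
```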
Create a list to hold the predictions, train each model/regressor in a loop, and append its predictions to the list.
predictions = [] #list to hold predictions of each model
for regressor in regressors:
modeling = Pipeline(steps = [('regressor', regressor)])
modeling.fit(Xtrain, ytrain)
predictions.append(modeling.predict(Xtest))
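The pipeline above contains a single step, so it mainly provides a uniform fit/predict interface for the screening loop. If preprocessing were needed, extra steps could be prepended; a minimal sketch, assuming feature scaling is wanted in front of the Ridge regressor (the tree models do not need it):

```python
from sklearn.preprocessing import StandardScaler

# hypothetical two-step pipeline: standardize the features, then fit Ridge
scaled_ridge = Pipeline(steps=[('scaler', StandardScaler()),
                               ('regressor', Ridge(alpha=5.))])
scaled_ridge.fit(Xtrain, ytrain)
print(f"Scaled Ridge R2 on the test set: {scaled_ridge.score(Xtest, ytest):.3f}")
```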
Seaborn conveniently gives us the regplot method, which we use to examine measured vs. predicted outcomes. The spread of the points is indicative of the predictive strength of each model.
fig, ax = plt.subplots(1, len(models), figsize = (12,7))
fig.tight_layout()
for i in range(0, len(models)):
sns.regplot(x = ytest, y = predictions[i], ax = ax[i])
ax[i].set_title(models[i])
ax[i].set_xlabel('Measured Comp. Strength [MPa]')
ax[i].set_ylabel('Predicted Comp. Strength [MPa]');
Tree regression (unoptimized) clearly outperforms linear regression. However, the plots alone do not tell us which tree regressor is better, so we turn to statistical metrics.
Let's examine the residuals of each fit and the metrics associated with them. For a good statistical fit we expect the residuals to be normally distributed about a mean of 0. Although we could create Q-Q plots or perform rigorous hypothesis testing for normality, a simple normal fit overlaid on the histogram of observed residuals suffices here. To identify outliers we also create a box plot.
def examine_residuals(ytest, yhat, model_name):
"""
provides a visual check of the residual values:
normality and outliers.
"""
residuals = ytest - yhat
mu, std = norm.fit(residuals)
x_vals = np.linspace(residuals.min(), residuals.max(), 100)
p = norm.pdf(x_vals, mu, std)
fig, ax = plt.subplots(1, 2, figsize = (12, 7))
fig.suptitle("Residuals of modeling with " + model_name, fontsize = 14)
ax[0].hist(residuals, density = True, color = 'b')
ax[0].plot(x_vals, p, linewidth = 2, color = 'orange')
sns.boxplot(x = residuals, ax = ax[1]);
def extract_regression_metrics(ytest, predictions, models):
"""
create a dataframe that holds regression metrics
for each of the regressors
"""
metrics_df = pd.DataFrame()
for i in range(0, len(models)):
metrics_dict = {'r2':r2_score(ytest, predictions[i]).round(3),
'Explained_Var': explained_variance_score(ytest, predictions[i]).round(3),
'MAE':mean_absolute_error(ytest, predictions[i]).round(3),
'MSE':mean_squared_error(ytest, predictions[i]).round(3)}
metrics_dfi = pd.DataFrame(metrics_dict, index = [models[i]])
metrics_df = pd.concat([metrics_dfi,metrics_df])
return metrics_df
for i in range(0, len(models)):
examine_residuals(ytest, predictions[i], model_name=models[i])
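For reference, the more rigorous checks mentioned earlier (a Q-Q plot and a formal normality test) could look like the sketch below. It is applied to the first model's residuals only and is not part of the original analysis:

```python
from scipy import stats

# residuals of the first model in the list (xgboost) as an example
residuals_check = ytest - predictions[0]

# Q-Q plot of the residuals against a normal distribution
fig, ax = plt.subplots(figsize=(6, 6))
stats.probplot(residuals_check, dist='norm', plot=ax)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
stat, p_value = stats.shapiro(residuals_check)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")
```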
metrics = extract_regression_metrics(ytest, predictions, models)
metrics
|               | r2    | Explained_Var | MAE   | MSE     |
|---------------|-------|---------------|-------|---------|
| ridge         | 0.548 | 0.551         | 8.789 | 124.787 |
| x_trees       | 0.908 | 0.908         | 3.221 | 25.563  |
| random_forest | 0.902 | 0.902         | 3.625 | 27.124  |
| lgbm          | 0.927 | 0.927         | 3.069 | 20.253  |
| xgboost       | 0.927 | 0.927         | 2.886 | 20.311  |
Out of all the tree regressors, xgboost seems to perform slightly better based on the mean absolute error (MAE). Let's examine plots of the metrics.
fig, ax = plt.subplots(1, 4, figsize = (16,5))
fig.tight_layout()
for i in range(0, len(metrics.columns)):
ax[i].plot(metrics.index, metrics[metrics.columns[i]], marker = 'o')
ax[i].set_ylabel('')
ax[i].set_xticklabels(list(metrics.index), rotation = 45)
ax[i].set_title(metrics.columns[i]);
Since xgboost seems to be the "winner", we focus on it. First we explain the model's weights, i.e. the contribution of each predictor to the final response.
weights = eli.explain_weights(regressors[0].fit(Xtrain,ytrain))
weights
Weight | Feature |
---|---|
0.2510 | Age |
0.2394 | Cement |
0.1538 | Superplasticizer |
0.1379 | BFS |
0.1007 | Water |
0.0464 | FineAggr |
0.0463 | FlyAsh |
0.0246 | CoarseAggr |
weights_df = eli.format_as_dataframe(weights)
plt.figure(figsize = (12,7))
sns.barplot(x = 'feature', y = 'weight', data = weights_df, palette = 'summer')
plt.grid(axis='y')
perm = PermutationImportance(regressors[0], scoring = 'r2').fit(Xtest, ytest)
eli.show_weights(perm, feature_names = list(Xtest.columns))
Weight | Feature |
---|---|
0.8474 ± 0.1388 | Age |
0.6476 ± 0.1019 | Cement |
0.2088 ± 0.0391 | BFS |
0.1826 ± 0.0292 | Water |
0.1080 ± 0.0245 | Superplasticizer |
0.0421 ± 0.0150 | FineAggr |
0.0298 ± 0.0100 | CoarseAggr |
0.0029 ± 0.0071 | FlyAsh |
Both the model's weights and the permutation feature importance agree that Age, followed by Cement, are the most important features. Age was also the feature most correlated with CompressiveStrength.
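As a cross-check (not part of the original notebook), the same permutation-based importance can be computed with scikit-learn's permutation_importance; the repeat count below is an assumption:

```python
from sklearn.inspection import permutation_importance

# permutation importance of the fitted xgboost model on the test set
perm_skl = permutation_importance(regressors[0], Xtest, ytest,
                                  scoring='r2', n_repeats=10, random_state=2021)
perm_df = pd.DataFrame({'feature': Xtest.columns,
                        'importance_mean': perm_skl.importances_mean.round(4),
                        'importance_std': perm_skl.importances_std.round(4)})
print(perm_df.sort_values('importance_mean', ascending=False))
```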
Due to the stochastic nature of tree models it is always good to cross validate the performance on different parts of the data. Therefore, we split the data into a new pair of train-test subsets, this time increasing the test size to 40% of the total, and cross validate the xgboost regressor over 30 folds on the same metrics as above.
This will show whether we simply got "lucky" with the xgboost performance recorded previously. If not, we have a solid model on which to run predictions.
Note: In reality, LightGBM performs equally well. A rigorous cross validation process would focus on both the XGBoost and LightGBM models; hypothesis testing between the distributions of their cross validated results would then indicate whether there is a true (out-of-the-box) winner, as sketched below.
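A minimal sketch of that comparison is shown below. It cross validates both models on identical folds and applies a paired Wilcoxon signed-rank test to the resulting $R^{2}$ scores; the fold count and the choice of test are assumptions, not part of the original analysis:

```python
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import wilcoxon

# score both models on the same folds so the comparison is paired
folds = KFold(n_splits=10, shuffle=True, random_state=2021)
xgb_scores = cross_val_score(xgb.XGBRegressor(), X, y, scoring='r2', cv=folds)
lgb_scores = cross_val_score(lgb.LGBMRegressor(), X, y, scoring='r2', cv=folds)

# paired non-parametric test: a small p-value would suggest a real difference
stat, p_value = wilcoxon(xgb_scores, lgb_scores)
print(f"median R2: xgboost={np.median(xgb_scores):.3f}, lgbm={np.median(lgb_scores):.3f}, p={p_value:.3f}")
```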
from sklearn.model_selection import cross_validate
Xtrain_, Xtest_, ytrain_, ytest_ = train_test_split(X, y, test_size = 0.4,
shuffle = True,
random_state = 42)
scores_=cross_validate(regressors[0], Xtest_, ytest_,
scoring=['r2', 'explained_variance',
'neg_mean_absolute_error','neg_mean_squared_error'],
cv=30)
The scores_ dictionary holds all the values of the regression metrics produced after cross validating the model 30 times. We turn this into a dataframe for easier manipulation of the values it holds.
cv_scores_df = pd.DataFrame(scores_)
cv_scores_df.head()
|   | fit_time | score_time | test_r2  | test_explained_variance | test_neg_mean_absolute_error | test_neg_mean_squared_error |
|---|----------|------------|----------|-------------------------|------------------------------|-----------------------------|
| 0 | 0.114666 | 0.002334   | 0.908671 | 0.909659                | -3.396428                    | -26.555192                  |
| 1 | 0.053154 | 0.002363   | 0.849104 | 0.852225                | -3.352535                    | -19.849173                  |
| 2 | 0.054944 | 0.002347   | 0.929630 | 0.933802                | -3.977411                    | -23.962970                  |
| 3 | 0.057264 | 0.002356   | 0.846868 | 0.848218                | -5.781331                    | -56.167634                  |
| 4 | 0.054327 | 0.002251   | 0.876412 | 0.912978                | -2.690353                    | -12.663648                  |
The time-related columns are not of interest, so we drop them. We also flip the sign of the mean errors back (scikit-learn reports them as negative values so they can be treated as scores to maximize).
cv_scores_df = cv_scores_df.loc[:,'test_r2':'test_neg_mean_squared_error']
cv_scores_df['test_MAE'] = -1*cv_scores_df['test_neg_mean_absolute_error']
cv_scores_df['test_MSE'] = -1*cv_scores_df['test_neg_mean_squared_error']
cv_scores_df.drop(columns=['test_neg_mean_absolute_error', 'test_neg_mean_squared_error'], inplace=True)
Finally, we plot the distributions of the metrics computed during the cross validation process.
fig, ax = plt.subplots(1, 4, figsize=(16, 6))
for j in range(0, 4):
# print(cv_scores_df.columns[j])
sns.histplot(cv_scores_df[cv_scores_df.columns[j]],ax = ax[j])
cv_scores_df.describe().drop(index=['count']).round(3) #statistics of 30 cross validations
|      | test_r2 | test_explained_variance | test_MAE | test_MSE |
|------|---------|-------------------------|----------|----------|
| mean | 0.860   | 0.875                   | 3.976    | 35.575   |
| std  | 0.094   | 0.081                   | 1.246    | 26.838   |
| min  | 0.555   | 0.641                   | 2.054    | 6.851    |
| 25%  | 0.847   | 0.857                   | 2.973    | 16.149   |
| 50%  | 0.872   | 0.889                   | 3.856    | 28.254   |
| 75%  | 0.922   | 0.928                   | 4.756    | 45.550   |
| max  | 0.973   | 0.982                   | 7.089    | 125.086  |
In conclusion, boosted regression trees perform best on this dataset. The expected coefficient of determination of the fit is $\bar{R}^{2} = 0.860$ with a standard deviation $\sigma = 0.094$.
Optimization of the hyperparameters is likely to improve these performance metrics further.
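As a pointer for that next step, here is a minimal sketch of a randomized hyperparameter search for the XGBoost regressor; the search space, iteration count, and scoring choice are assumptions rather than part of this analysis:

```python
from sklearn.model_selection import RandomizedSearchCV

# hypothetical randomized search over a small XGBoost parameter space
param_distributions = {'n_estimators': [100, 300, 500],
                       'max_depth': [3, 5, 7],
                       'learning_rate': [0.01, 0.05, 0.1, 0.3],
                       'subsample': [0.7, 0.85, 1.0]}
search = RandomizedSearchCV(xgb.XGBRegressor(),
                            param_distributions=param_distributions,
                            n_iter=20,
                            scoring='neg_mean_absolute_error',
                            cv=5,
                            random_state=2021)
search.fit(Xtrain, ytrain)
print(search.best_params_, -search.best_score_)
```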