Analysis of the Concrete Compressive Strength dataset

A complete description of the dataset can be found here. The objective of this notebook is to show how to use a pipeline to screen the predictive power of out-of-the-box machine learning algorithms.

Read and examine the data
Preprocessing
Set up the pipeline of regressors, train, and predict
Explanation of weights for best performing model
Cross validate the model
Conclusion

Read and examine the data

Create a copy of the data and rename columns in a more readable format.
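A minimal sketch of this step, assuming hypothetical long-form column headers (the actual UCI file uses similarly verbose descriptive names):

```python
import pandas as pd

# Hypothetical raw headers standing in for the verbose UCI column names.
raw = pd.DataFrame({
    "Cement (component 1)(kg in a m^3 mixture)": [540.0, 332.5],
    "Age (day)": [28, 270],
    "Concrete compressive strength(MPa, megapascals)": [79.99, 40.27],
})

# Work on a copy so the original frame stays untouched,
# then rename columns to a more readable format.
df = raw.copy()
df = df.rename(columns={
    "Cement (component 1)(kg in a m^3 mixture)": "Cement",
    "Age (day)": "Age",
    "Concrete compressive strength(MPa, megapascals)": "CompressiveStrength",
})
```

Working on a copy keeps the raw data available for later sanity checks.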

Examine the feature vs feature relationships visually on a scatter matrix.

The only clear trend visible is the cement-CompressiveStrength one. Let's examine the correlation coefficients explicitly. We are looking at a highly non-linear relationship of age and ingredients to compressive strength (more details here). A review of the kernel density estimations in the scatter matrix above suggests that outliers appear in the distributions of the measured parameters. Furthermore, non-monotonic relationships cannot be excluded. Therefore, we will use Kendall's $\tau$ correlation method instead of Pearson's method.
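The difference between the two methods can be sketched on synthetic data: for a strictly monotonic but non-linear relationship, Kendall's $\tau$ (rank concordance) reaches 1 while Pearson's coefficient is pulled below 1 by the curvature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
# Monotonic but strongly non-linear relationship.
df = pd.DataFrame({"x": x, "y": x ** 3})

pearson = df.corr(method="pearson").loc["x", "y"]
kendall = df.corr(method="kendall").loc["x", "y"]
# Kendall's tau sees the perfect monotonic ordering;
# Pearson is dragged below 1 by the curvature.
```

The same `DataFrame.corr(method="kendall")` call produces the correlation table used for the heatmap.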

The resulting table and heatmap are shown below. The correlations observed are weak or, at best, of medium strength.

Preprocessing

There are no missing values; therefore, the preprocessing steps are simple: split the response and predictors (y and X, respectively) and create train-test sets.

When splitting into train and test sets we fix the random seed for consistency when evaluating and comparing performance on the train and test sets.
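A minimal sketch of the split, using synthetic stand-in arrays: fixing `random_state` guarantees that every regressor is evaluated on the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))   # stand-in for the eight concrete predictors
y = rng.normal(size=100)        # stand-in for CompressiveStrength

# The same random_state always yields the same split,
# so results are reproducible across runs and across models.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.2, random_state=7)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=7)
```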

Instead of performing a single regression, we regress the data with multiple tree-based regressors and a simple linear one. Specifically, we perform the following:

All tree regressors are used out-of-the-box (i.e. with default parameter values) and compared against an optimized Ridge with $\alpha = 5$. The objective here is not to optimize the hyperparameters of the regressors but rather to construct a limited AutoML-type analysis that pinpoints the best one.

Set up the pipeline of regressors, train, and predict

Create a list of predictions, train each model/regressor in a loop, and append the predictions in the created list.
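The loop can be sketched as follows, with synthetic data and a reduced model lineup (again with `GradientBoostingRegressor` standing in for xgboost):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic non-linear target for illustration.
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

models = {
    "RandomForest": RandomForestRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "Ridge": Ridge(alpha=5.0),
}

predictions = []  # one prediction array per model, in insertion order
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions.append(model.predict(X_test))
```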

Seaborn conveniently provides the regplot method, which we use to examine measured vs. predicted outcomes. The spread of the points is indicative of the predictive strength of each model.

Tree regression (unoptimized) clearly outperforms linear regression. However, the plots alone cannot tell us which tree regressor is best; we should examine statistical metrics.

Let's examine the residuals of each fit and the metrics associated with them. For a good statistical fit we are looking for residuals normally distributed about a mean of 0. Although we could create Q-Q plots or perform rigorous hypothesis testing for normality, a simple fit overlaid on the histogram of observed residuals suffices. To identify outliers we also create a box plot.
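The residual checks can be sketched numerically on synthetic data; the 1.5 IQR rule below is the same criterion a box plot uses to flag outliers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

residuals = y_te - model.predict(X_te)
# For a good fit the residuals should be centred on zero ...
mean_resid = residuals.mean()
# ... and a box plot flags points beyond 1.5 IQR from the quartiles as outliers.
q1, q3 = np.percentile(residuals, [25, 75])
iqr = q3 - q1
outliers = residuals[(residuals < q1 - 1.5 * iqr) | (residuals > q3 + 1.5 * iqr)]
```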

Out of all the tree regressors xgboost seems to be performing a little better based on the mean absolute error (MAE). Let's examine plots of the metrics.
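Selecting a "winner" by MAE can be sketched like this, with `GradientBoostingRegressor` standing in for xgboost and a synthetic non-linear target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

maes = {}
for name, model in {
    "RandomForest": RandomForestRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "Ridge": Ridge(alpha=5.0),
}.items():
    model.fit(X_tr, y_tr)
    maes[name] = mean_absolute_error(y_te, model.predict(X_te))

best = min(maes, key=maes.get)  # lowest mean absolute error wins
```

On a non-linear target like this, the tree ensembles beat the linear baseline, mirroring the pattern seen on the concrete data.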

Since xgboost seems to be the "winner" we focus on it. First we explain the model's weights, i.e. contributions from each predictor to the final response.

Explanation of weights for best performing model

Both the model's weights and the permuted feature importances agree that Age, followed by Cement, is the most important feature. Age was also the feature most correlated with CompressiveStrength.
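The two views of importance can be sketched with scikit-learn: the impurity-based importances come from the fitted trees, while permutation importance shuffles each column and measures the resulting score drop. The data below is synthetic, with feature 0 deliberately dominant.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
# Feature 0 drives the target; features 1-2 are mostly noise.
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Importances derived from the fitted trees themselves ...
impurity_imp = model.feature_importances_
# ... versus importances from shuffling each column on held-out data.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
```

When both rankings agree, as they do for Age and Cement here, the conclusion is more trustworthy.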

Cross validate the model

Due to the stochastic nature of tree models it is always good to cross-validate the performance on different parts of the data. Therefore, here we split the data into a new group of train-test subsets, increasing the test size to 40% of the total. Then we cross_validate the performance of the xgboost regressor 30 times on the same metrics as above.
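The 30-split scheme can be sketched with `ShuffleSplit` and `cross_validate` (again with `GradientBoostingRegressor` standing in for xgboost, on synthetic data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=300)

# 30 random train/test splits, each holding out 40% of the data.
cv = ShuffleSplit(n_splits=30, test_size=0.4, random_state=0)
scores = cross_validate(
    GradientBoostingRegressor(random_state=0),
    X, y, cv=cv,
    scoring=["neg_mean_absolute_error", "r2"],
)
```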

This will show whether we got "lucky" in the previously recorded performance of xgboost. If not, then we have a really solid model on which to run predictions.

Note: In reality LightGBM is performing equally well. A rigorous cross-validation process would have focused on both the XGBoost and LightGBM models. Hypothesis testing between the distributions of the cross-validated results would then indicate whether there is a true (out-of-the-box) winner or not.

The scores_ dictionary holds all the values of the regression metrics produced after cross validating the model 30 times. We turn this into a dataframe for easier manipulation of the values it holds.

Columns related to time are not of interest, so we drop them. We also reverse the sign of the mean errors (scikit-learn sets them negative so they can be used as maximization scores).
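The clean-up can be sketched on a dictionary shaped like the output of `cross_validate` (the values below are illustrative placeholders, not real results):

```python
import numpy as np
import pandas as pd

# Shape of the dict returned by sklearn.model_selection.cross_validate;
# values here are illustrative placeholders.
scores = {
    "fit_time": np.array([0.12, 0.11, 0.13]),
    "score_time": np.array([0.01, 0.01, 0.01]),
    "test_neg_mean_absolute_error": np.array([-3.1, -2.8, -3.4]),
    "test_r2": np.array([0.86, 0.88, 0.84]),
}

results = pd.DataFrame(scores)
# Timing columns are not regression metrics; drop them.
results = results.drop(columns=["fit_time", "score_time"])
# scikit-learn negates error metrics so that higher is always better;
# flip the sign back and rename for readability.
results["test_neg_mean_absolute_error"] *= -1
results = results.rename(columns={"test_neg_mean_absolute_error": "test_mae"})
```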

Finally, we are plotting the distributions of the errors computed during the cross validation process.

Conclusion

In conclusion, boosted regression trees perform best on this dataset. The expected coefficient of determination of the fit is $\bar{R}^{2} = 0.860$ with a standard deviation $\sigma = 0.094$.
Optimization of the hyperparameters is likely to improve the performance metrics.