Wednesday, February 14, 2018

Regularization and error estimation (Feb 14)

Today we studied error estimation using cross validation (CV). This includes K-fold CV, stratified K-fold CV, group K-fold CV, and leave-one-out (LOO) CV. For each method there is a corresponding Sklearn generator that produces the splits. The CV methods differ from a simple train_test_split in that they train and test many times: for example, 10-fold CV produces 10 estimates of accuracy instead of just one. Averaging the 10 numbers increases the stability and reliability of the estimate, which helps in model selection (e.g., in deciding whether to use SVM or LogReg).
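
As a rough illustration, here is a minimal sketch of 10-fold stratified CV in scikit-learn; the dataset and the two candidate models are stand-ins chosen for the example, not anything from the course:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # One splitter object, reused for both candidates so they see the same folds.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    candidates = {
        "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "SVM":    make_pipeline(StandardScaler(), SVC()),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=cv)   # 10 accuracy estimates
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

The mean over the 10 folds is the number you would use to pick between the two candidates, and the standard deviation gives a feel for how stable that estimate is.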

Next, we studied L1 and L2 regularization with linear models. Regularization helps to avoid overfitting, where your model performs very well on the training data but poorly on the test data (see also the previous blog entry, where a convnet overfits). These regularization techniques add a penalty to the loss function (such as the log-loss or hinge loss), and the penalty is either the L1 or L2 norm of the weight vector. Note also that the same technique applies to deep nets, which are essentially stacked LogReg models. For example, in Keras, the regularizers module enables adding such a penalty if your net seems to overfit. In DNNs, the most widely used regularizer is probably dropout, which we have discussed earlier, but all of these serve the same purpose: avoiding overfitting to the training data.
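
For a linear model, a minimal sketch of the two penalties with scikit-learn's LogisticRegression could look like the following; the dataset and the C value are placeholders for illustration (C is the inverse regularization strength, so smaller C means a stronger penalty):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)   # penalties assume comparable feature scales

    for penalty in ["l1", "l2"]:
        # liblinear supports both the L1 and L2 penalty for logistic regression
        clf = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear").fit(X, y)
        n_zero = np.sum(clf.coef_ == 0)
        print(f"{penalty}: train acc = {clf.score(X, y):.3f}, zero weights = {n_zero}")

Typically the L1 penalty drives some weights exactly to zero (a sparse model), while L2 only shrinks them. The rough Keras counterpart would be passing, e.g., kernel_regularizer=regularizers.l2(0.01) to a Dense layer.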

There was a question about what tricks the winning teams are using to succeed in the competition. Among the reports due last Monday, I have spotted a few widely used ones. I don't want to spoil the competition by revealing them all, but here are a few general tricks:
  • Model averaging: instead of using just one model, train many predictors (LR, SVM, RF, deep net) and average their predictions, either by majority vote (which class is predicted most often) or by averaging the class probabilities (see the first sketch after this list). If you want to get fancy, you may want to try different combinations of predictors in your local test bench and automate the choice of predictors to include. I have done this in this very related competition (you can find my name on the leaderboard). Also check the "congratulations thread" of the discussion forum there.
  • Semisupervised learning: with this one you can learn from the test data as well. Googling for "semi supervised logistic regression" will find many papers where the test data is used to aid the classification. Then there is the dirty trick we used in this competition (check the leaderboard here as well; see the second sketch after this list):
    • Train a model with the training data
    • Predict labels for the test data
    • Append the test samples, together with their predicted labels, to the training set
    • Retrain the model with train + test data and with the fake labels
    • Predict the labels for test data again
  • The accuracy of the above usually improves if you include only those test samples whose predicted class probability is high (e.g. > 0.6).
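
Here is a minimal sketch of model averaging via class probabilities; the dataset, the train/test split, and the particular models are placeholders for the example, not the competition setup:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = [LogisticRegression(max_iter=5000),
              SVC(probability=True),                  # needed for predict_proba
              RandomForestClassifier(n_estimators=200, random_state=0)]

    # Average the predicted class probabilities over all models, then take the argmax.
    probas = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te) for m in models], axis=0)
    y_pred = probas.argmax(axis=1)
    print("averaged accuracy:", (y_pred == y_te).mean())

Scikit-learn's VotingClassifier does essentially the same thing: voting='soft' averages the probabilities, while voting='hard' corresponds to the majority vote.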
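
The semisupervised "fake label" trick can also be sketched in a few lines; again the data, the model, and the 0.6 confidence threshold are just illustrative assumptions:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 1) Train a model with the training data.
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    # 2) Predict labels (and their probabilities) for the test data.
    proba = model.predict_proba(X_test)
    pseudo_labels = proba.argmax(axis=1)
    confident = proba.max(axis=1) > 0.6        # keep only confident predictions

    # 3) Append the confident test samples with their fake labels to the training set.
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, pseudo_labels[confident]])

    # 4) Retrain with train + test data and the fake labels; 5) predict the test labels again.
    model = LogisticRegression(max_iter=5000).fit(X_aug, y_aug)
    final_pred = model.predict(X_test)
    print("accuracy after self-training:", (final_pred == y_test).mean())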
The reports also discuss many other innovative tricks related to feature extraction and the choice of classifier. I will require the teams to disclose these in the final report (instructions out soon), and they will be opened to the whole world after the competition.


Prep for exam, visiting lectures (Feb 22)

In the last lecture, we spent the first 30 minutes on an old exam. In particular, we learned how to calculate the ROC and AUC for a test sam...