Wednesday, January 31, 2018

Logistic regression and random forest (Jan 29)

Today we studied the remainder of the linear classifiers part. In particular, we looked at multiclass extensions of binary classifiers. For example, sklearn.svm.LinearSVC can only work with binary data, but can be extended to multiclass problems using the meta-classifiers sklearn.multiclass.OneVsRestClassifier and sklearn.multiclass.OneVsOneClassifier, as sketched below.
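
To make this concrete, here is a minimal sketch of wrapping the binary LinearSVC in a one-vs-rest meta-classifier (the iris data is used purely as an illustration):

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    # One binary LinearSVC is trained per class (class k vs. the rest);
    # prediction picks the class whose classifier is most confident.
    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(X, y)
    print(clf.predict(X[:5]))

Swapping in OneVsOneClassifier instead trains one classifier per pair of classes and decides by majority vote.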

Logistic regression is yet another linear classifier. Unlike the others, it is based on maximizing the likelihood of the data given the model parameters. In practice, training boils down to minimizing the logistic loss function. The minimization is done by deriving an expression for the gradient of the log loss (we did this at the lecture) and iteratively adjusting the weight vector in the direction of the negative gradient (we will do this in next week's exercises).
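
As a sketch of what such an exercise might look like (the toy data and step size here are my own choices, not from the lecture): with labels y in {-1, +1}, the logistic loss is L(w) = sum_i log(1 + exp(-y_i w.x_i)), and plain gradient descent on it can be written as

    import numpy as np

    def log_loss_gradient(w, X, y):
        # d/dw log(1 + exp(-y * w.x)) = -y * x * sigmoid(-y * w.x)
        margins = y * (X @ w)
        return -((y / (1.0 + np.exp(margins))) @ X)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.sign(X @ np.array([1.0, -1.0]))  # toy linearly separable labels
    w = np.zeros(2)
    for _ in range(500):
        w -= 0.1 * log_loss_gradient(w, X, y)  # step along the negative gradient
    print(w)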

Note: Keras calls the log loss binary crossentropy, and in the multiclass case categorical crossentropy.
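
For reference, a minimal sketch of how these loss names appear in Keras (the layer sizes here are arbitrary):

    from keras.models import Sequential
    from keras.layers import Dense

    # Binary case: one sigmoid output, log loss = 'binary_crossentropy'.
    binary_model = Sequential([Dense(1, activation='sigmoid', input_shape=(20,))])
    binary_model.compile(optimizer='sgd', loss='binary_crossentropy')

    # Multiclass case: softmax outputs with one-hot targets,
    # log loss = 'categorical_crossentropy'.
    multi_model = Sequential([Dense(3, activation='softmax', input_shape=(20,))])
    multi_model.compile(optimizer='sgd', loss='categorical_crossentropy')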

After finishing the logistic regression part, we looked at the first ensemble classifier: the random forest. Random forests are based on decision trees, for which efficient training procedures exist. However, the drawback of a single decision tree is that it completely overlearns the training data and starts to behave much like a 1-Nearest-Neighbor classifier. A random forest trains a collection of decision trees while hiding parts of the data from each one: not all samples are shown to each tree, and not all features are shown either.
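
As a quick sketch (again using the iris data purely as an illustration), both kinds of hiding show up directly as arguments of sklearn's random forest:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # bootstrap=True: each tree sees a bootstrap sample of the rows;
    # max_features='sqrt': each split considers only a random subset of features.
    clf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                 max_features='sqrt', random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))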

