Wednesday, January 31, 2018

Ensemble methods and neural nets (Jan 31)

Today we continued with the ensemble methods started last time.

Before that, however, we discussed the problem of model comparison in the competition. Namely, train_test_split will mix samples from the same recording into the train and validation sets, which gives an overly optimistic 99% score. Here is a great post on the topic.

Random forests have the ability to estimate the importance of each feature. This is done by randomizing the features one at a time and observing how much the accuracy degrades. If a feature is important, scrambling it drops the accuracy a lot, while scrambling an unimportant feature has only a minor effect on performance. In the lecture, we looked at an example where we estimated the feature importances for the Kaggle competition data:


from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from utils import load_data, extract_features
import matplotlib.pyplot as plt

if __name__ == "__main__":
    
    path = "audio_data"
    
    X_train, y_train, X_test, y_test, class_names = load_data(path)
    
    # Axes: 
    # 2 = average over time
    # 1 = average over frequency
    # None = average over both and concatenate
    
    F_train = extract_features(X_train, axis = None)
    F_test  = extract_features(X_test, axis = None)
    
    # Train random forest model and evaluate accuracy
    #model = RandomForestClassifier(n_estimators = 100)
    model = ExtraTreesClassifier(n_estimators = 100)
    
    model.fit(F_train, y_train)
    y_pred = model.predict(F_test)
    acc = accuracy_score(y_test, y_pred)
    
    print("Accuracy %.2f %%" % (100.0 * acc))
    
    # Plot feature importances
    importances = model.feature_importances_
    plt.bar(range(len(importances)), importances)
    plt.title("Feature importances")
    plt.show()
    

The resulting feature importances are plotted in the graph below. Here, the first 501 features are the frequency-wise averages for each of the 501 time points. The last 40 features are the corresponding averages along the time axis. It can be clearly seen that the time-averages are more significant for prediction accuracy.
After the RF part, we briefly looked at other ensembles: AdaBoost, gradient boosting and extremely randomized trees. We also mentioned xgboost, which has often been a winner in Kaggle competitions.
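
For reference, these other ensembles can be tried with the same scikit-learn interface. Below is a minimal sketch (not the lecture code; toy data generated with make_classification stands in for the competition features):


from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              ExtraTreesClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {"AdaBoost": AdaBoostClassifier(n_estimators=100),
          "GradientBoosting": GradientBoostingClassifier(n_estimators=100),
          "ExtraTrees": ExtraTreesClassifier(n_estimators=100)}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print("%s accuracy: %.2f %%" % (name, 100.0 * acc))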

In the second hour, we started the neural network part of the course. Neural nets are based on mimicking human brain functionality, although they have diverged quite far from the original idea. Among the topics, we discussed the differences between 1990s nets and modern nets (more layers, more data, more exotic layers). At the end of the lecture, we looked at how Keras can be used for training a simple 2-layer dense network.
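
As a reference, a minimal Keras sketch of such a 2-layer dense network could look roughly like the following (this is not the exact lecture code, and random toy data stands in for real features):


import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

num_features = 40
num_classes = 10

# Toy data: 1000 random samples with random labels.
X = np.random.randn(1000, num_features)
y = to_categorical(np.random.randint(0, num_classes, size=1000), num_classes)

model = Sequential()
model.add(Dense(100, activation="relu", input_dim=num_features))  # hidden layer
model.add(Dense(num_classes, activation="softmax"))               # output layer

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(X, y, epochs=10, batch_size=32)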

Logistic regression and random forest (Jan 29)

Today we studied the remainder of the linear classifiers part. In particular, we looked at multiclass extensions of binary classifiers. For example, a linear SVM such as sklearn.svm.LinearSVC is inherently a two-class model, but it can be extended to multiclass using the meta-classifiers sklearn.multiclass.OneVsRestClassifier and sklearn.multiclass.OneVsOneClassifier.
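
A minimal sketch of the idea, using the iris toy data in place of the course data:


from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)       # 3 classes

ovr = OneVsRestClassifier(LinearSVC())  # one binary SVM per class
ovo = OneVsOneClassifier(LinearSVC())   # one binary SVM per class pair

print(ovr.fit(X, y).score(X, y))
print(ovo.fit(X, y).score(X, y))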

Logistic regression is yet another linear classifier. Unlike the others, it is based on maximizing the likelihood of the data given the model parameters. In practice, the training boils down to minimizing the logistic loss function. The minimization is done by deriving an expression for the gradient of the log loss (we did this in the lecture) and iteratively adjusting the weight vector towards the negative gradient (we do this in next week's exercises).
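
For reference, here is a minimal numpy sketch of that training loop (this is not the exercise solution; it omits the bias term and any regularization):


import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: two Gaussian blobs with labels 1 and 0.
np.random.seed(0)
X = np.vstack([np.random.randn(100, 2) + 2, np.random.randn(100, 2) - 2])
y = np.hstack([np.ones(100), np.zeros(100)])

w = np.zeros(2)
lr = 0.1

for it in range(1000):
    p = sigmoid(X.dot(w))            # predicted class probabilities
    grad = X.T.dot(p - y) / len(y)   # gradient of the average log loss
    w = w - lr * grad                # step towards the negative gradient

print("Learned weights:", w)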

Note: Keras calls log loss binary crossentropy and in the multiclass case categorical crossentropy.

After finishing the logistic regression part, we looked at the first ensemble classifier: the random forest. Random forests are based on decision trees, for which efficient training procedures exist. However, the drawback of a decision tree is that it completely overfits the data and becomes similar to a 1-nearest-neighbor classifier. A random forest trains a collection of DTs by hiding parts of the data: not all samples are shown to each tree, and not all features are shown either.
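
In scikit-learn, this hiding of data maps to two parameters, roughly as sketched below: bootstrap resamples the training samples for each tree, and max_features limits how many features each split may consider (toy data only):


from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=100,
                               bootstrap=True,       # each tree sees a bootstrap sample
                               max_features="sqrt",  # each split sees a random feature subset
                               random_state=0)
model.fit(X, y)
print(model.score(X, y))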

Thursday, January 25, 2018

Linear classifiers (Jan 24)

Today we delved deeper into linear classifiers and studied LDA and the SVM.
Before the actual content, we discussed the course groupwork. Each group should submit their predictions to Kaggle as described in the assignment text.
Next, we studied a demo of projecting 2D data to 1D with different degrees of separation. The animation below recaps the idea: a good projection makes the classes nicely distinct in the 1D projected space. This is also seen in the Fisher score: a bigger score means better separation. Luckily, we do not need to try all directions (like in the demo), but can use the LDA formula to find the best direction with one line of code.

The LDA projection vector can be found in two ways: 1) Solve a generalized eigenvalue problem, or, 2) use the simpler formula:

             w = (S1 + S2)^(-1) (m1 - m2),

with S1 and S2 the covariance matrices of the two classes and m1 and m2 the respective class means.

At the beginning of the second hour, we applied LDA to pixel classification. In the example, the user left-clicks on face areas and right-clicks on non-face areas. This is used as training material for a skin detector that gives results like the one below.

This demo (code pasted at the bottom of this page) served as a motivation to look at the SVM. The support vector machine is known for the kernel trick, which (implicitly) maps the data into a higher dimension before classification. This enables non-linear class boundaries, which will be needed when the background is more variable than in the simple example above.

The essence of the SVM is maximum margin classification, i.e., it attempts to maximize the margin between the classes (as shown below). In case the classes overlap, we assign a penalty to each sample that is on the wrong side of the margin. The balance between the margin width and the number of samples falling on the wrong side is adjusted using the parameter C.
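
A minimal scikit-learn sketch of the effect of C with an RBF kernel (toy data only; this is not the lecture demo):


from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

clf_soft = SVC(kernel="rbf", C=0.1)    # wide margin, tolerates more violations
clf_hard = SVC(kernel="rbf", C=100.0)  # narrow margin, penalizes violations heavily

print(clf_soft.fit(X, y).score(X, y))
print(clf_hard.fit(X, y).score(X, y))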


# -*- coding: utf-8 -*-
"""
Created on Tue Jan 24 11:05:43 2017

@author: heikki.huttunen@tut.fi

An example of using the LDA for pixel color classification.
Left mouse button shows examples of foreground, and
right mouse button examples of background.
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from mpl_toolkits.mplot3d import Axes3D

def onclick(event):
    """
    Function that is run every time the user clicks the mouse.
    """
    
    ix, iy = event.xdata, event.ydata
    button = event.button
    
    if button == 2:
        # Stop when the user clicks the middle button (or closes the window)
        fig.canvas.mpl_disconnect(cid)
        plt.close("all")
    else:
        # Otherwise add to the coords list.
        global coords
        coords.append([int(ix), int(iy), button])

if __name__ == "__main__":
        
    # Load test image and show it.
    
    img = plt.imread("hh.jpg")
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.imshow(img)
    ax.set_title("Left-click: face; right-click: non-face; middle: exit")
    coords = []
    
    # Link mouse click to our function above.
    
    cid = fig.canvas.mpl_connect('button_press_event', onclick)
    plt.show()
    
    X = []
    y = []
    
    for ix, iy, button in coords:
        
        # Collect a 7x7 patch of pixels around the point the user clicked
    
        w = img[iy-3 : iy+4, ix-3:ix+4, :]
        
        # Unnecessarily complicated line to collect color channels
        # into a matrix (results in 49x3 matrix)
        
        C = np.array([w[...,c].ravel() for c in [0,1,2]]).T
        X.append(C)
        
        # Store class information to y.
        
        if button == 1:
            y.append(np.ones(C.shape[0]))
        else:
            y.append(np.zeros(C.shape[0]))
            
    X = np.concatenate(X, axis = 0)
    y = np.concatenate(y, axis = 0)
    X_test = np.array([img[...,c].ravel() for c in [0,1,2]]).T
    
    # Switch between sklearn and our own implementation.
    # Don't know why these produce slightly different results.
    # Both seem to work though.
    
    use_sklearn = True
    
    if use_sklearn:
        clf = LinearDiscriminantAnalysis()
        clf.fit(X, y)
        y_hat = clf.predict(X_test)
 
    else: # Do it the hard way:
        C0 = np.cov(X[y==0, :], rowvar = False)
        C1 = np.cov(X[y==1, :], rowvar = False)
        m0 = np.mean(X[y==0, :], axis = 0)
        m1 = np.mean(X[y==1, :], axis = 0)
        
        w = np.dot(np.linalg.inv(C0 + C1), (m1 - m0))
        T = 0.5 * (np.dot(w, m1) + np.dot(w, m0))    
 
        y_hat = np.dot(X_test, w) - T

        # Now y_hat is the class score.
        # Let's just threshold that at 0.
        # And cast from bool to integer
        
        y_hat = (y_hat > 0).astype(np.uint8)
    
    # Reshape the vector-form prediction back to the original image shape.
    
    class_img = np.reshape(y_hat, img.shape[:2])
    
    fig, ax = plt.subplots(1, 3)
    ax[0].imshow(img)
    
    if use_sklearn:
        # Class probabilities are only available from the sklearn model.
        prob_img = np.reshape(clf.predict_proba(X_test)[:, 1], img.shape[:2])
        ax[1].imshow(prob_img)
    else:
        ax[1].imshow(class_img)
    
    img[class_img == 0] = 0
    ax[2].imshow(img)

    plt.show()
    

Monday, January 22, 2018

K-NN & Linear classifiers (Jan 22)

Today we started by looking at an OpenCV human detection example. This nicely links detection theory with machine learning. The code uses the Histogram of Oriented Gradients (HOG) representation of the image and classifies each image location as "person" or "not person" using a support vector machine. The essential code lines are below.


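    # Note: this is an excerpt. It assumes OpenCV is imported as cv2, and that
    # draw_detections() is a small helper (not shown) that draws the found boxes.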
    # Initialize HOG detector.
    
    hog = cv2.HOGDescriptor()
    detectorCoefficients = cv2.HOGDescriptor_getDefaultPeopleDetector()
    hog.setSVMDetector( detectorCoefficients )

    # Load test image
    
    filename = 'person1.jpg'    
    img = cv2.imread(filename)
    
    # Detect humans
    
    found, w = hog.detectMultiScale(img, 
                                    winStride=(8,8), 
                                    padding=(32,32), 
                                    scale=1.05,
                                    hitThreshold = -1)

    draw_detections(img, found)

The first classifier was the K-nearest neighbor classifier, which searches for the K (e.g., K=5) nearest samples and assigns the most frequent class label among them to the query sample. Its benefits are simplicity and flexibility, but a major drawback is the slow speed at prediction time.
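
A minimal scikit-learn sketch of K-NN with K = 5 (iris toy data for illustration):


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" just stores the samples
print(knn.score(X_test, y_test))   # prediction searches the 5 nearest neighbors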

Linear classifiers are a simpler alternative. A linear classifier is characterized by a linear decision boundary, and described by the decision rule:

F(x) = "class 1" if wTx > b
F(x) = "class 0" if wTx <= b

Here, w and b are the weight vector and constant offset learnt from the data, and x is the new sample to be classified.
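
In code, the rule is just a dot product and a comparison. A tiny illustration with made-up values of w, b and x:


import numpy as np

w = np.array([1.0, -2.0])   # learned weights (hypothetical values)
b = 0.5                     # learned offset (hypothetical value)
x = np.array([3.0, 1.0])    # new sample to classify

label = 1 if np.dot(w, x) > b else 0
print(label)                # prints 1, since w^T x = 1.0 > 0.5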

If you dare, you may test an image detection/recognition demo using deep learning from Google's TensorFlow project. It is available for Android phones as an unverified APK at this link. Below is one detection example from last weekend. (Standard disclaimers on using unverified apps apply. Only install at your own risk.)


Wednesday, January 17, 2018

Detection Theory; ROC & AUC (Jan 17)

Today we studied the basics of detection theory (slide set 3). At the beginning of the lecture, we implemented a sinusoid detector in Python. The input was a 3-second-long audio signal with a 1000 Hz sinusoid (beep) present in the interval 1 s...2 s. The first and last seconds were blank. In the experiment, we deliberately added Gaussian noise on top of the clean signal and examined how well the sinusoid could be detected under noise.

The code is below. You need the Python sounddevice module, which can be installed from the Anaconda command prompt with "pip install sounddevice". The code below uses a separate module, "sin_detect", which is hidden here as it appears in next week's exercises.


# Example of playing a noisy sinusoid
# Requires installation of sounddevice module.
# In anaconda: "pip install sounddevice" should do this.

import sounddevice as sd
import numpy as np
from sin_detect import detect
import matplotlib.pyplot as plt

if __name__ == "__main__":
    
    f = 1000    # Play a beep at this Hz
    Fs = 8192   # Samplerate
    sigma = 2   # Stddev of the noise
    
    n = np.arange(3 * Fs) # The sample times
    x = np.sin(2 * np.pi * f * n / Fs)
    
    # Zero out the beginning and the end
    x[:Fs] = 0
    x[2*Fs:] = 0
    
    # Add noise
    x_noisy = x + sigma * np.random.randn(*x.shape)
    
    # Play the sound
    sd.play(x_noisy, Fs)
    
    # Detect
    y = detect(x_noisy, frequency = f, Fs = Fs)
    
    # Plot results
    
    fig, ax = plt.subplots(3, 1)
    
    ax[0].plot(n, x)
    ax[0].set_title("Ground truth sinusoid")
    ax[0].grid("on")
    
    ax[1].plot(n, x_noisy)
    ax[1].set_title("Noisy sinusoid")
    ax[1].grid("on")
    
    ax[2].plot(n, y)
    ax[2].set_title("Detection result")
    ax[2].grid("on")
    
    plt.tight_layout()
    
    plt.show()

Another real-life example is the TUT Age Estimator demo, which relies on face detection before the actual age estimation. Detection theory relies on the Neyman-Pearson theorem, which states that the best approach is to compute the likelihood ratio and compare it to a threshold. The likelihood ratio usually simplifies to a simple computation rule (e.g., the correlation between two sequences), whose output is then compared with a threshold.

The threshold is a free parameter whose value determines the sensitivity of the detector. The smaller the threshold, the more events are detected, and vice versa. A highly sensitive detector (= small threshold) probably finds all real events (e.g., locations in the signal with a beep) but also produces many false alarms (locations detected as a beep that contain only noise).

To compare different detectors, we would like to check their performance regardless of the threshold (or for all thresholds at the same time). To this end, we defined the Receiver Operating Characteristic (ROC) curve. For the ROC, one can compute the area under the curve (AUC; AUC_ROC; AUROC). A bigger AUC indicates better detector performance. A completely random guess results in AUC = 0.5 (a diagonal ROC), while perfect detection gives AUC = 1.0.
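
For reference, a minimal sketch of computing the ROC curve and AUC with scikit-learn, using simulated detector scores:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

np.random.seed(0)
y_true = np.random.randint(0, 2, size=1000)    # ground truth: event present or not
scores = y_true + np.random.randn(1000)        # noisy detector scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve (AUC = %.3f)" % auc)
plt.show()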

Tuesday, January 16, 2018

Estimation Theory (Jan 15)

In the third lecture, we looked more thoroughly at estimation theory, which in our context is closely linked with regression: the prediction of numerical values (regression) as opposed to the prediction of classes (classification).

In the beginning, there was a demo of predicting house prices in the Hervanta area of Tampere. For this, I had downloaded a price database and preprocessed it in Excel to produce a CSV file. The code itself binarizes all categorical labels (e.g., quality has three values and is converted to three binary indicators), fits a scikit-learn model, and predicts prices for the 20% of samples left out of the training set.


# -*- coding: utf-8 -*-
"""
Created on Mon Jan 15 09:08:53 2018

@author: hehu
"""

import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

if __name__ == "__main__":
    
    file = "prices.csv"
    
    X = []
    y = []
    
    # Load data line by line. Attributes (apt. size, year, etc) are
    # added to X, and the target (actual selling price) to y.
    
    with open(file, "r") as f:
        for line in f:
            
            # Skip first line
            if line.startswith("num_rooms"):
                continue
            
            parts = line.strip().split(";")
            
            rooms = int(parts[0])
            kind  = parts[1]
            
            # Numbers use Finnish locale with decimals separated by comma.
            # Just use replace(), although the proper way would be with
            # locale module.
            
            sqm   = float(parts[2].replace(",", "."))
            price = float(parts[3])
            year  = int(parts[4])
            elev  = parts[5]
            cond  = parts[6]
            
            X.append([rooms, kind, sqm, year, elev, cond])
            y.append(price)
            
    X = np.array(X)
    y = np.array(y)
    
    # Binarize categorical attributes (has_elevator, etc.)
    
    binarized_cols = [1, 4, 5]
    
    for col in binarized_cols:
        lb = LabelBinarizer()
        z = lb.fit_transform(X[:, col])
        X = np.append(X, z, axis = 1)
        
    for col in binarized_cols[::-1]: 
        X = np.delete(X, col, axis = 1)

    X = X.astype(float)
    y = y.astype(float)

    # Split to train and test
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                        random_state=42)
    
    # Fit the regression model and predict.
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    error = mean_absolute_error(y_test, y_pred)
    
    print(np.column_stack((y_test, y_pred)))
    
    print("Mean error: {:.1f} eur/sqm".format(error))
    print("Coefficients: " + str(model.coef_))
    # cols: num_rooms, sqm, year...

Next, we looked at maximum likelihood estimation. The main principle is to choose a model for the data (e.g., my data is a noisy sinusoid: x[n] = A * sin(2*pi*f*n + phi) + w[n]) and write down the probability of observing exactly these samples x[0], x[1], ..., x[N-1] given a parameter value (e.g., A). Then the question is only how to maximize this likelihood function with respect to the unknown parameter (e.g., A). In most cases the trick is to take the logarithm before differentiation; otherwise the mathematics becomes too hard to solve. In the second lecture hour, we saw an example MLE problem solved on the whiteboard.
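
As a small illustration of the recipe (not the whiteboard example itself), here is a sketch of the ML estimate of the amplitude A when the frequency and phase are known; with Gaussian noise, maximizing the likelihood reduces to least squares, giving A_hat = sum(x*s) / sum(s*s):


import numpy as np

Fs = 8192
f = 1000
A_true = 0.7

n = np.arange(Fs)
s = np.sin(2 * np.pi * f * n / Fs)              # known clean sinusoid (A = 1)
x = A_true * s + 2.0 * np.random.randn(len(n))  # noisy observation

A_hat = np.sum(x * s) / np.sum(s * s)           # maximum likelihood estimate of A
print("True A = %.2f, estimated A = %.2f" % (A_true, A_hat))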

Wednesday, January 10, 2018

Estimation theory (Jan 10)

Today we studied estimation theory (recap of least squares and first look at maximum likelihood).

At the beginning of the first lecture, we saw the solutions to the Moodle exercise worth 3 points:
 
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

if __name__ == '__main__':

    # Question 1: load data using numpy
    
    location_data = np.loadtxt("locationData.csv")
    print(np.shape(location_data))

    # Question 2: visualize using plt functions
    
    plt.plot(location_data[:, 0], location_data[:, 1], 'b-')
    plt.show()
    
    ax = plt.subplot(1, 1, 1, projection = "3d")
    plot3D = ax.plot(location_data[:,0], 
                      location_data[:,1], 
                      location_data[:,2])
    plt.show()
    
    # Question 3: Normalize data to zero mean and unit variance
    
    def normalize_data(data):
        
        return (data - np.mean(data, axis = 0)) / np.std(data, axis = 0)
        
    X_norm = normalize_data(location_data)
    
    print(np.mean(X_norm, axis = 0)) # Should be [0, 0, 0]
    print(np.std(X_norm, axis = 0)) # Should be [1, 1, 1]

The main point of the assignment was to make sure everyone has used Python and perhaps installed Anaconda. I looked at a random sample of the submissions, and they all look good. Thus, I will give 3 points to everyone who has submitted any solution.

Some points raised while discussing the solution above:
  • Calling plt.plot many times in a sequence will not wipe the old plots away; the curves accumulate in the same figure.
  • The normalization function uses broadcasting. In NumPy, this is an efficient way to vectorize computation. For example, the statement
    data - np.mean(data, axis = 0)

    operates with data of dimensions 600x3 and 1x3. This is allowed if the trailing dimensions agree (both have last dimension == 3); a tiny shape example is sketched right after this list.
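
A tiny shape example of the broadcasting rule:


import numpy as np

data = np.random.randn(600, 3)
col_means = np.mean(data, axis=0)   # shape (3,), broadcast as (1, 3)

centered = data - col_means         # (600, 3) - (1, 3) -> (600, 3)
print(centered.shape)               # (600, 3)
print(np.mean(centered, axis=0))    # approximately [0, 0, 0]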

In the second hour, the focus was on linear regression models: least squares and maximum likelihood. In regression, the prediction target is real-valued, and it is easy to consider accuracy in terms of output variance. In general, minimum variance unbiased estimators are desired but often hard to derive. Our key focus is on least squares and maximum likelihood.

It seems that this time the lecture video worked, but the screen capture was missing (some cable was loose). The admins promised that next time everything should work just fine. Also, there will be a live cast from now on.

Tuesday, January 9, 2018

Introduction and basics of Python (Jan 8)

Today was the first lecture: we discussed the course organization and took a brief look at using Python for machine learning.

Things to remember from today:
  • Passing the course requires: 1) the exercises (at least 60%), 2) the assignment (competition), and 3) the final exam.
  • Remember to register for the exercises in POP.
  • Remember to register a group for the competition (max 4 members). The deadline is 19.1.
  • The first exercises are already in Moodle. Return them by Wednesday noon at the latest.
  • You can use the classroom (TC303) computers or your own laptop in the exercise sessions. If you use your own, we recommend installing Anaconda Python (or Miniconda with the appropriate packages).
The lecture slides are available at the course website. Unfortunately, it seems that the video recording failed, so below is a brief summary of the lecture. If you want, last year's videos are available (note that the dates and competition topics are of course no longer valid).

The first hour concentrated mostly on the organization of the course (see above).

In the second hour, we looked at the beginning of the slide set. First, we emphasized the difference between the model-based and the training-based approach to solving recognition and detection problems.
  • If, for example, the problem is to detect whether a sinusoidal beep is present in an audio signal, there is no point in solving it by showing examples. This is because there is a perfect model (a formula) for the sinusoid, and we can mathematically define exactly what we are looking for.
  • On the other hand, if the task is to tell pictures of cats and dogs apart, the model-based approach is no longer useful: there is no formula that would describe all possible pictures of cats or dogs.
This Wednesday (10.1, noon) is the deadline for submitting your solutions to the first three exercises in Moodle. This is a special case; usually we return the exercises in the classroom.

The Python tasks are rather straightforward. At the end of the lecture, we looked at how to read the competition target labels (the file y_train.csv). Since the lecture video is not available, here's the code.
 
if __name__ == "__main__":
    
    labels = []
    
    # Use a context manager so that the file gets closed automatically.
    with open("y_train.csv", "r") as f:
        
        for line in f: # f is "iterable"
            
            if "id" in line: # Skip the header line
                continue
            
            idx, label = line.split(",")
            labels.append(label.strip())
        
    print(labels)
    

Prep for exam, visiting lectures (Feb 22)

In the last lecture, we spent the first 30 minutes on an old exam. In particular, we learned how to calculate the ROC and AUC for a test sam...