Wednesday, January 31, 2018

Ensemble methods and neural nets (Jan 31)

Today we continued with the ensemble methods started last time.

However, before that we discussed the problem of model comparison in the competition. Namely, train_test_split will mix samples from the same recording into the train and validation sets, which gives an overly optimistic score of about 99%. Here is a great post on the topic.
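
One way around this is to split by recording rather than by sample, so that all clips from the same recording end up on the same side of the split. Below is a minimal sketch using scikit-learn's GroupShuffleSplit; the feature matrix, labels and recording IDs are random placeholders, since obtaining the actual recording IDs from the competition data is not shown here.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: 1000 samples, 541 features, 10 classes, 50 recordings.
# In practice F, y and recording_ids would come from the competition data.
F = np.random.rand(1000, 541)
y = np.random.randint(0, 10, size=1000)
recording_ids = np.random.randint(0, 50, size=1000)

# GroupShuffleSplit keeps all samples of a recording on the same side of
# the split, so the validation score is not inflated by near-duplicates.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(F, y, groups=recording_ids))

F_tr, F_val = F[train_idx], F[val_idx]
y_tr, y_val = y[train_idx], y[val_idx]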

Random forests have the ability to estimate the importance of each feature. This is done by randomizing the features one at a time and observing how much the accuracy degrades. If a feature is important, scrambling it drops the accuracy a lot, while scrambling an unimportant feature has only a minor effect on performance. In the lecture, we looked at an example where we estimated the feature importances for the Kaggle competition data:


from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from utils import load_data, extract_features
import matplotlib.pyplot as plt

if __name__ == "__main__":
    
    path = "audio_data"
    
    X_train, y_train, X_test, y_test, class_names = load_data(path)
    
    # Axes: 
    # 2 = average over time
    # 1 = average over frequency
    # None = average over both and concatenate
    
    F_train = extract_features(X_train, axis = None)
    F_test  = extract_features(X_test, axis = None)
    
    # Train random forest model and evaluate accuracy
    #model = RandomForestClassifier(n_estimators = 100)
    model = ExtraTreesClassifier(n_estimators = 100)
    
    model.fit(F_train, y_train)
    y_pred = model.predict(F_test)
    acc = accuracy_score(y_test, y_pred)
    
    print("Accuracy %.2f %%" % (100.0 * acc))
    
    # Plot feature importances
    importances = model.feature_importances_
    plt.bar(range(len(importances)), importances)
    plt.title("Feature importances")
    plt.show()
    

The resulting feature importances are plotted in the graph below. Here, the first 501 features are the frequency-wise averages at each of the 501 time points. The last 40 features are the corresponding averages along the time axis. It can be clearly seen that the time-averaged features are far more significant for prediction accuracy.
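
Note that feature_importances_ above is the importance score scikit-learn computes from the trees themselves. The scrambling procedure described at the start of this post can also be written out by hand; here is a rough sketch that reuses the fitted model and the test features from the script above (assuming they are NumPy arrays):

import numpy as np
from sklearn.metrics import accuracy_score

def scrambling_importance(model, F, y, random_state=0):
    # Importance of feature j = drop in accuracy when column j is
    # randomly permuted, destroying its relation to the labels.
    rng = np.random.RandomState(random_state)
    baseline = accuracy_score(y, model.predict(F))
    drops = np.zeros(F.shape[1])
    for j in range(F.shape[1]):
        F_scrambled = F.copy()
        F_scrambled[:, j] = rng.permutation(F_scrambled[:, j])
        drops[j] = baseline - accuracy_score(y, model.predict(F_scrambled))
    return drops

# Example usage with the variables from the script above:
# importances = scrambling_importance(model, F_test, y_test)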
After the RF part, we briefly looked at other ensembles: AdaBoost, GradientBoosting and Extremely Randomized Trees. We also mentioned xgboost, which has often been a winner in Kaggle competitions.
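
All of these follow the same scikit-learn estimator interface, so any of them can be swapped into the model = ... line of the script above. A quick sketch of the constructors (with default-ish, untuned parameters):

from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              ExtraTreesClassifier)

# Each of these is a drop-in replacement for the `model = ...` line above.
models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100),
}

# xgboost offers a compatible wrapper as well (requires the xgboost package):
# from xgboost import XGBClassifier
# models["XGBoost"] = XGBClassifier(n_estimators=100)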

In the second hour, we started the neural network part of the course. Neural nets are based on mimicking human brain functionality, although they have diverged quite far from the original idea. Among other topics, we discussed the differences between 1990s nets and modern nets (more layers, more data, more exotic layers). At the end of the lecture, we looked at how Keras can be used for training a simple 2-layer dense network.
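
As a reminder of what such a model looks like, below is a minimal Keras sketch of a 2-layer dense network. It reuses the feature matrices from the script above; the hidden layer size, optimizer and number of epochs are placeholder choices, not the exact values used in class.

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Assumes F_train, y_train, F_test, y_test, class_names from the script
# above, with y_* given as integer class indices.
num_classes = len(class_names)

model = Sequential()
model.add(Dense(128, activation="relu", input_dim=F_train.shape[1]))  # hidden layer
model.add(Dense(num_classes, activation="softmax"))                   # output layer

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(F_train, to_categorical(y_train, num_classes),
          epochs=20, batch_size=32,
          validation_data=(F_test, to_categorical(y_test, num_classes)))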
