Tuesday, January 16, 2018

Estimation Theory (Jan 15)

In the third lecture, we looked more thoroughly at estimation theory, which in our context is closely linked with regression: predicting numerical values (regression) as opposed to predicting classes (classification).

In the beginning, there was a demo of predicting house prices in the Hervanta area of Tampere. For this, I had downloaded a price database and preprocessed it in Excel to produce a CSV file. The code itself binarizes all categorical attributes (e.g., the condition attribute has three values and is converted to three binary indicators), fits a scikit-learn linear regression model, and predicts prices for the 20% of samples held out of the training set.


# -*- coding: utf-8 -*-
"""
Created on Mon Jan 15 09:08:53 2018

@author: hehu
"""

import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

if __name__ == "__main__":
    
    file = "prices.csv"
    
    X = []
    y = []
    
    # Load data line by line. Attributes (apt. size, year, etc) are
    # added to X, and the target (actual selling price) to y.
    
    with open(file, "r") as f:
        for line in f:
            
            # Skip the header row
            if line.startswith("num_rooms"):
                continue
            
            parts = line.strip().split(";")
            
            rooms = int(parts[0])
            kind  = parts[1]
            
            # Numbers use Finnish locale with decimals separated by comma.
            # Just use replace(), although the proper way would be with
            # locale module.
            
            sqm   = float(parts[2].replace(",", "."))
            price = float(parts[3])
            year  = int(parts[4])
            elev  = parts[5]
            cond  = parts[6]
            
            X.append([rooms, kind, sqm, year, elev, cond])
            y.append(price)
            
    X = np.array(X)
    y = np.array(y)
    
    # Binarize categorical attributes (kind, has_elevator, condition).
    # LabelBinarizer produces one 0/1 column per class (a single column
    # when there are only two classes); the new columns are appended
    # to the right of X.
    
    binarized_cols = [1, 4, 5]
    
    for col in binarized_cols:
        lb = LabelBinarizer()
        z = lb.fit_transform(X[:, col])
        X = np.append(X, z, axis=1)
        
    # Remove the original categorical columns; iterate in reverse so
    # the remaining indices stay valid after each deletion.
    for col in binarized_cols[::-1]: 
        X = np.delete(X, col, axis=1)

    X = X.astype(float)
    y = y.astype(float)

    # Split to train and test
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                        random_state=42)
    
    # Fit the regression model and predict.
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    error = mean_absolute_error(y_test, y_pred)
    
    # Show true and predicted prices side by side.
    print(np.column_stack((y_test, y_pred)))
    
    print("Mean absolute error: {:.1f} eur/sqm".format(error))
    print("Coefficients: " + str(model.coef_))
    # Coefficient order: num_rooms, sqm, year, then the binarized
    # kind/elevator/condition indicators.

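As a side note on the binarization step: below is a tiny standalone example (my own illustration, not part of the demo) of what LabelBinarizer does with a three-valued attribute such as the apartment condition. Each class becomes one 0/1 indicator column; with only two classes, a single column is produced.

from sklearn.preprocessing import LabelBinarizer

# Illustration only: binarizing a three-valued categorical attribute.
lb = LabelBinarizer()
z = lb.fit_transform(["good", "poor", "fair", "good"])
print(lb.classes_)   # ['fair' 'good' 'poor']
print(z)             # one indicator column per class:
                     # [[0 1 0]
                     #  [0 0 1]
                     #  [1 0 0]
                     #  [0 1 0]]
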
Next, we looked at maximum likelihood estimation. The main principle is to choose a model for the data (e.g., my data is a noisy sinusoid: x[n] = A * sin(2*pi*f*n + phi) + w[n]) and write down the probability of observing exactly these samples x[0], x[1], ..., x[N-1] given a parameter value (e.g., A). The question is then simply to maximize this likelihood function with respect to the unknown parameter (e.g., A). In most cases the trick is to take the logarithm before differentiating; otherwise the mathematics is too hard to solve. In the second lecture, we saw an example MLE problem solved on the whiteboard.
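To make this concrete, here is a minimal sketch (my own illustration, not the lecture code) for the sinusoid model above, assuming the frequency f and phase phi are known and the noise w[n] is Gaussian. In that case, maximizing the log-likelihood with respect to A is equivalent to minimizing the squared error, and the estimate has the closed form A_hat = sum(x[n]*s[n]) / sum(s[n]^2), where s[n] = sin(2*pi*f*n + phi).

import numpy as np

# Illustration only: ML estimate of the amplitude A in
# x[n] = A*sin(2*pi*f*n + phi) + w[n], with Gaussian noise w[n]
# and known f and phi. Maximizing the log-likelihood over A
# reduces to least squares, which has a closed-form solution.

N = 100
f = 0.05       # normalized frequency (assumed known)
phi = 0.3      # phase (assumed known)
A_true = 2.0

n = np.arange(N)
s = np.sin(2 * np.pi * f * n + phi)

rng = np.random.default_rng(42)
x = A_true * s + rng.normal(scale=0.5, size=N)

# A_hat maximizes the likelihood (equivalently, minimizes squared error)
A_hat = np.sum(x * s) / np.sum(s * s)
print("True A: {:.3f}, ML estimate: {:.3f}".format(A_true, A_hat))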

