We started with a demo of predicting house prices in the Hervanta area of Tampere. For this, I had downloaded a price database and preprocessed it in Excel to produce a CSV file. The code binarizes all categorical attributes (e.g., quality has three values and is converted to three binary indicators), fits a scikit-learn linear regression model, and predicts prices for the 20% of the samples held out of the training set.
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 15 09:08:53 2018

@author: hehu
"""

import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

if __name__ == "__main__":

    file = "prices.csv"

    X = []
    y = []

    # Load data line by line. Attributes (apt. size, year, etc) are
    # added to X, and the target (actual selling price) to y.
    with open(file, "r") as f:
        for line in f:

            # Skip first line
            if line.startswith("num_rooms"):
                continue

            parts = line.strip().split(";")

            rooms = int(parts[0])
            kind = parts[1]

            # Numbers use Finnish locale with decimals separated by comma.
            # Just use replace(), although the proper way would be with
            # locale module.
            sqm = float(parts[2].replace(",", "."))
            price = float(parts[3])
            year = int(parts[4])
            elev = parts[5]
            cond = parts[6]

            X.append([rooms, kind, sqm, year, elev, cond])
            y.append(price)

    X = np.array(X)
    y = np.array(y)

    # Binarize categorical attributes (has_elevator, etc.)
    binarized_cols = [1, 4, 5]

    for col in binarized_cols:
        lb = LabelBinarizer()
        z = lb.fit_transform(X[:, col])
        X = np.append(X, z, axis = 1)

    for col in binarized_cols[::-1]:
        X = np.delete(X, col, axis = 1)

    X = X.astype(float)
    y = y.astype(float)

    # Split to train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit the regression model and predict.
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    error = mean_absolute_error(y_test, y_pred)

    print(np.column_stack((y_test, y_pred)))
    print("Mean error: {:.1f} eur/sqm".format(error))
    print("Coefficients: " + str(model.coef_))  # cols: num_rooms, sqm, year...
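To make the binarization step concrete, here is a minimal pure-Python sketch of what turning a three-valued attribute into three binary indicators looks like. It is only an illustration, not scikit-learn's implementation: the helper `binarize` and the label names (`good`, `satisfactory`, `poor`) are made up for this example, since the actual values in the CSV are not shown here. Note also that scikit-learn's `LabelBinarizer` returns a single 0/1 column when an attribute has only two classes (as a yes/no elevator column would), whereas this sketch always produces one column per class.

```python
def binarize(labels):
    """One-hot encode a list of labels (hypothetical helper for illustration).

    Returns the sorted list of distinct classes and, for each label,
    a row with a 1 in the position of its class and 0 elsewhere.
    """
    classes = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in classes] for label in labels]
    return classes, rows


# Hypothetical condition labels; the real CSV may use different names.
quality = ["good", "poor", "satisfactory", "good"]
classes, rows = binarize(quality)

print(classes)
for label, row in zip(quality, rows):
    print(label, row)
```

Each row contains exactly one 1, so the regression model gets a separate coefficient for every category instead of treating the labels as ordered numbers.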
Next, we looked at maximum likelihood estimation. The main principle is to choose a model for the data (e.g., my data is a noisy sinusoid: x[n] = A * sin(2*pi*f*n + phi) + w[n]) and write down the probability of observing exactly these samples x[0], x[1], ..., x[N-1] given a parameter value (e.g., A). Then the only remaining question is how to maximize this likelihood function with respect to the unknown parameter (e.g., A). In most cases the trick is to take the logarithm before differentiating; otherwise the mathematics is too hard to solve. In the second lecture, we saw an example MLE problem solved on the whiteboard.
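As a small illustration of the recipe (not necessarily the whiteboard example), the sketch below estimates the amplitude A of the sinusoid model. It rests on assumptions that are mine, not from the lecture: the noise w[n] is white Gaussian with known standard deviation sigma, and f and phi are known. Under those assumptions the log-likelihood is -N/2 * log(2*pi*sigma^2) - sum((x[n] - A*s[n])^2) / (2*sigma^2) with s[n] = sin(2*pi*f*n + phi), and setting its derivative with respect to A to zero gives A_hat = sum(x[n]*s[n]) / sum(s[n]^2). All function names and parameter values are chosen for the example.

```python
import math
import random


def simulate(A, f, phi, sigma, N, seed=0):
    """Generate x[n] = A*sin(2*pi*f*n + phi) + w[n] with Gaussian w[n]."""
    rng = random.Random(seed)
    return [A * math.sin(2 * math.pi * f * n + phi) + rng.gauss(0.0, sigma)
            for n in range(N)]


def log_likelihood(A, x, f, phi, sigma):
    """Log-likelihood of amplitude A, assuming white Gaussian noise."""
    N = len(x)
    sq_err = sum((x[n] - A * math.sin(2 * math.pi * f * n + phi)) ** 2
                 for n in range(N))
    return -0.5 * N * math.log(2 * math.pi * sigma ** 2) - sq_err / (2 * sigma ** 2)


def mle_amplitude(x, f, phi):
    """Closed-form maximizer: A_hat = sum(x*s) / sum(s*s)."""
    s = [math.sin(2 * math.pi * f * n + phi) for n in range(len(x))]
    return sum(xn * sn for xn, sn in zip(x, s)) / sum(sn * sn for sn in s)


# Example values (assumptions for the demo only).
A_true, f, phi, sigma = 2.0, 0.05, 0.3, 0.5
x = simulate(A_true, f, phi, sigma, N=200)

A_hat = mle_amplitude(x, f, phi)

# Sanity check: brute-force search of the log-likelihood over a grid of A values.
grid = [i / 100 for i in range(0, 401)]
A_grid = max(grid, key=lambda a: log_likelihood(a, x, f, phi, sigma))

print("Closed-form estimate: {:.3f}".format(A_hat))
print("Grid-search estimate: {:.2f}".format(A_grid))
```

The grid search is only there to confirm that the closed-form expression really sits at the peak of the log-likelihood; in practice the formula alone is enough.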