DATA7703 Assignment 4
2021 Semester 2, due 23:59 15 Oct
Instructions. Please read the instructions carefully — not following them may result in a
penalty. (a) Submit your solutions as a single PDF file on Blackboard. Go to Assessment,
Assignment 4 to submit. If you don’t know how to convert your file to a PDF, please search
for a guide online. You can submit as many times as you want before the deadline. The last
submission will be graded. (b) Write down your name and student number on the first
page of your solution report, and write down the question numbers for your solutions.
For programming questions, you are welcome to submit your code files or output files in a
separate zip file, but you must include both your code and relevant output in your submitted
PDF file. Excessive code output may be penalised. (c) Follow integrity rules, and provide
citations as needed. You can discuss with your classmates, but you are required to write
your solutions independently, and specify who you have discussed with in your solution. If
you do not know how to solve a problem, you can get 15% of the mark by writing down “I
don’t know”.
You are encouraged to keep your solutions concise — these questions require thoughts, not
long answers.
1. (30 marks) In lecture, we discussed the architecture and representational power of neural nets, some training objectives, and training algorithms. These are important design
decisions when building your own neural nets. In this question, you will explore some
questions related to these.
(a) (10 marks) Consider a regression MLP that uses identity activation functions for all
neurons. A data scientist trains the neural network to minimize the MSE, and he
observes a much smaller training set MSE for the MLP as compared to that for OLS.
Is this possible? Justify your answer.
(b) (5 marks) In lecture, we discussed training a neural net f_w(x) for regression by
minimizing the MSE loss
L(w) = (1/n) ∑_{i=1}^{n} (f_w(x_i) − y_i)²,
where (x_1, y_1), . . . , (x_n, y_n) are the training examples. However, a large neural net can
easily fit irregularities in the training set, leading to poor generalization performance.
One way to improve generalization performance is to minimize a regularized loss
function
L_λ(w) = L(w) + (λ/2) ‖w‖²,
where λ > 0 is a user-specified constant. The regularizer (λ/2)‖w‖² assigns a larger
penalty to w with larger norms, thus reducing the network’s flexibility to fit irregularities in the training set. We can also interpret the regularizer as a way to encode
our preference for simpler models.
Show that a gradient descent step on L_λ(w) is equivalent to first multiplying w by a
constant, and then moving along the negative gradient direction of the original MSE
loss L(w).
(c) (10 marks) In lecture, we described how we can convert an output vector (o_1, . . . , o_C) ∈ R^C for a classification network to a probability distribution. The conversion operation is known as the softmax function, that is
softmax(o_1, . . . , o_C) = (e^{o_1}/Z, . . . , e^{o_C}/Z),
where Z = e^{o_1} + . . . + e^{o_C} is the normalization constant.
We can generalize the softmax function to the scaled softmax function
softmax_β(o_1, . . . , o_C) = (e^{β o_1}/Z_β, . . . , e^{β o_C}/Z_β),
where β > 0 is a user-specified constant, and Z_β = e^{β o_1} + . . . + e^{β o_C} is the normalization constant.
Consider applying the scaled softmax to an output vector in which the elements are
not all identical. Show that when β increases, the probability of the class with the
largest output value increases.
Intuitively, the above result implies that when training to maximize the likelihood,
we can give the classifier a larger incentive to be correct using a larger β.
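As a quick numerical illustration of this sharpening effect (not part of the required proof), the small helper below is hypothetical and simply implements the definition above:

```python
import math

def scaled_softmax(o, beta):
    # Scaled softmax: exponentiate beta * o_i, then normalize to a distribution.
    exps = [math.exp(beta * oi) for oi in o]
    z = sum(exps)
    return [e / z for e in exps]

o = [1.0, 2.0, 3.0]
for beta in (0.5, 1.0, 2.0):
    probs = scaled_softmax(o, beta)
    # The probability of the class with the largest output grows with beta.
    print(beta, max(probs))
```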
(d) (5 marks) We consider a binary classification problem with two numeric features
x_1 and x_2, and a label y ∈ {−1, +1}. Given a dataset consisting of four training
examples (0, 0), (0, 1), (1, 0), (1, 1), with labels −1, 1, 1, −1 respectively, can you find
a perceptron to correctly classify all these examples? That is, can you find w_1, w_2, b ∈ R such that
f(x_1, x_2) = +1 if w_1 x_1 + w_2 x_2 + b > 0, and −1 otherwise,
correctly classifies all four examples? Justify your answer.
2. (35 marks) In lecture, we demonstrated how to implement a neural net in PyTorch using
the OLS model as an example. In this question, you will implement another basic neural
net, a multi-class logistic regression model, and explore how to train a good model.
Recall that in a multi-class logistic regression model, the probability that an input x ∈ R^{d+1} belongs to class y ∈ {1, . . . , C} is given by
p(y | x, w_{1:C}) = e^{o_y} / ∑_{y′} e^{o_{y′}},
where o_y = x^⊤ w_y, and w_{1:C} = (w_1, . . . , w_C) are the parameters of the logistic regression
model. Note that as in lecture, here x has a dummy feature with value 1 in addition to
the d given features.
We can train a logistic regression model by minimizing the log-loss
min_{w_{1:C}} −(1/n) ∑_{i=1}^{n} ln p(y_i | x_i, w_{1:C}),
or equivalently, maximizing the log-likelihood.
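For reference, the log-loss above can be computed directly from a matrix of predicted probabilities; the helper below is illustrative only (0-indexed labels are an assumption, not the convention in logistic_regression.py):

```python
import numpy as np

def log_loss(probs, y):
    """Average negative log-likelihood.

    probs: (n, C) array of predicted class probabilities.
    y: (n,) array of class labels in {0, ..., C-1}.
    """
    n = probs.shape[0]
    # Pick out the probability assigned to each example's true class.
    return -np.mean(np.log(probs[np.arange(n), y]))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(log_loss(probs, y))  # small, since both predictions are confident and correct
```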
(a) (0 marks) You are given a file logistic_regression.py. The file contains code to load
the covtype dataset, create a train-test split, and train a LogisticRegression model
from sklearn. Run the code and experiment with it to understand it.
(b) (5 marks) Implement the predict and predict_proba functions in logistic_regression.py.
A naive way to calculate the class distribution is to compute the e^{o_y} values first, then
normalize them to a distribution. This approach often suffers from numerical overflow
in practice. To avoid this problem in your implementation, you can first subtract
the maximum of the o_y values from each of them, and then apply softmax to turn them into a
probability distribution.
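The subtract-the-max trick works because softmax is invariant to shifting all inputs by a constant. A minimal sketch (illustrative, not the required predict_proba implementation):

```python
import numpy as np

def stable_softmax(o):
    # Softmax is shift-invariant: softmax(o) == softmax(o - c) for any constant c.
    # Subtracting the max keeps every exponent <= 0, so e^{o_y - max} is in (0, 1]
    # and cannot overflow.
    shifted = o - np.max(o)
    exps = np.exp(shifted)
    return exps / exps.sum()

big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))  # finite, well-defined probabilities
# np.exp(big) alone would overflow to inf and give nan after normalization.
```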
(c) (10 marks) Implement the fit function in logistic_regression.py to support training a logistic regression model by using gradient descent to minimize the log-loss.
Your implementation should allow users to specify the learning rate and number of
iterations as written in the documentation for the fit function.
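As a hint of the overall shape only (the function signature, tensor shapes, and use of cross_entropy here are assumptions, not the actual skeleton in logistic_regression.py), a minimal gradient-descent loop for this model might look like:

```python
import torch

def fit(X, y, C, lr=0.1, n_iters=100):
    """Sketch of gradient descent for multi-class logistic regression.

    X: (n, d+1) float tensor with the dummy feature included.
    y: (n,) long tensor of class labels in {0, ..., C-1}.
    """
    n, d1 = X.shape
    W = torch.zeros(d1, C, requires_grad=True)  # one weight vector per class
    for _ in range(n_iters):
        logits = X @ W  # o_y = x^T w_y for every class at once
        # cross_entropy applies log-softmax internally, so this is the log-loss.
        loss = torch.nn.functional.cross_entropy(logits, y)
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad  # plain gradient-descent step
            W.grad.zero_()
    return W
```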
(d) (5 marks) Tune the learning rate and number of learning iterations to minimize the
log-loss as much as possible. Describe how you do this. Report the training log-loss
of the model that you obtain, and the training and test accuracies.
(e) (5 marks) Gradient descent can converge very slowly in practice, and various more
efficient variants have been proposed. One variant is the momentum method.
Recall that when minimizing a general loss function L(w), gradient descent with a
constant learning rate η updates the t-th iterate w_t to w_{t+1} = w_t − η g_t, where g_t = ∇L(w_t). Gradient descent with momentum further moves along the previous direction
w_t − w_{t−1}, and has an update rule of the form
w_{t+1} = w_t − η g_t + β(w_t − w_{t−1}), (1)
where β > 0 is a constant. Intuitively, the momentum method tries to keep the
‘momentum’ by moving along a previous direction.
Show that if w_1 = 0, w_2 = w_1 − η g_1, and we apply the momentum method to obtain
w_t for t ≥ 3, then for any t ≥ 2, we have
w_{t+1} = w_t − η(g_t + β g_{t−1} + . . . + β^{t−1} g_1). (2)
(f) (5 marks) Modify your fit function to further support the momentum trick, and tune
the momentum constant β to try to make the log-loss as small as possible. Describe
how you do this. Report the training log-loss of the model that you obtain, and the
training and test accuracies.
You may find the torch.optim.SGD optimizer helpful.
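A minimal sketch of driving the update with torch.optim.SGD; the toy quadratic loss is purely illustrative. Note that PyTorch's SGD maintains a velocity buffer v_{t+1} = μ v_t + g_t and updates w_{t+1} = w_t − η v_{t+1}, which is algebraically equivalent to update rule (1) with β = μ:

```python
import torch

# Toy loss L(w) = ||w - target||^2, just to exercise the optimizer.
target = torch.tensor([1.0, -2.0])
w = torch.zeros(2, requires_grad=True)
# `momentum` plays the role of the constant beta in the update rule.
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

for _ in range(200):
    opt.zero_grad()
    loss = ((w - target) ** 2).sum()
    loss.backward()
    opt.step()
# w is now close to target.
```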
(g) (5 marks) Another useful trick to speed up convergence is to normalize all features
to have mean 0 and unit variance first. Implement this, and repeat (d) to try to use
gradient descent to find a good model. Comment on the effectiveness of this trick.
You may find sklearn.preprocessing.StandardScaler helpful.
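A short sketch of the normalization step with StandardScaler (the toy data is arbitrary); the key point is to fit the scaler on the training set only and reuse its statistics at test time:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # fit on training data only
print(X_scaled.mean(axis=0))  # ~0 for each feature
print(X_scaled.std(axis=0))   # ~1 for each feature
# At test time, reuse the same statistics: scaler.transform(X_test)
```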
3. (35 marks) In lecture, we discussed the Theil-Sen estimator and RANSAC as two subsampling-based regression algorithms for dealing with outliers. In this question, we study the robustness of another subsampling-based algorithm, which we call ENOLS (ENsemble of
OLS). For training, ENOLS builds an ensemble of OLS models by sampling k random
subsets (that is, subsets with no identical examples) of m training examples, and training an OLS
model on each of the subsets. Here k and m are user-specified hyper-parameters. For
prediction, ENOLS has two different strategies: (a) predict the mean of all the OLS
models’ outputs; (b) predict the median of all the OLS models’ outputs.
You will implement ENOLS and study its performance below.
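A minimal sketch of this training and prediction scheme, assuming NumPy arrays and sklearn's LinearRegression (the function names and signatures here are illustrative, not the skeleton in enols.py):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_enols(X, y, n_estimators=10, sample_size=50, random_state=0):
    """Train an ensemble of OLS models, each on a random subset of the data."""
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(n_estimators):
        # replace=False: a subset contains no identical examples.
        idx = rng.choice(len(X), size=sample_size, replace=False)
        models.append(LinearRegression().fit(X[idx], y[idx]))
    return models

def predict_enols(models, X, method="mean"):
    preds = np.stack([m.predict(X) for m in models])  # (n_estimators, n_test)
    # Strategy (a): mean of the OLS outputs; strategy (b): median.
    return preds.mean(axis=0) if method == "mean" else np.median(preds, axis=0)
```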
(a) (0 marks) You are given a file enols.py. In the main function, the code loads the
Boston house price dataset, generates a train-test split, generates a training set with
outliers, and compares the test MSEs for an OLS model trained on the original
training set, and the training set with outliers. Run the code and experiment with it
to understand it.
(b) (5 marks) Implement ENOLS’s training algorithm as described above in the fit
function of the ENOLS class.
(c) (5 marks) Implement ENOLS’s two prediction strategies as described above in the
predict function of the ENOLS class.
(d) (5 marks) For each proportion p in 0, 0.01, 0.02, . . . , 0.5, use the corrupt function
to generate a corrupted training set by making a proportion p of the examples outliers, then train an OLS model, a Theil-Sen estimator, and an ENOLS model (using
the default hyper-parameters), and measure their test MSEs. Plot the three models’
test MSEs against p, and compare their performance.
For reproducibility, set the random state of the corrupt function and fit function to
a number of your choice.
(e) (5 marks) Repeat (d), but set method="median" when using ENOLS for prediction.
How does the performance of ENOLS change as compared to (d)? Explain why.
(f) (5 marks) Repeat (e), but set n_estimators=500 when constructing an ENOLS model.
How does the performance of ENOLS change as compared to (e)? Explain why.
(g) (5 marks) Repeat (f), but set sample_size=42. How does the performance of ENOLS
change as compared to (f)? Explain why.
(h) (5 marks) The subset size m has an important effect on the performance of ENOLS.
Here we consider the strategy of setting m = αn, where 0 < α ≤ 1 is a fixed constant.
Suppose we are working on a problem with an outlier ratio of ε > 0, that is, in a
training set of n examples, there are εn outliers. Show that the expected proportion
of subsets that do not contain an outlier can be arbitrarily close to 0 for large n.
In other words, sampling a fixed proportion α of the training set is a bad choice in
this case.
For simplicity, you can assume that αn and εn are integers in your analysis.
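The formal argument is the question's to answer, but a quick numerical check with exact hypergeometric counting illustrates the claimed behaviour; the choices of α and ε below are arbitrary:

```python
from math import comb

def clean_subset_prob(n, alpha, eps):
    """P(a size-alpha*n subset sampled without replacement contains no outlier),
    when eps*n of the n training examples are outliers."""
    m = int(alpha * n)          # subset size m = alpha * n
    clean = n - int(eps * n)    # number of non-outlier examples
    # All-clean subsets over all subsets of size m.
    return comb(clean, m) / comb(n, m)

for n in (100, 1000, 10000):
    # The probability of drawing an outlier-free subset shrinks toward 0 as n grows.
    print(n, clean_subset_prob(n, alpha=0.2, eps=0.1))
```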