The University of Adelaide, School of Computer Science
Introduction to Statistical Machine Learning
Semester 2, 2019 Assignment 3: Implementation of PCA and Kernel PCA
DUE: 4 Nov. 2019, Monday 11:55 PM
Submission
Instructions and submission guidelines:
• You must sign an assessment declaration coversheet to submit with your assignment.
• Submit your assignment via Canvas (MyUni).
Kernel PCA and PCA
You are required to write code for performing PCA and kernel PCA (KPCA) on a given dataset.
After performing PCA/KPCA, you will use various classifiers to perform classification. You will
investigate the effect of applying PCA, KPCA and KPCA with different types of kernels.
Specifically, the tasks you need to complete are:
• Write code for performing PCA
• Apply PCA to the given dataset and then perform classification with the nearest
neighbour classifier. Analyse the performance change against different reduced
dimensions. (suggestion: from 256 to 10)
• Write code for performing KPCA. Use three kernel functions: the linear kernel
(equivalent to PCA), the Gaussian-RBF kernel, and the Hellinger kernel (see the
lecture slides).
• Apply KPCA to the given dataset and then perform classification with the nearest
neighbour classifier. Analyse the performance change against different reduced
dimensions. (suggestion: from 256 to 10)
• [Master students only] Append 256 noisy dimensions to the original data; that is, for
each sample xi, append a 256-dimensional random feature vector to xi to make it a
512-dimensional feature. Then repeat the KPCA experiments using a linear SVM as
the classifier. Test and analyse the results. The following lines give example code
(Matlab) for appending the noise features:

    % Let X be the original data
    [N,d] = size(X);
    R = randn(N,d); % using Gaussian noise in this experiment
    X_ = [X,R];
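As a rough illustration of the two algorithms in the task list above, the sketch below implements PCA via an eigendecomposition of the covariance matrix, and KPCA via an eigendecomposition of the centred kernel matrix, with the three required kernels. It is written in Python and assumes NumPy is acceptable as the numerical library; the function names, and the non-negative-feature assumption used for the Hellinger kernel, are illustrative choices rather than part of the assignment.

```python
import numpy as np

def pca(X, k):
    """Project X (N x d) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)               # centre the data
    cov = Xc.T @ Xc / Xc.shape[0]         # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]    # indices of the top-k eigenvalues
    W = vecs[:, order]                    # d x k projection matrix
    return Xc @ W                         # N x k projections

def kernel_matrix(X, kernel, sigma2=1.0):
    """N x N kernel matrix for the three kernels named in the assignment."""
    if kernel == "linear":
        return X @ X.T
    if kernel == "rbf":
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise ||xi - xj||^2
        return np.exp(-d2 / (2 * sigma2))
    if kernel == "hellinger":
        # Assumes k(x, y) = sum_j sqrt(x_j * y_j) with non-negative features.
        Xs = np.sqrt(np.maximum(X, 0))
        return Xs @ Xs.T
    raise ValueError(f"unknown kernel: {kernel}")

def kpca(X, k, kernel="linear", sigma2=1.0):
    """Project X onto the top-k kernel principal components."""
    K = kernel_matrix(X, kernel, sigma2)
    N = K.shape[0]
    One = np.full((N, N), 1.0 / N)
    Kc = K - One @ K - K @ One + One @ K @ One   # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[order], vecs[:, order]
    # Scale each unit eigenvector by sqrt(lambda_i) to get the projections.
    return vecs * np.sqrt(np.maximum(vals, 0))   # N x k projections
```

With the linear kernel, the KPCA projections agree with the PCA projections up to a per-component sign flip, which is a convenient sanity check for both implementations.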
Hint:
For the Gaussian-RBF kernel, you need to set the bandwidth hyper-parameter σ². An
empirical approach is to choose σ² = (1/(2N²)) Σ_{i,j} ||x_i − x_j||². You are encouraged to
scale this empirically chosen value up and down a little to find the best-performing
hyper-parameter.
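In Python (again assuming NumPy is available), this empirical choice can be computed directly from the pairwise squared distances; the function name `empirical_sigma2` is an illustrative choice:

```python
import numpy as np

def empirical_sigma2(X):
    """sigma^2 = (1 / (2 N^2)) * sum over i,j of ||x_i - x_j||^2."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return d2.sum() / (2 * X.shape[0] ** 2)
```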
You can choose either Matlab, Python, or C/C++. I would personally suggest Matlab or
Python.
The PCA and KPCA part of your code should not rely on any third-party toolbox. Only
Matlab's built-in APIs or the Python/C/C++ standard libraries are allowed. However, you
may use a third-party implementation of linear SVM for your experiments.
You are also required to submit a report (under 10 pages, in PDF format), which should have
the following sections (the report contributes 50% of the mark; the code the other 50%):
• An algorithmic description of PCA and KPCA. (5%)
• Your understanding of PCA and KPCA (anything that you believe is relevant to these
algorithms). (5%)
• Some analyses of your implementation. You should plot an error curve against the number
of reduced dimensions. For KPCA, results from different kernels should be plotted on the
same figure with different colours. (20% for master students and 40% for undergraduate
students)
• You should present and analyse the results of adding noisy dimensions before performing
PCA/KPCA. For example, you can compare the impact of the noisy dimensions on different
kernels in KPCA. (20%, for master students only)
In summary, you need to submit (1) the code that implements PCA/KPCA and (2) a report in
PDF.
Data
You will use the USPS dataset to test your model. It contains 10 classes, with 7291 training
samples and 2007 testing samples; the feature dimensionality is 256. The training data can
be found in usps.libsvm and the testing data in usps.t.libsvm.
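Since both files are in LIBSVM format (each line is a label followed by 1-based `index:value` pairs), a minimal loader that uses only the Python standard library plus NumPy might look like the sketch below; the function name and the dense-array representation are assumptions, not requirements of the assignment:

```python
import numpy as np

def load_libsvm(path, dim=256):
    """Parse a LIBSVM-format file into a dense (N x dim) array and label vector."""
    labels, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue                       # skip blank lines
            labels.append(float(parts[0]))     # first token is the class label
            x = np.zeros(dim)
            for tok in parts[1:]:
                idx, val = tok.split(":")
                x[int(idx) - 1] = float(val)   # LIBSVM indices are 1-based
            rows.append(x)
    return np.vstack(rows), np.array(labels)
```

For the USPS files, `load_libsvm("usps.libsvm")` and `load_libsvm("usps.t.libsvm")` would then give the training and testing matrices used throughout the experiments.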