Transfer Learning

Recap

Just to recap: we have several tasks, each with a support set and a query set. The tasks used for training have ground-truth labels for all examples in both the support and query sets, whereas the tasks used for validation or testing have ground-truth labels only for the support set, not the query set. We have to predict the query-set labels given the labeled support-set samples.

This idea was presented in 2019 (see References below). It has two stages: a Training stage and a Fine-Tuning (test) stage.

The goal here is: given abundant labeled data from the base classes, $\boldsymbol{X_b}$, and a small amount of labeled data from the novel classes (unseen during training), $\boldsymbol{X_n}$, the model should quickly learn to classify the novel classes from very few labeled examples.

Training Stage

From https://arxiv.org/pdf/1904.04232.pdf

We use $\boldsymbol{X_b}$ to train a standard classification model consisting of a feature extractor $f_{\theta}$, which is a typical CNN, and a classifier on top of it, $C(\cdot \mid \boldsymbol{W_b})$. Here, $\boldsymbol{W_b}$ is the weight matrix of the classifier. In the general setup the classifier can have many layers; here it is a single layer.

$$
\begin{aligned}
D &= \{\, T_1 \cup T_2 \cup \dots \cup T_i \cup \dots \,\} \\
  &= \{\, \{S_1, Q_1\} \cup \{S_2, Q_2\} \cup \dots \cup \{S_i, Q_i\} \cup \dots \,\} \\
  &= \{\, S_1 \cup Q_1 \cup S_2 \cup Q_2 \cup \dots \cup S_i \cup Q_i \cup \dots \,\} \\[4pt]
D &= \{x_i,\, y_i\}_{1}^{m} \quad \text{where } x_i \text{ and } y_i \text{ are the input image and the corresponding label}
\end{aligned}
$$

As all of these support and query sets have labels, we can use them to train both the feature extractor and the classifier, with cross-entropy as our loss function. This dataset $D$ is divided into training and validation splits to train the model.
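A minimal PyTorch sketch of this training stage. The backbone architecture, image size, class count, and hyperparameters here are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

d, num_base_classes = 512, 64

backbone = nn.Sequential(                      # stand-in for a typical CNN f_theta
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d),
)
classifier = nn.Linear(d, num_base_classes)    # C(.|W_b): a single linear layer

# Dummy stand-in for the labeled base-class data X_b
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, num_base_classes, (256,))
base_loader = DataLoader(TensorDataset(images, labels), batch_size=32)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(classifier.parameters()), lr=1e-3
)

for x, y in base_loader:                       # standard cross-entropy training
    loss = criterion(classifier(backbone(x)), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```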

Fine-Tuning Stage

From https://arxiv.org/pdf/1904.04232.pdf

For this stage we use the novel-class data $\boldsymbol{X_n}$ (classes unseen during training) to fine-tune only the classifier weights. In the figure above, the feature extractor $f_{\theta}$ is fixed; only the classifier weights, denoted $\boldsymbol{W_n}$, are learned from the few novel-class samples.

$$
\begin{aligned}
D &= \{\, T_1 \cup T_2 \cup \dots \cup T_i \cup \dots \,\} \\
  &= \{\, \{S_1, Q_1\} \cup \{S_2, Q_2\} \cup \dots \cup \{S_i, Q_i\} \cup \dots \,\} \\
  &= \{\, S_1 \cup Q_1 \cup S_2 \cup Q_2 \cup \dots \cup S_i \cup Q_i \cup \dots \,\} \\[4pt]
D_{train} &= \{\, S_1 \cup S_2 \cup \dots \cup S_i \cup \dots \,\}, \text{ and} \\
D_{test}  &= \{\, Q_1 \cup Q_2 \cup \dots \cup Q_i \cup \dots \,\}
\end{aligned}
$$

Here, training (fine-tuning) happens on data from the support sets and is validated using the query-set data, as sketched below.
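Continuing the sketch above (reusing the hypothetical `backbone` and `d`), a minimal fine-tuning loop might look like this: freeze $f_{\theta}$, train a fresh classifier $C(\cdot \mid \boldsymbol{W_n})$ on the support set, then predict query-set labels. The 5-way 5-shot shapes and step count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_novel_classes = 5                          # e.g., a 5-way task

for p in backbone.parameters():                # fix the feature extractor f_theta
    p.requires_grad = False
backbone.eval()

novel_classifier = nn.Linear(d, num_novel_classes)   # new head C(.|W_n)
optimizer = torch.optim.SGD(novel_classifier.parameters(), lr=0.01)

# Dummy 5-way 5-shot support set (labels 0..4, five examples per class)
support_x = torch.randn(25, 3, 32, 32)
support_y = torch.arange(num_novel_classes).repeat_interleave(5)

for _ in range(100):                           # fine-tune only W_n
    logits = novel_classifier(backbone(support_x))
    loss = F.cross_entropy(logits, support_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Predict labels for the unlabeled query set
query_x = torch.randn(75, 3, 32, 32)
with torch.no_grad():
    query_preds = novel_classifier(backbone(query_x)).argmax(dim=1)
```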

Models

The authors of this paper propose two models that can be trained this way: Baseline and Baseline++. Both share the same feature extractor but use different classifiers.

$C(\cdot \mid \boldsymbol{W_b})$ in Baseline

From https://arxiv.org/pdf/1904.04232.pdf

Here, $d$ is the output dimension of the feature extractor $f_{\theta}$, and $c$ is the number of output classes. The rest follows the standard setup: a single linear layer mapping $z_i \in \mathbb{R}^d$ to $c$ logits, followed by a softmax.
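Concretely, under this standard linear-plus-softmax formulation (consistent with the figure), the predicted probability of class $j$ for input $X_i$ would be:

$$
p(y = j \mid X_i) = \frac{\exp\left( \mathbf{w}_j^{\top} z_i + b_j \right)}{\sum_{k=1}^{c} \exp\left( \mathbf{w}_k^{\top} z_i + b_k \right)}, \qquad z_i = f_{\theta}(X_i), \quad W_b \in \mathbb{R}^{d \times c}
$$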

$C(\cdot \mid \boldsymbol{W_b})$ in Baseline++

From https://arxiv.org/pdf/1904.04232.pdf

Here, $W_j \in \mathbb{R}^d$ is the $j^{th}$ column vector of the weight matrix; the weights in this column connect to the $j^{th}$ neuron in the classification layer. Likewise $z_i \in \mathbb{R}^d$, where $z_i = f_{\theta}(X_i)$.

For the $i^{th}$ training example $X_i$, we compute $z_i$ and then the cosine similarity between $z_i$ and each $W_j$, giving the similarity scores $S_i = \{\, \mathrm{sim}(z_i, W_1),\ \mathrm{sim}(z_i, W_2),\ \dots,\ \mathrm{sim}(z_i, W_c) \,\}$. These are normalized using a softmax to obtain class probabilities.

The learned weight vectors can be thought of as prototypes, or representative vectors, one per class: the input is assigned to the class whose prototype it is most similar to. For example, if the input is most similar to $W_2$, it is classified as class 2.
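A minimal PyTorch sketch of such a cosine-similarity classifier. The scale (temperature) factor and the dimensions are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Baseline++-style head: class scores are scaled cosine similarities
    between the feature z_i and per-class weight vectors W_j."""
    def __init__(self, d, num_classes, scale=10.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, d))  # one prototype W_j per class
        self.scale = scale                                   # softmax temperature (assumed value)

    def forward(self, z):
        z = F.normalize(z, dim=-1)       # unit-normalize features
        w = F.normalize(self.W, dim=-1)  # unit-normalize class vectors
        sims = z @ w.t()                 # cosine similarities sim(z_i, W_j)
        return self.scale * sims         # logits; softmax yields class probabilities

# Usage: probabilities over c classes for a batch of features z_i = f_theta(X_i)
d, c = 512, 5
clf = CosineClassifier(d, c)
z = torch.randn(8, d)
probs = F.softmax(clf(z), dim=-1)
```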

Pros and Cons

Pros

  • Easy to implement and train

  • For the feature extractor, we can use networks pre-trained on large datasets such as ImageNet.

  • Fast to train on GPUs, as it is heavily parallelizable and can be trained across multiple GPUs (i.e., distributed training).

Cons

  • Because the model is trained on the large base-class dataset, it can overfit to the base classes, so the learned features may not transfer well to novel classes.

  • There is a very good chance that the hyperparameters used for training (optimizer, number of epochs, learning rate, etc.) will not work during fine-tuning in the cross-domain setting. Novel tasks may come from a different distribution, and the classifier weights may have to be re-tuned from scratch to reach the desired performance. It also struggles when the data itself is cross-domain, e.g., a few examples from automobiles and a few from animals.

  • It requires a huge amount of data up front, which is infeasible in some settings, e.g., when data is collected during robot exploration.

Source code in PyTorch (official)

References

Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., & Huang, J.-B. (2019). A Closer Look at Few-shot Classification. ICLR 2019. https://arxiv.org/abs/1904.04232
