Transfer Learning

Recap

Just to recap: we have several tasks, each with a support set and a query set. The tasks used for training have ground-truth labels for all examples in both the support and query sets, whereas the tasks used for validation or testing have ground-truth labels only for the support set, not the query set. We have to predict the query-set labels given the labeled support-set samples.

This idea was presented in 2019 (see References below). It has two stages: a Training stage and a Fine-Tuning (test) stage.

The goal here is: given abundant labeled data from the base classes, $\boldsymbol{X_b}$, and a small amount of labeled data from the novel classes (unseen during training), $\boldsymbol{X_n}$, the model should quickly learn to classify the novel classes from very few labeled examples.

Training Stage

From https://arxiv.org/pdf/1904.04232.pdf

We use $\boldsymbol{X_b}$ to train a standard classification model consisting of a feature extractor $f_{\theta}$, which is a typical CNN, and a classifier on top of it, $C(\cdot \mid \boldsymbol{W_b})$. Here, $\boldsymbol{W_b}$ is the weight matrix of the classifier. In the general setup the classifier can have many layers; here it is a single layer.

$$
\begin{aligned}
D &= \{\, T_1 \cup T_2 \cup \dots \cup T_i \cup \dots \,\} \\
  &= \{\, \{S_1, Q_1\} \cup \{S_2, Q_2\} \cup \dots \cup \{S_i, Q_i\} \cup \dots \,\} \\
  &= \{\, S_1 \cup Q_1 \cup S_2 \cup Q_2 \cup \dots \cup S_i \cup Q_i \cup \dots \,\} \\[4pt]
D &= \{x_i,\, y_i\}_{1}^{m} \quad \text{where } x_i \text{ and } y_i \text{ are the input image and the corresponding label}
\end{aligned}
$$

As all of these support and query sets have labels, we can use them to train both the feature extractor and the classifier, with cross-entropy as our loss function. This dataset $D$ is divided into training and validation splits to train the model.
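A minimal PyTorch sketch of this training stage. The backbone architecture, image size, class count, and hyperparameters here are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

d, num_base_classes = 512, 64

backbone = nn.Sequential(                      # stand-in for a typical CNN f_theta
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d),
)
classifier = nn.Linear(d, num_base_classes)    # C(.|W_b): a single linear layer

# Dummy stand-in for the labeled base-class data X_b
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, num_base_classes, (256,))
base_loader = DataLoader(TensorDataset(images, labels), batch_size=32)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(classifier.parameters()), lr=1e-3
)

for x, y in base_loader:                       # standard cross-entropy training
    loss = criterion(classifier(backbone(x)), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```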

Fine-Tuning Stage

From https://arxiv.org/pdf/1904.04232.pdf

For this stage we use the novel-class data $\boldsymbol{X_n}$ (classes unseen during training) to fine-tune only the classifier weights. In the figure above, the feature extractor $f_{\theta}$ is fixed; only the classifier weights, denoted $\boldsymbol{W_n}$, are learned from the few novel-class samples.

$$
\begin{aligned}
D &= \{\, T_1 \cup T_2 \cup \dots \cup T_i \cup \dots \,\} \\
  &= \{\, \{S_1, Q_1\} \cup \{S_2, Q_2\} \cup \dots \cup \{S_i, Q_i\} \cup \dots \,\} \\
  &= \{\, S_1 \cup Q_1 \cup S_2 \cup Q_2 \cup \dots \cup S_i \cup Q_i \cup \dots \,\} \\[4pt]
D_{train} &= \{\, S_1 \cup S_2 \cup \dots \cup S_i \cup \dots \,\}, \text{ and} \\
D_{test}  &= \{\, Q_1 \cup Q_2 \cup \dots \cup Q_i \cup \dots \,\}
\end{aligned}
$$

Here, training (fine-tuning) happens on data from the support sets and is validated using the query-set data, as sketched below.
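Continuing the sketch above (reusing the hypothetical `backbone` and `d`), a minimal fine-tuning loop might look like this: freeze $f_{\theta}$, train a fresh classifier $C(\cdot \mid \boldsymbol{W_n})$ on the support set, then predict query-set labels. The 5-way 5-shot shapes and step count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_novel_classes = 5                          # e.g., a 5-way task

for p in backbone.parameters():                # fix the feature extractor f_theta
    p.requires_grad = False
backbone.eval()

novel_classifier = nn.Linear(d, num_novel_classes)   # new head C(.|W_n)
optimizer = torch.optim.SGD(novel_classifier.parameters(), lr=0.01)

# Dummy 5-way 5-shot support set (labels 0..4, five examples per class)
support_x = torch.randn(25, 3, 32, 32)
support_y = torch.arange(num_novel_classes).repeat_interleave(5)

for _ in range(100):                           # fine-tune only W_n
    logits = novel_classifier(backbone(support_x))
    loss = F.cross_entropy(logits, support_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Predict labels for the unlabeled query set
query_x = torch.randn(75, 3, 32, 32)
with torch.no_grad():
    query_preds = novel_classifier(backbone(query_x)).argmax(dim=1)
```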

Models

The authors of this paper propose two models that can be trained this way: Baseline and Baseline++. Both share the same feature extractor but use different classifiers.

$C(\cdot \mid \boldsymbol{W_b})$ in Baseline

From https://arxiv.org/pdf/1904.04232.pdf

Here, $d$ is the output dimension of the feature extractor $f_{\theta}$, and $c$ is the number of output classes. The rest follows the standard setup: a single linear layer mapping $z_i \in \mathbb{R}^d$ to $c$ logits, followed by a softmax.
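Concretely, under this standard linear-plus-softmax formulation (consistent with the figure), the predicted probability of class $j$ for input $X_i$ would be:

$$
p(y = j \mid X_i) = \frac{\exp\left( \mathbf{w}_j^{\top} z_i + b_j \right)}{\sum_{k=1}^{c} \exp\left( \mathbf{w}_k^{\top} z_i + b_k \right)}, \qquad z_i = f_{\theta}(X_i), \quad W_b \in \mathbb{R}^{d \times c}
$$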

$C(\cdot \mid \boldsymbol{W_b})$ in Baseline++

From https://arxiv.org/pdf/1904.04232.pdf

Here, $W_j \in \mathbb{R}^d$ is the $j^{th}$ column vector of the weight matrix; the weights in this column connect to the $j^{th}$ neuron in the classification layer. Likewise $z_i \in \mathbb{R}^d$, where $z_i = f_{\theta}(X_i)$.

For the $i^{th}$ training example $X_i$, we compute $z_i$ and then the cosine similarity between $z_i$ and each $W_j$, giving the similarity scores $S_i = \{\, \mathrm{sim}(z_i, W_1),\ \mathrm{sim}(z_i, W_2),\ \dots,\ \mathrm{sim}(z_i, W_c) \,\}$. These are normalized using a softmax to obtain class probabilities.

The learned weight vectors can be thought of as prototypes, or representative vectors, one per class: the input is assigned to the class whose prototype it is most similar to. For example, if the input is most similar to $W_2$, it is classified as class 2.
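A minimal PyTorch sketch of such a cosine-similarity classifier. The scale (temperature) factor and the dimensions are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Baseline++-style head: class scores are scaled cosine similarities
    between the feature z_i and per-class weight vectors W_j."""
    def __init__(self, d, num_classes, scale=10.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, d))  # one prototype W_j per class
        self.scale = scale                                   # softmax temperature (assumed value)

    def forward(self, z):
        z = F.normalize(z, dim=-1)       # unit-normalize features
        w = F.normalize(self.W, dim=-1)  # unit-normalize class vectors
        sims = z @ w.t()                 # cosine similarities sim(z_i, W_j)
        return self.scale * sims         # logits; softmax yields class probabilities

# Usage: probabilities over c classes for a batch of features z_i = f_theta(X_i)
d, c = 512, 5
clf = CosineClassifier(d, c)
z = torch.randn(8, d)
probs = F.softmax(clf(z), dim=-1)
```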

Pros and Cons

Pros

  • Easy to implement and train

  • For the feature extractor, we can use networks pre-trained on large datasets such as ImageNet.

  • Fast to train on GPUs, as it is heavily parallelizable and can be trained across multiple GPUs (i.e., distributed training).

Cons

  • Because the model is trained on the large base-class dataset, it can overfit to the base classes, so the learned features may not transfer well to novel classes.

  • There is a very good chance that the hyperparameters used for training (optimizer, number of epochs, learning rate, etc.) will not work during fine-tuning in the cross-domain setting. Novel tasks may come from a different distribution, and the classifier weights may have to be re-tuned from scratch to reach the desired performance. It also struggles when the data itself is cross-domain, e.g., a few examples from automobiles and a few from animals.

  • It requires a huge amount of data up front, which is infeasible in some settings, e.g., when data is collected during robot exploration.

Source code in PyTorch (official)

References

Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., & Huang, J.-B. (2019). A Closer Look at Few-shot Classification. ICLR 2019. https://arxiv.org/abs/1904.04232
