
ResNets


ResNets were presented as an answer to the question "can stacking more layers enable the network to learn better?" The obstacle up to that point was vanishing / exploding gradients, which had largely been addressed by normalized initialization and intermediate normalization layers, enabling networks with tens of layers to start converging under stochastic gradient descent with backprop. Degradation then proved to be the next issue: as network depth increased, accuracy saturated and then degraded rapidly. Adding more layers to a suitably deep model led to higher training error, and this degradation was not caused by overfitting. One suggested workaround was to construct the deeper model from a shallower one by adding auxiliary layers that are identity mappings on top of the learned shallower layers, but in practice this didn't help optimization. Deep Residual Learning later took up the challenge.

ResNet Theory / Background

ResNet Architecture

Deep Residual Learning

Deep Residual Learning is useful in the ResNet sense because it allows very deep networks to be trained by reformulating the layers as learning residual functions with reference to the layer inputs, rather than directly learning unreferenced functions. This helps address the vanishing gradient problem and enables effective training of much deeper architectures.

There was a huge issue of exploding / vanishing gradients, along with an overall degradation of accuracy, in these models. The gradient issues were largely solved by normalization layers, while ResNet's use of residual skip connections specifically addresses the degradation problem.

The degradation problem ultimately motivated the use of residual learning:

  • In theory, if you add more layers to a NN, the network should be able to at least match the performance of a shallower network, simply by learning the identity function
    • i.e. the extra layers don't actually do anything
  • In practice, deep networks often perform worse (measured by a higher training error) as more layers are added - this is the degradation problem
  • The reason for this is that it's surprisingly hard for standard (non-residual) deep networks to learn the identity mapping through many non-linear layers. Ultimately the optimization process just can't find this solution
  • Residual learning re-formulates the problem
    • Instead of learning some full mapping $H(x)$
    • It aims to learn the residual $F(x) = H(x) - x$
    • So $H(x) = F(x) + x$
  • Why?
    • If the best thing for the extra layers to do is nothing (i.e. act as the identity on top of the shallower network), the residual $F(x)$ can just go to zero, which is easier for an optimizer to find
    • If the optimal function is not an identity function, it's often closer to identity than to zero, so learning a small "perturbation" from identity is easier for the optimizer than learning the whole mapping from scratch
      • If the residual converges to a non-zero solution, the intermediate mapping layers can be considered worthwhile, as they add some sort of new information
  • Altogether, residual connections make it easier for deep networks to learn functions that are close to identity mappings, which helps avoid the degradation problem and makes optimization easier

This entire thought process created the problem statement for residual learning frameworks / layers inside deep NNs.

Residual Learning

The idea of residual learning is to replace the approximation of an underlying latent mapping $H(x)$, which is approximated by a few stacked layers, with an approximation of the residual function $F(x) := H(x) - x$, where $x$ denotes the input to the first of these stacked layers - therefore $H(x) \approx F(x) + x$

Below, $F(x, \{W_i\})$ is the residual mapping to be learned - an example is $F = W_2 \, \sigma(W_1 x)$, in which $\sigma$ denotes the ReLU activation function. Most experiments show that an identity mapping on the shortcut is enough to solve the degradation problem.

(Figure: Identity Mapping Residual Arch)
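
As a quick illustration, here is a minimal sketch of a residual block of this form in PyTorch - the layer type (fully connected) and dimensions are made up for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: H(x) = F(x) + x, with F(x) = W2 * relu(W1 * x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual function F(x) = W2 * sigma(W1 * x)
        residual = self.w2(self.relu(self.w1(x)))
        # Identity shortcut: H(x) = F(x) + x
        return residual + x

# If the optimal mapping is the identity, the optimizer only needs to drive the
# weights toward zero, which is easier than fitting an identity with nonlinear layers.
x = torch.randn(8, 64)
block = ResidualBlock(64)
print(block(x).shape)  # torch.Size([8, 64])
```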

EfficientNet

Before EfficientNet it was popular to scale only one of three dimensions - depth, width, or image size. Research papers and empirical studies, which ultimately led to EfficientNet, showed it's critical to balance all dimensions, which can be achieved by scaling all three with a consistent ratio.

Compound Model Scaling

A function $Y_i = F_i(X_i)$ with operator $F_i$, output tensor $Y_i$, and input tensor $X_i$ of shape $(H_i, W_i, C_i)$, with spatial dimensions $(H_i, W_i)$ and channel dimension $C_i$, is called ConvNet layer $i$

A ConvNet appears as a composition of these layers: $N = F_k \odot \ldots \odot F_2 \odot F_1(X_1)$

Effectively, these layers are often split / partitioned into multiple stages and all layers in each stage share the same architecture - an example is ResNet which has 5 stages ($k = 5$), with all layers in each stage being the same convolutional type except the first layer, which performs down-sampling.
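
A minimal sketch of this stage-wise composition, with made-up layer counts and channel sizes:

```python
import torch
import torch.nn as nn

def make_stage(in_ch: int, out_ch: int, num_layers: int) -> nn.Sequential:
    """One stage: a down-sampling first layer followed by identical conv layers."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)

# N = F_k o ... o F_2 o F_1(X_1): the network is just the composition of its stages.
net = nn.Sequential(
    make_stage(3, 32, 2),
    make_stage(32, 64, 2),
    make_stage(64, 128, 3),
)
x = torch.randn(1, 3, 224, 224)
print(net(x).shape)  # torch.Size([1, 128, 28, 28])
```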

Scaling all three is important as they're all closely linked - if you increase the resolution of an image, the network also needs more depth (larger receptive fields to cover the bigger image) and more width (more channels to capture the finer-grained patterns the higher resolution exposes). Therefore, a compound scaling method which uniformly scales network width, depth, and resolution is required
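
A small sketch of compound scaling, assuming the compound-coefficient form from the EfficientNet paper (depth $\alpha^\phi$, width $\beta^\phi$, resolution $\gamma^\phi$, with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$); the alpha / beta / gamma constants are the ones reported for EfficientNet-B0, and the base network numbers below are made up:

```python
import math

# Compound scaling: scale depth, width, and resolution together with a single
# coefficient phi, instead of scaling one dimension in isolation.
# Constants reported in the EfficientNet paper (taken on trust here),
# subject to alpha * beta**2 * gamma**2 ~= 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth: int, base_width: int, base_resolution: int, phi: float):
    depth = math.ceil(base_depth * ALPHA ** phi)              # number of layers
    width = math.ceil(base_width * BETA ** phi)               # number of channels
    resolution = math.ceil(base_resolution * GAMMA ** phi)    # input image size
    return depth, width, resolution

for phi in range(4):
    print(phi, compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=phi))
```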

Contrastive Learning

We've looked into Contrastive Learning in another sub-document, and will copy this section over there, but this is a section specifically on Image based Contrastive Learning

SimCLR was one of the first, and most well-known, contrastive learning frameworks - it's simple, highly accurate, well researched, and heavily utilized. The main idea is to create two augmented copies of a single image and train the network so that the representations of the two copies agree. A con is that every image is effectively processed twice, doubling the data handled per batch, but that cost is cheap (in my opinion). Bootstrap Your Own Latent (BYOL) was later introduced to avoid the need for these large numbers of negative pairs.

Contrastive Learning Framework

Contrastive loss is used to learn a representation by maximizing the agreement between various augmented views of the same data example. To achieve this, there are 4 significant components:

  • A stochastic data augmentation module that creates new augmentations of the input
  • A neural network base encoder that takes the inputs and their augmentations and encodes them into dense vectors
  • A small neural network projection head that maps the encoded vectors into the projection space
  • A contrastive loss function that allows comparisons between the projected vectors

(Figure: Contrastive learning framework architecture)

Stochastic Data Augmentation

A minibatch of $N$ examples is sampled randomly, and the contrastive prediction task is defined on pairs of augmented examples derived from the minibatch - this results in $2N$ data points altogether

A memory bank isn't needed, as the training batch size varies from 256 to 8,192. Any given data example randomly returns two correlated views of the same example, denoted $\bar{x}_i$ and $\bar{x}_j$, which are known as the positive pair. The other $2(N-1)$ augmented examples are treated as negative examples. It's been shown that choosing the right data augmentation techniques can reduce the complexity of previous contrastive learning frameworks. Some of the common ones are:

  • Spatial geometric transformations like cropping, resizing, rotation, and cutouts
  • Appearance transformations like color distortion, brightness, contrast, saturation, Gaussian blur, or Sobel filtering

Models tend to improve when several augmentations are composed together, instead of only applying a single one
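
A minimal sketch of such a stochastic augmentation module using torchvision transforms - the specific transforms and their strengths are illustrative, not the exact values from any paper:

```python
from torchvision import transforms

# Compose several random augmentations; each call to `augment(img)` returns a
# different view, so applying it twice yields a correlated positive pair.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(img):
    """Return the positive pair (x_i, x_j) for one image."""
    return augment(img), augment(img)
```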

Neural Network Base Encoder

The NN base encoder $f(\cdot)$ extracts representation vectors from the augmented data examples - the commonly used ResNet was picked, giving $h_i = f(\bar{x}_i) = \mathrm{ResNet}(\bar{x}_i)$ where $h_i \in \mathbb{R}^d$ is the output after the average pooling layer.
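
A sketch of the base encoder, assuming a torchvision ResNet-50 with the final classification layer removed so that $f(\bar{x})$ returns the post-average-pooling representation $h$:

```python
import torch
import torch.nn as nn
from torchvision import models

# Base encoder f(.): ResNet-50 up to (and including) global average pooling.
resnet = models.resnet50(weights=None)
encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer

x = torch.randn(4, 3, 224, 224)        # a batch of augmented views
h = encoder(x).flatten(start_dim=1)    # h_i in R^d, here d = 2048
print(h.shape)                         # torch.Size([4, 2048])
```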

Small Neural Network Projection Head

A small neural network projection head $g(\cdot)$ maps the representation to the space where the contrastive loss is applied. The importance of this layer was evaluated with:

  • Identity mapping
  • Linear projection
  • Default non-linear projection with an additional hidden layer and ReLU activation

The results showed the non-linear projection is better than linear, and both are much better than no transformation (identity)

They used an MLP with one hidden layer to obtain $z_i = g(h_i) = W^{(2)} \sigma(W^{(1)} h_i)$, where $\sigma$ is the ReLU non-linearity

This is useful because the contrastive loss is defined on $z_i$ rather than on $h_i$: whatever information the contrastive objective discards is discarded in the projection space, so the representation $h_i$ before the projection head is shown to retain and form more information.
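
A sketch of the non-linear projection head $g(h) = W^{(2)} \sigma(W^{(1)} h)$; the hidden and output dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Projection head g(.): one hidden layer with ReLU, mapping h -> z.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),  # W^(1)
    nn.ReLU(),              # sigma
    nn.Linear(2048, 128),   # W^(2)
)

h = torch.randn(4, 2048)
z = projection_head(h)   # the contrastive loss is applied on z, not h
print(z.shape)           # torch.Size([4, 128])
```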

Contrastive Loss Function

Given a set $\{\bar{x}_k\}$ including a positive pair of examples $\bar{x}_i$ and $\bar{x}_j$, the contrastive prediction task aims to identify $\bar{x}_j$ in $\{\bar{x}_k\}_{k \neq i}$ for a given $\bar{x}_i$. For a positive pair of examples, the loss function is defined as

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

Where:

  • $\ell_{i,j}$ is the loss for the pair $(i, j)$
  • $\mathrm{sim}(z_i, z_j)$ is the similarity between $z_i$ and $z_j$
    • Typically $\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}$, i.e. the dot product between $\ell_2$-normalized $u$ and $v$ (cosine similarity)
  • $\tau$ is the temperature parameter
  • $\mathbb{1}_{[k \neq i]}$ is the indicator function (1 if $k \neq i$, 0 otherwise)

The final loss is calculated across all positive pairs, both $(i, j)$ and $(j, i)$, in a mini-batch

The loss above was named NT-Xent (Normalized Temperature-scaled Cross Entropy). It was compared against other commonly used contrastive loss functions like logistic loss and margin loss, and NT-Xent outperformed them with proper hyperparameter tuning
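
A compact sketch of NT-Xent for a batch of $2N$ projections, assuming rows $i$ and $i+N$ form the positive pairs (implementations differ in how they index pairs):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z: (2N, d) projections, where rows k and k+N are a positive pair."""
    n = z.shape[0] // 2
    z = F.normalize(z, dim=1)                      # l2-normalize so sim is cosine similarity
    sim = z @ z.t() / temperature                  # (2N, 2N) similarity matrix
    mask = torch.eye(2 * n, dtype=torch.bool)      # indicator 1[k != i]: exclude self-similarity
    sim = sim.masked_fill(mask, float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n),   # positive of i is i+N ...
                         torch.arange(0, n)])      # ... and positive of i+N is i
    # Cross entropy = -log softmax over the remaining 2N-1 candidates,
    # averaged over both (i, j) and (j, i) orderings of each positive pair.
    return F.cross_entropy(sim, targets)

z = torch.randn(8, 128)   # N = 4 examples -> 2N = 8 augmented views
print(nt_xent_loss(z))
```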

SimCLR Framework

The ultimate goal of this framework was to describe a better approach to learning visual representations without human supervision.

SimCLR outperforms previous work, is more straightforward, and does not require a memory bank

Significant components of the framework:

  • A contrastive prediction task requires combining multiple data augmentation operations, which results in effective representations
    • Unsupervised contrastive learning benefits from more significant data augmentation
    • In English, this means applying lots of different random changes (like cropping, flipping, rotating, color changes, etc.) to images. The model is trained to recognize that these different augmentations are "the same"
  • The quality of learned representations can be substantially improved by introducing a learnable non-linear transformation between the representation and contrastive loss
    • Basically this means you encourage the model to make the representations (feature vectors) of different augmented views of the same image similar, while making representations of different images dissimilar
    • Contrastive loss will penalize the model if the feature vectors of two augmented views of the same image are far apart, and reward it if they're similar
      • Common contrastive loss example is NT-Xent (Normalized Temperature-scaled Cross Entropy) loss
  • Representation learning with cross-entropy loss can be improved by normalizing embeddings and adjusting the temperature parameter appropriately
    • Temperature is a parameter in the contrastive loss function that controls how sharply the model distinguishes between similar and dissimilar pairs
      • A lower temperature makes the model focus more on making positive pairs very close, and negative very far apart
      • A higher temperature smooths out the differences, making the model less strict about separating pairs
    • Therefore, this equates to saying that adjusting the temperature to balance how hard the model pushes similar images together can improve the quality of the learned representations (see the temperature sketch after this list)
  • Contrastive learning benefits from larger batch sizes and extended training periods compared to its supervised counterpart
    • Larger batch size helps because it allows the model to compare more positives and negatives for each sample
    • Each batch is used to create positive and negative pairs, so the more examples inside of it the more comparisons!
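
A tiny sketch of the temperature effect mentioned above: the same similarity scores pushed through a softmax at different values of $\tau$ (the numbers are made up):

```python
import torch
import torch.nn.functional as F

# Cosine similarities of one anchor against one positive and three negatives (made up).
sims = torch.tensor([0.9, 0.5, 0.3, 0.1])

for tau in (0.1, 0.5, 1.0):
    probs = F.softmax(sims / tau, dim=0)
    # Lower tau -> probability mass concentrates on the most similar (positive) pair;
    # higher tau -> the distribution flattens and separation is less strict.
    print(f"tau={tau}: {probs.tolist()}")
```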