Architecture of Midjourney

Midjourney is a style transfer model based on an image optimization technique and a deep neural network architecture. Style transfer is a computer vision technique that takes two images—a content image and a style reference image—and blends them together so that the resulting output image retains the core elements of the content image, but appears to be “painted” in the style of the style reference image.

In other words, style transfer lets us recompose the content of one image in the style of another. If you've ever imagined what a photo might look like if it were painted by a famous artist, style transfer is the technique that turns this into a reality.

Basically, in Neural Style Transfer we have two images: a style image and a content image. We need to copy the style from the style image and apply it to the content image. By style, we basically mean the patterns, the brushstrokes, and so on.

So intuitively, how does this work? To understand this, here is a basic introduction to how ConvNets work.

ConvNets work on the basic principle of convolution. Say, for example, we have an image and a filter. We slide the filter over the image and take as output the weighted sum of the inputs covered by the filter, transformed by a nonlinearity such as sigmoid, ReLU, or tanh. Every filter has its own set of weights, which do not change during the convolution operation. This is depicted in the images below.

Here, the blue grid is the input. You can see the 3x3 region covered by the filter sliding across the input (dark blue region). The result of this convolution is called a feature map, depicted by the green grid.

Here are graphs of ReLU and Tanh activation functions;

ReLU activation function

Tanh activation function
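The two activation functions plotted above are one-liners in numpy. A quick sketch for reference:

```python
import numpy as np

def relu(x):
    # ReLU: passes positive values through unchanged, zeroes out negatives
    return np.maximum(0, x)

def tanh(x):
    # tanh: squashes inputs into the range (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
print(tanh(x))
```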

So, in a ConvNet, the input image is convolved with several filters, generating feature maps. These feature maps are then convolved with more filters, generating further feature maps. This is illustrated in the diagram below.

You can also see the term maxpool in the above image. A maxpool layer is mainly used for dimensionality reduction. In a maxpool operation, we simply slide a window, say of size 2x2, across the image and take as output the maximum of the values covered by the window. Here’s an example:
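The maxpool operation just described can be sketched as follows, again in plain numpy with a 2x2 window and a stride equal to the window size (a common convention, assumed here):

```python
import numpy as np

def maxpool2d(x, size=2):
    """Max pooling with a `size` x `size` window and stride `size`:
    each output cell is the maximum of its window."""
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = np.max(x[i:i + size, j:j + size])
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 1.],
              [5., 6., 7., 8.],
              [3., 1., 2., 0.]])
print(maxpool2d(x))
# [[4. 2.]
#  [6. 8.]]
```

Each 2x2 block of the 4x4 input collapses to a single value, halving both spatial dimensions.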

Examine the image below carefully and note the feature maps at different layers.

You can see that the maps in the lower layers look for low-level features such as lines or blobs (Gabor-filter-like patterns).

As we go to the higher layers, our features become increasingly complex. Intuitively, we can think of it this way: the lower layers capture low-level features such as lines and blobs, the layer above builds on these low-level features to compute slightly more complex features, and so on.

Thus, we can conclude that ConvNets develop a hierarchical representation of features. This property is the basis of style transfer.

While doing style transfer, we are not training a neural network. Rather, we start from a blank image composed of random pixel values, and we optimize a cost function by changing the pixel values of the image. In simple terms, we start with a blank canvas and a cost function, then iteratively modify each pixel so as to minimize that cost function. To put it another way: while training neural networks we update our weights and biases, but in style transfer we keep the weights and biases constant and instead update the image.
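The optimize-the-pixels idea above can be sketched with a toy example. In real style transfer the cost compares ConvNet feature maps of the content and style images; here a plain pixel-space MSE against a random "target" stands in as the cost function, purely to show gradient descent on the image rather than on network weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in target: in real style transfer the cost is computed from
# ConvNet feature maps, not raw pixels. This is a toy substitute.
target = rng.random((8, 8))

# Start from a "blank canvas" of random pixel values.
image = rng.random((8, 8))

lr = 0.1
for step in range(500):
    # Gradient of the cost 0.5 * ||image - target||^2 w.r.t. the pixels.
    grad = image - target
    # Update the *image*; no network weights are touched.
    image -= lr * grad

final_mse = float(np.mean((image - target) ** 2))
print(f"final MSE: {final_mse:.2e}")
```

The loop drives the cost toward zero by adjusting pixels only, which is exactly the inversion of ordinary training described in the paragraph above.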
