ArtCNN

Overview

ArtCNN is a collection of single-image super-resolution (SISR) CNNs optimised for anime content.

Two distinct architectures are currently offered:

The R architecture is offered in 2 sizes:

The C architecture is also offered in 2 sizes:

Regarding the suffixes:

You may occasionally find some models under the "Experiments" directories, which serve as a testing ground.

mpv Instructions

Add something like this to your mpv config:

vo=gpu-next
glsl-shader="path/to/shader/ArtCNN_C4F16_DS_CMP.glsl"

VapourSynth Instructions

ArtCNN is natively supported by vs-mlrt. Please follow the instructions found there.

Alternatively, you can also run the GLSL shaders with vs-placebo.
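
For example, here's a minimal sketch of that route, assuming vs-placebo's Shader filter and a 16-bit 4:4:4 clip; the source filter, shader path and output dimensions are placeholders you'll need to adjust.

import vapoursynth as vs
core = vs.core

# Placeholder source; use whatever source filter you normally do.
clip = core.lsmas.LWLibavSource("input.mkv")
# GLSL shaders generally want a high bit depth clip without chroma subsampling.
clip = core.resize.Bicubic(clip, format=vs.YUV444P16)
clip = core.placebo.Shader(
    clip,
    shader="path/to/shader/ArtCNN_C4F16_DS_CMP.glsl",
    width=clip.width * 2,    # adjust to the model's scale factor
    height=clip.height * 2,
)
clip.set_output()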

Examples

ArtCNN Example

FAQ

Why these architectures?

The original C architecture is the result of research I conducted during my MSc. Starting from EDSR, I miniaturised and thoroughly simplified the model to maximise quality within a very constrained performance budget. The subsequent R architecture was born from various experiments in scaling ArtCNN up while still trying to balance quality and performance.

My goal is to keep ArtCNN as vanilla as possible, so don't expect bleeding-edge ideas to be adopted until they've stood the test of time.

Why ReLU activations?

Using ReLU instead of fancier options like the GELU or the SiLU is a deliberate choice. The quality gains from switching to different activations are difficult to justify when you take into account the performance penalty. ArtCNN is aimed at video playback and encoding, where speed matters. If you want to understand why the ReLU is faster, feel free to check ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models.

Why the L1 loss?

The L1 loss (mean absolute error) is still the standard loss used to train state-of-the-art distortion-based SISR models. Researchers have attempted to design more sophisticated losses to better match human perception of quality, but ultimately the best results are often obtained by combining the L1 loss with something else, just to help it get out of local minima. This is very well detailed in Loss Functions for Image Restoration with Neural Networks.

I've been playing around with structural-similarity and frequency-domain losses, but the results so far have not been conclusive.
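
As a rough illustration of the "L1 plus something else" idea, here's a hedged Keras/TensorFlow sketch that mixes mean absolute error with an SSIM term; the 0.1 weight and the use of tf.image.ssim are illustrative assumptions, not the loss ArtCNN actually ships with.

import tensorflow as tf

def l1_plus_ssim(y_true, y_pred):
    # L1 (MAE) carries most of the weight; a small SSIM term is added purely
    # to help the optimiser escape local minima. The 0.1 factor is arbitrary.
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
    ssim = tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
    return l1 + 0.1 * (1.0 - ssim)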

Why AdamW?

AdamW's weight decay seems to help with generalisation. Models trained with Adam can often reduce the loss further, but test scores don't reflect that as an improvement. AdamW is also quickly becoming the new standard in recent papers.
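
For reference, swapping Adam for AdamW in Keras is a one-line change; the learning rate and weight decay below are placeholder values, not ArtCNN's training configuration.

import keras

# Decoupled weight decay is the only difference from plain Adam here.
optimizer = keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4)
# model.compile(optimizer=optimizer, loss="mae")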

Why this residual block design?

This was an entirely empirical choice as well. The usual Conv->ReLU->Conv->Add residual block from EDSR ended up slightly worse than Conv->ReLU->Conv->ReLU->Conv->Add even when employed on slightly larger models with a learning capacity advantage. I've experimented with deeper residual blocks, but they did not yield consistent improvements. SPAN has a similar residual block configuration if we exclude the attention mechanism, and the authors of NFNet found it to be an improvement as well.
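
In Keras terms, the block described above looks roughly like this; the filter count and kernel size are placeholders rather than ArtCNN's actual hyper-parameters.

from keras import layers

def residual_block(x, filters=64):
    # Conv -> ReLU -> Conv -> ReLU -> Conv -> Add; the final convolution has
    # no activation before the skip connection is added back.
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.Add()([x, skip])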

I've also experimented with bottlenecked residual blocks and inverted residuals. However, the 1x1 convolution layers used to reduce or expand channel dimensions introduce additional sequential dependencies, slowing down the model even when the total parameter count remains similar. ArtCNN is probably just too small for this to be useful.

Why depth to space?

The depth-to-space operation is generally better than transposed convolutions, and it's the standard upsampling method for SISR models. ArtCNN is also designed to have all of its convolution layers operate on LR feature maps, only upsampling them as the very last step. This is mostly done for speed, but it also greatly reduces the memory footprint.
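
A rough sketch of such an upsampling tail is below, assuming a TensorFlow backend and a 2x scale factor; the channel counts are placeholders. The convolution still runs on LR feature maps, and tf.nn.depth_to_space only rearranges channels into spatial resolution at the very end.

import tensorflow as tf
from keras import layers

def upsample_tail(x, out_channels=3, scale=2):
    # Expand channels while still in LR space, then shuffle them into a
    # (scale x scale) larger image as the final step.
    x = layers.Conv2D(out_channels * scale * scale, 3, padding="same")(x)
    return tf.nn.depth_to_space(x, block_size=scale)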

Why no channel attention?

The global average pooling layer required for channel attention is very slow. I'm not particularly against attention mechanisms though, and alternatives can be considered in the future if they show promising results. The last time I experimented with this, spatial attention was not only better but also faster.
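
For context, a squeeze-and-excitation style channel-attention block looks roughly like the sketch below; the global average pooling "squeeze" at the start is the step that serialises the pipeline and costs speed. This is only an illustration of what ArtCNN avoids, not something it contains.

from keras import layers

def se_block(x, filters=64, reduction=4):
    # Squeeze-and-excitation channel attention (not used in ArtCNN).
    # GlobalAveragePooling2D is the slow, serialising step.
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(filters // reduction, activation="relu")(w)
    w = layers.Dense(filters, activation="sigmoid")(w)
    w = layers.Reshape((1, 1, filters))(w)
    return layers.Multiply()([x, w])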

Why no vision transformers?

CNNs have a stronger inductive bias for image processing tasks, which means you don't need as much data or as large a model to get good results.

Papers like A ConvNet for the 2020s and ConvNets Match Vision Transformers at Scale have also shown that CNNs are still competitive with transformers, even at the scales at which transformers were designed to excel. This is likely also true against newer architectures like Mamba, see: The “it” in AI models is the dataset.

As an electrical engineer I also simply find CNNs more elegant.

Why Keras?

I'm just familiar with Keras. I've tried migrating to PyTorch a few times, but there was always something annoying enough about it for me to scrap the idea. As someone who really likes NumPy, naturally I also like JAX, and if I were to migrate away from Keras now I'd probably just go straight to Flax.