
<h1 align="center">Mish: Self Regularized <br> Non-Monotonic Activation Function</h1> <p align="center"> <a href="LICENSE" alt="License"> <img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" /></a> <a href="https://arxiv.org/abs/1908.08681v3" alt="ArXiv"> <img src="https://img.shields.io/badge/Paper-arXiv-blue.svg" /></a> <a href="https://scholar.googleusercontent.com/scholar.bib?q=info:j0C1gbodjP4J:scholar.google.com/&output=citation&scisdr=CgX0hbDMEOzUo74J6TM:AAGBfm0AAAAAX1QM8TNcu4tND6FEofKsXzM3cs1uCAAW&scisig=AAGBfm0AAAAAX1QM8Y5elaJ1IW-BKOuU1zFTYNp-QaNQ&scisf=4&ct=citation&cd=-1&hl=en" alt="Cite"> <img src="https://img.shields.io/badge/Cite-BibTex-blue.svg" /></a> <a href=" " alt="Citations"> <img src="https://img.shields.io/badge/Google Scholar-2111-lightgrey.svg" /></a> <a href="https://www.bmvc2020-conference.com/conference/papers/paper_0928.html" alt="Publication"> <img src="https://img.shields.io/badge/BMVC-2020-red.svg" /></a> <a href="https://console.paperspace.com/github/digantamisra98/Mish/blob/master/Layers_Acc.ipynb"> <img src="https://assets.paperspace.io/img/gradient-badge.svg" alt="Run on Gradient"/> </a> </p> <p align="center">BMVC 2020 <a href="https://www.bmvc2020-conference.com/assets/papers/0928.pdf" target="_blank">(Official Paper)</a></p> <br> <br> <details> <summary>Notes: (Click to expand)</summary> </details> <details> <summary>Changelogs/ Updates: (Click to expand)</summary> </details>

News/ Media Coverage:

<p float="center"> &emsp; &emsp; <a href="https://podcasts.apple.com/hu/podcast/mish-activation-function-with-diganta-misra-007/id1490681799?i=1000464407163" alt="Apple Podcasts"> <img src="podcast_logo/applepodcasts.png" width="150"/></a> <a href="https://open.spotify.com/episode/4sT9sxjSbAKtvJ6hTFg9zc" alt="Spotify"> <img src="https://github.com/digantamisra98/Mish/blob/master/podcast_logo/spotify.png" width="150"/></a> </p> <p float="center"> &emsp; &emsp; <a href="https://youtu.be/T2CRFROKcLM" alt="YouTube"> <img src="podcast_logo/yt1.png" width="100"/></a> </p> <p float="center"> &emsp; &emsp; <a href="https://www.youtube.com/watch?v=XRGu23hfzaQ" alt="YouTube"> <img src="podcast_logo/yt1.png" width="100"/></a> </p> <p float="center"> &emsp; &emsp; <a href="https://youtu.be/whOdg-yrgdI" alt="YouTube"> <img src="podcast_logo/yt1.png" width="100"/></a> </p> <p float="center"> &emsp; &emsp; <a href="https://www.youtube.com/watch?v=1U-7TWysqIg" alt="YouTube"> <img src="podcast_logo/yt1.png" width="100"/></a> <br> </p> <br> <details> <summary><a href="https://dlrl.ca/"><b>MILA/ CIFAR 2020 DLRLSS</b></a> (Click on arrow to view)</summary> <div style="text-align:center"><img src ="poster_landscape-1.png" width="1000"/></div> </details> <br> <details> <summary><b>Contents</b>: (Click to expand)</summary>
  1. Mish <br> a. Loss landscape
  2. ImageNet Scores
  3. MS-COCO
  4. Variation of Parameter Comparison<br> a. MNIST<br> b. CIFAR10<br>
  5. Significance Level <br>
  6. Results<br> a. Summary of Results (Vision Tasks)<br> b. Summary of Results (Language Tasks)<br>
  7. Try It!<br>
  8. Acknowledgements
  9. Cite this work
</details> <br>

Mish:

<p align="left"> <img width="500" src="Observations/Mish3.png"> </p> <p align="left"> <img src="https://latex.codecogs.com/gif.latex?f(x)&space;=&space;x\tanh&space;(softplus(x))&space;=&space;x\tanh(\ln&space;(1&space;&plus;&space;e^{x}))" title="f(x) = x\tanh (softplus(x)) = x\tanh(\ln (1 + e^{x}))" /></p>

The minimum of f(x) is observed to be ≈ -0.30884 at x ≈ -1.1924.<br> Mish has a parametric order of continuity of C<sup>∞</sup>.
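As a reference, a minimal PyTorch implementation of the function above (the class name here is illustrative; PyTorch ≥ 1.9 also ships a native `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F


class Mish(torch.nn.Module):
    """Mish activation: f(x) = x * tanh(softplus(x))."""

    def forward(self, x):
        # softplus(x) = ln(1 + exp(x)); PyTorch computes it in a numerically stable way
        return x * torch.tanh(F.softplus(x))
```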

Derivative of Mish, expressed in terms of Swish and the Δ(x) preconditioner:

<p align="left"> <img width="1000" src="Observations/Derivatives.png"> </p> <p align="left"> <img src="https://latex.codecogs.com/gif.latex?f'(x)&space;=&space;(sech^{2}(softplus(x)))(xsigmoid(x))&space;&plus;&space;\frac{f(x)}{x}" title="f'(x) = (sech^{2}(softplus(x)))(xsigmoid(x)) + \frac{f(x)}{x}" /></p>

Further simplifying:

<a href="https://www.codecogs.com/eqnedit.php?latex=f'(x)&space;=&space;\Delta(x)swish(x)&space;&plus;&space;\frac{f(x)}{x}" target="_blank"><img src="https://latex.codecogs.com/svg.latex?f'(x)&space;=&space;\Delta(x)swish(x)&space;&plus;&space;\frac{f(x)}{x}" title="f'(x) = \Delta(x)swish(x) + \frac{f(x)}{x}" /></a>

Alternative derivative form:

<a href="https://www.codecogs.com/eqnedit.php?latex=f'(x)&space;=&space;\frac{e^{x}\omega}{\delta^{2}}" target="_blank"><img src="https://latex.codecogs.com/svg.latex?f'(x)&space;=&space;\frac{e^{x}\omega}{\delta^{2}}" title="f'(x) = \frac{e^{x}\omega}{\delta^{2}}" /></a>

where:

<a href="https://www.codecogs.com/eqnedit.php?latex=\omega&space;=&space;4(x&plus;1)&plus;4e^{2x}&space;&plus;e^{3x}&space;&plus;e^{x}(4x&plus;6)" target="_blank"><img src="https://latex.codecogs.com/svg.latex?\omega&space;=&space;4(x&plus;1)&plus;4e^{2x}&space;&plus;e^{3x}&space;&plus;e^{x}(4x&plus;6)" title="\omega = 4(x+1)+4e^{2x} +e^{3x} +e^{x}(4x+6)" /></a>

<a href="https://www.codecogs.com/eqnedit.php?latex=\delta&space;=&space;2e^{x}&space;&plus;e^{2x}&space;&plus;2" target="_blank"><img src="https://latex.codecogs.com/svg.latex?\delta&space;=&space;2e^{x}&space;&plus;e^{2x}&space;&plus;2" title="\delta = 2e^{x} +e^{2x} +2" /></a>

We hypothesize that Δ(x) acts as a pre-conditioner, making the gradient smoother. Further details are provided in the paper.
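As a sanity check, the closed-form derivative above can be compared against autograd on a grid of points; a minimal sketch (plain PyTorch, function names are illustrative, and x = 0 is avoided since the f(x)/x form is indeterminate there):

```python
import torch
import torch.nn.functional as F

def mish(x):
    return x * torch.tanh(F.softplus(x))

def mish_grad_closed_form(x):
    # f'(x) = sech^2(softplus(x)) * x * sigmoid(x) + f(x) / x
    sp = F.softplus(x)
    return x * torch.sigmoid(x) / torch.cosh(sp) ** 2 + mish(x) / x

# grid that does not contain x = 0 exactly
x = torch.linspace(-5.0, 5.0, steps=100, dtype=torch.float64, requires_grad=True)
(autograd_grad,) = torch.autograd.grad(mish(x).sum(), x)
print(torch.allclose(autograd_grad, mish_grad_closed_form(x.detach()), atol=1e-10))
```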

Loss Landscape:

<div style="text-align:center"><img src ="llmish.gif" width="500" height="300"/></div>

To visit the interactive Loss Landscape visualizer, click here.

Loss landscape visualizations of a ResNet-20 trained on CIFAR-10 for 200 epochs with ReLU, Mish, and Swish (left to right):

<div style="text-align:center"><img src ="landscapes/d8v1.png" width="1000"/></div> <br>

Compared to both Swish and ReLU, Mish achieves higher accuracy, lower overall loss, and a smoother, better-conditioned loss landscape that is easier to optimize. For all loss landscape visualizations, please visit this readme.

We also investigate the output landscape of randomly initialized neural networks as shown below. Mish has a much smoother profile than ReLU.

<div style="text-align:center"><img src ="landscapes/landscape-1.png" width="1000"/></div>

ImageNet Scores:


For installing the DarkNet framework, please refer to darknet (AlexeyAB).

For PyTorch-based ImageNet scores, please refer to this readme.

| Network | Activation | Top-1 Accuracy | Top-5 Accuracy | cfg | Weights | Hardware |
|---|---|---|---|---|---|---|
| ResNet-50 | Mish | 74.244% | 92.406% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
| DarkNet-53 | Mish | 77.01% | 93.75% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
| DenseNet-201 | Mish | 76.584% | 93.47% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
| ResNext-50 | Mish | 77.182% | 93.318% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |

| Network | Activation | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| CSPResNet-50 | Leaky ReLU | 77.1% | 94.1% |
| CSPResNet-50 | Mish | 78.1% | 94.2% |
| Pelee Net | Leaky ReLU | 70.7% | 90% |
| Pelee Net | Mish | 71.4% | 90.4% |
| Pelee Net | Swish | 71.5% | 90.7% |
| CSPPelee Net | Leaky ReLU | 70.9% | 90.2% |
| CSPPelee Net | Mish | 71.2% | 90.3% |

Results on CSPResNext-50:

| MixUp | CutMix | Mosaic | Blur | Label Smoothing | Leaky ReLU | Swish | Mish | Top-1 Accuracy | Top-5 Accuracy | cfg | weights |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | :heavy_check_mark: | | | 77.9%(=) | 94%(=) | | |
| :heavy_check_mark: | | | | | :heavy_check_mark: | | | 77.2%(-) | 94%(=) | | |
| | :heavy_check_mark: | | | | :heavy_check_mark: | | | 78%(+) | 94.3%(+) | | |
| | | :heavy_check_mark: | | | :heavy_check_mark: | | | 78.1%(+) | 94.5%(+) | | |
| | | | :heavy_check_mark: | | :heavy_check_mark: | | | 77.5%(-) | 93.8%(-) | | |
| | | | | :heavy_check_mark: | :heavy_check_mark: | | | 78.1%(+) | 94.4%(+) | | |
| | | | | | | :heavy_check_mark: | | 64.5%(-) | 86%(-) | | |
| | | | | | | | :heavy_check_mark: | 78.9%(+) | 94.5%(+) | | |
| | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | | | 78.5%(+) | 94.8%(+) | | |
| | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | | | :heavy_check_mark: | 79.8%(+) | 95.2%(+) | cfg | weights |

Results on CSPResNet-50:

| CutMix | Mosaic | Label Smoothing | Leaky ReLU | Mish | Top-1 Accuracy | Top-5 Accuracy | cfg | weights |
|---|---|---|---|---|---|---|---|---|
| | | | :heavy_check_mark: | | 76.6%(=) | 93.3%(=) | | |
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | 77.1%(+) | 94.1%(+) | | |
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | 78.1%(+) | 94.2%(+) | cfg | weights |

Results on CSPDarkNet-53:

| CutMix | Mosaic | Label Smoothing | Leaky ReLU | Mish | Top-1 Accuracy | Top-5 Accuracy | cfg | weights |
|---|---|---|---|---|---|---|---|---|
| | | | :heavy_check_mark: | | 77.2%(=) | 93.6%(=) | | |
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | 77.8%(+) | 94.4%(+) | | |
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: | 78.7%(+) | 94.8%(+) | cfg | weights |

Results on SpineNet-49:

| CutMix | Mosaic | Label Smoothing | ReLU | Swish | Mish | Top-1 Accuracy | Top-5 Accuracy | cfg | weights |
|---|---|---|---|---|---|---|---|---|---|
| | | | :heavy_check_mark: | | | 77%(=) | 93.3%(=) | - | - |
| | | :heavy_check_mark: | | :heavy_check_mark: | | 78.1%(+) | 94%(+) | - | - |
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | | :heavy_check_mark: | 78.3%(+) | 94.6%(+) | - | - |

MS-COCO:


For PyTorch-based MS-COCO scores, please refer to this readme.

| Model | Mish | AP50...95 | mAP50 | CPU, 90 W, FP32 (Intel Core i7-6700K, 4 GHz, 8 logical cores), OpenCV-DLIE, FPS | VPU, 2 W, FP16 (Intel MyriadX), OpenCV-DLIE, FPS | GPU, 175 W, FP32/16 (Nvidia GeForce RTX 2070), DarkNet-cuDNN, FPS |
|---|---|---|---|---|---|---|
| CSPDarkNet-53 (512 x 512) | | 42.4% | 64.5% | 3.5 | 1.23 | 43 |
| CSPDarkNet-53 (512 x 512) | :heavy_check_mark: | 43% | 64.9% | - | - | 41 |
| CSPDarkNet-53 (608 x 608) | :heavy_check_mark: | 43.5% | 65.7% | - | - | 26 |

| Architecture | Mish | CutMix | Mosaic | Label Smoothing | Size | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|
| CSPResNext50-PANet-SPP | | | | | 512 x 512 | 42.4% | 64.4% | 45.9% |
| CSPResNext50-PANet-SPP | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 512 x 512 | 42.3% | 64.3% | 45.7% |
| CSPResNext50-PANet-SPP | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 512 x 512 | 42.3% | 64.2% | 45.8% |
| CSPDarkNet53-PANet-SPP | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 512 x 512 | 42.4% | 64.5% | 46% |
| CSPDarkNet53-PANet-SPP | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 512 x 512 | 43% | 64.9% | 46.5% |

Credits to AlexeyAB, Wong Kin-Yiu and Glenn Jocher for all the help with benchmarking MS-COCO and ImageNet.

Variation of Parameter Comparison:

MNIST:

To observe how increasing network depth while keeping all other parameters constant affects test accuracy, fully connected networks of varying depths, each layer containing 500 neurons, were trained on MNIST. Residual connections were not used, since they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization, along with a dropout of 25%. The networks were optimized using SGD with a batch size of 128, and for a fair comparison the same learning rate was maintained for each activation function. In these experiments, all three activations maintained nearly the same test accuracy for a 15-layer network. Increasing the depth beyond 15 layers resulted in a sharp drop in test accuracy for Swish and ReLU, whereas Mish outperformed both in deeper networks where optimization becomes difficult. (A minimal sketch of this setup is given after the plots below.)

The consistency of Mish providing better test top-1 accuracy than Swish and ReLU was also observed when increasing the batch size for a ResNet v2-20 on CIFAR-10 trained for 50 epochs, with all other network parameters kept constant for a fair comparison.

<p float="left"> <img src="Observations/layersacc.png" width="400"/> <img src="Observations/batchacc.png" width="400"/> </p>

Gaussian noise with varying standard deviation was added to the input for MNIST classification using a simple conv net, to observe the trend in decreasing test top-1 accuracy for Mish and compare it to that of ReLU and Swish. Mish mostly maintained a consistent lead over both Swish and ReLU (falling below ReLU in just 1 instance and below Swish in 3 instances), as shown below. The trend for test loss was observed following the same procedure (Mish has better loss than both Swish and ReLU except in 1 instance). A sketch of the noise injection follows the plots below.

<p float="left"> <img src="Observations/noise.png" width="400"/> <img src="Observations/noise1.png" width="400"/> </p>

CIFAR10:

<p float="left"> <img src="Observations/initc10.png" width="400"/> <img src="Observations/densec10.png" width="400"/> </p>

Significance Level:

P-values were computed for different activation functions in comparison to Mish, in terms of top-1 test accuracy of a SqueezeNet model trained on CIFAR-10 for 50 epochs over 23 runs, using the Adam optimizer with a learning rate of 0.001 and a batch size of 128. Mish beats most of the activation functions at a high significance level across the 23 runs; in particular, it beats ReLU at a significance of P < 0.0001. Mish also has a comparatively lower standard deviation across the 23 runs, indicating more consistent performance.
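The statistics in the table below can be recomputed (up to rounding) from the 23 per-run accuracies; a minimal sketch using SciPy, assuming `mish_acc` and `other_acc` hold the 23 top-1 test accuracies per activation (the exact test variant used in the paper is not restated here, a standard two-sample t-test is shown):

```python
import numpy as np
from scipy import stats

def compare_to_mish(mish_acc, other_acc):
    """Two-sample t-test and Cohen's d between per-run top-1 accuracies."""
    mish_acc, other_acc = np.asarray(mish_acc), np.asarray(other_acc)
    _, p_value = stats.ttest_ind(mish_acc, other_acc)  # two-sided, equal variances assumed
    n1, n2 = len(mish_acc), len(other_acc)
    pooled_sd = np.sqrt(((n1 - 1) * mish_acc.var(ddof=1) +
                         (n2 - 1) * other_acc.var(ddof=1)) / (n1 + n2 - 2))
    cohens_d = abs(mish_acc.mean() - other_acc.mean()) / pooled_sd
    return p_value, cohens_d
```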

| Activation Function | Mean Accuracy | Mean Loss | Standard Deviation of Accuracy | P-value | Cohen's d Score | 95% CI |
|---|---|---|---|---|---|---|
| Mish | 87.48% | 4.13% | 0.3967 | - | - | - |
| Swish-1 | 87.32% | 4.22% | 0.414 | P = 0.1973 | 0.386 | -0.3975 to 0.0844 |
| E-Swish (β=1.75) | 87.49% | 4.156% | 0.411 | P = 0.9075 | 0.034444 | -0.2261 to 0.2539 |
| GELU | 87.37% | 4.339% | 0.472 | P = 0.4003 | 0.250468 | -0.3682 to 0.1499 |
| ReLU | 86.66% | 4.398% | 0.584 | P < 0.0001 | 1.645536 | -1.1179 to -0.5247 |
| ELU(α=1.0) | 86.41% | 4.211% | 0.3371 | P < 0.0001 | 2.918232 | -1.2931 to -0.8556 |
| Leaky ReLU(α=0.3) | 86.85% | 4.112% | 0.4569 | P < 0.0001 | 1.47632 | -0.8860 to -0.3774 |
| RReLU | 86.87% | 4.138% | 0.4478 | P < 0.0001 | 1.444091 | -0.8623 to -0.3595 |
| SELU | 83.91% | 4.831% | 0.5995 | P < 0.0001 | 7.020812 | -3.8713 to -3.2670 |
| SoftPlus(β = 1) | 83.004% | 5.546% | 1.4015 | P < 0.0001 | 4.345453 | -4.7778 to -4.1735 |
| HardShrink(λ = 0.5) | 75.03% | 7.231% | 0.98345 | P < 0.0001 | 16.601747 | -12.8948 to -12.0035 |
| Hardtanh | 82.78% | 5.209% | 0.4491 | P < 0.0001 | 11.093842 | -4.9522 to -4.4486 |
| LogSigmoid | 81.98% | 5.705% | 1.6751 | P < 0.0001 | 4.517156 | -6.2221 to -4.7753 |
| PReLU | 85.66% | 5.101% | 2.2406 | P = 0.0004 | 1.128135 | -2.7715 to -0.8590 |
| ReLU6 | 86.75% | 4.355% | 0.4501 | P < 0.0001 | 1.711482 | -0.9782 to -0.4740 |
| CELU(α=1.0) | 86.23% | 4.243% | 0.50941 | P < 0.0001 | 2.741669 | -1.5231 to -0.9804 |
| Sigmoid | 74.82% | 8.127% | 5.7662 | P < 0.0001 | 3.098289 | -15.0915 to -10.2337 |
| Softshrink(λ = 0.5) | 82.35% | 5.4915% | 0.71959 | P < 0.0001 | 8.830541 | -5.4762 to -4.7856 |
| Tanhshrink | 82.35% | 5.446% | 0.94508 | P < 0.0001 | 7.083564 | -5.5646 to -4.7032 |
| Tanh | 83.15% | 5.161% | 0.6887 | P < 0.0001 | 7.700198 | -4.6618 to -3.9938 |
| Softsign | 82.66% | 5.258% | 0.6697 | P < 0.0001 | 8.761157 | -5.1493 to -4.4951 |
| Aria-2(β = 1, α=1.5) | 81.31% | 6.0021% | 2.35475 | P < 0.0001 | 3.655362 | -7.1757 to -5.1687 |
| Bent's Identity | 85.03% | 4.531% | 0.60404 | P < 0.0001 | 4.80211 | -2.7576 to -2.1502 |
| SQNL | 83.44% | 5.015% | 0.46819 | P < 0.0001 | 9.317237 | -4.3009 to -3.7852 |
| ELisH | 87.38% | 4.288% | 0.47731 | P = 0.4283 | 0.235784 | -0.3643 to 0.1573 |
| Hard ELisH | 85.89% | 4.431% | 0.62245 | P < 0.0001 | 3.048849 | -1.9015 to -1.2811 |
| SReLU | 85.05% | 4.541% | 0.5826 | P < 0.0001 | 4.883831 | -2.7306 to -2.1381 |
| ISRU (α=1.0) | 86.85% | 4.669% | 0.1106 | P < 0.0001 | 5.302987 | -4.4855 to -3.5815 |
| Flatten T-Swish | 86.93% | 4.459% | 0.40047 | P < 0.0001 | 1.378742 | -0.7865 to -0.3127 |
| SineReLU (ε = 0.001) | 86.48% | 4.396% | 0.88062 | P < 0.0001 | 1.461675 | -1.4041 to -0.5924 |
| Weighted Tanh (Weight = 1.7145) | 80.66% | 5.985% | 1.19868 | P < 0.0001 | 7.638298 | -7.3502 to -6.2890 |
| LeCun's Tanh | 82.72% | 5.322% | 0.58256 | P < 0.0001 | 9.551812 | -5.0566 to -4.4642 |
| Soft Clipping (α=0.5) | 55.21% | 18.518% | 10.831994 | P < 0.0001 | 4.210373 | -36.8255 to -27.7154 |
| ISRLU (α=1.0) | 86.69% | 4.231% | 0.5788 | P < 0.0001 | 1.572874 | -1.0753 to -0.4856 |

Values are rounded, which might cause slight deviations when reproducing these statistics.

Results:


News: Ajay Arasanipalai recently submitted a benchmark for CIFAR-10 training to the Stanford DAWN Benchmark using a custom ResNet-9 + Mish, which achieved 94.05% accuracy in just 10.7 seconds over 14 epochs on the HAL Computing Cluster. This is currently the fastest training of CIFAR-10 on 4 GPUs and the 2nd fastest training of CIFAR-10 overall.

Summary of Results (Vision Tasks):

Comparison is based on the highest-priority metric for each task: Top-1 accuracy for image classification, and the loss metric for generative networks and image segmentation. Therefore, for the latter, Mish > Baseline indicates better (lower) loss, and vice versa. For embeddings, the AUC metric is considered.

| Activation Function | Mish > Baseline Model | Mish < Baseline Model |
|---|---|---|
| ReLU | 55 | 20 |
| Swish-1 | 53 | 22 |
| SELU | 26 | 1 |
| Sigmoid | 24 | 0 |
| TanH | 24 | 0 |
| HardShrink(λ = 0.5) | 23 | 0 |
| Tanhshrink | 23 | 0 |
| PReLU(Default Parameters) | 23 | 2 |
| Softsign | 22 | 1 |
| Softshrink (λ = 0.5) | 22 | 1 |
| Hardtanh | 21 | 2 |
| ELU(α=1.0) | 21 | 7 |
| LogSigmoid | 20 | 4 |
| GELU | 19 | 3 |
| E-Swish (β=1.75) | 19 | 7 |
| CELU(α=1.0) | 18 | 5 |
| SoftPlus(β = 1) | 17 | 7 |
| Leaky ReLU(α=0.3) | 17 | 8 |
| Aria-2(β = 1, α=1.5) | 16 | 2 |
| ReLU6 | 16 | 8 |
| SQNL | 13 | 1 |
| Weighted TanH (Weight = 1.7145) | 12 | 1 |
| RReLU | 12 | 11 |
| ISRU (α=1.0) | 11 | 1 |
| Le Cun's TanH | 10 | 2 |
| Bent's Identity | 10 | 5 |
| Hard ELisH | 9 | 1 |
| Flatten T-Swish | 9 | 3 |
| Soft Clipping (α=0.5) | 9 | 3 |
| SineReLU (ε = 0.001) | 9 | 4 |
| ISRLU (α=1.0) | 9 | 4 |
| ELisH | 7 | 3 |
| SReLU | 7 | 6 |
| Hard Sigmoid | 1 | 0 |
| Thresholded ReLU(θ=1.0) | 1 | 0 |

Summary of Results (Language Tasks):

Comparison is done based on the best metric score (Test accuracy) across 3 runs.

| Activation Function | Mish > Baseline Model | Mish < Baseline Model |
|---|---|---|
| Penalized TanH | 5 | 0 |
| ELU | 5 | 0 |
| Sigmoid | 5 | 0 |
| SReLU | 4 | 0 |
| TanH | 4 | 1 |
| Swish | 3 | 2 |
| ReLU | 2 | 3 |
| Leaky ReLU | 2 | 3 |
| GELU | 1 | 2 |

Try It!

| Torch | DarkNet | Julia | FastAI | TensorFlow | Keras | CUDA |
|---|---|---|---|---|---|---|
| Source | Source | Source | Source | Source | Source | Source |
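For PyTorch users specifically, Mish is also available natively since PyTorch 1.9 as `torch.nn.Mish` / `torch.nn.functional.mish`; a minimal drop-in usage example (the tiny model below is purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.Mish(),                    # built-in Mish (PyTorch >= 1.9)
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```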
<details> <summary><b>Acknowledgments:</b> (Click to expand)</summary>

Thanks to all the people who have helped and supported me massively through this project who include:

  1. Sparsha Mishra
  2. Alexandra Deis
  3. Alexey Bochkovskiy
  4. Chien-Yao Wang
  5. Thomas Brandon
  6. Less Wright
  7. Manjunath Bhat
  8. Ajay Uppili Arasanipalai
  9. Federico Lois
  10. Javier Ideami
  11. Ioannis Anifantakis
  12. George Christopoulos
  13. Miklos Toth

And many more, including the Fast AI community, Weights and Biases community, TensorFlow Addons team, SpaCy/Thinc team, Sicara team, and the Udacity scholarships team, to name a few. Apologies if I missed anyone.

</details>

Cite this work:

@article{misra2019mish,
  title={Mish: A self regularized non-monotonic neural activation function},
  author={Misra, Diganta},
  journal={arXiv preprint arXiv:1908.08681},
  year={2019}
}