[Paper Notes] Swish: a Self-Gated Activation Function

Paper

Prajit Ramachandran, Barret Zoph, Quoc V. Le: Swish: a Self-Gated Activation Function, arXiv:1710.05941 [cs.NE], 2017-10-16.
[1710.05941] Searching for Activation Functions

What is it?

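Proposes ${f(x)=x\sigma(x)}$, where ${\sigma}$ is the sigmoid function, as a simple activation function to replace ReLU, the current mainstream choice
On several challenging datasets it outperformed ReLU, particularly with deep neural networks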
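
As a minimal sketch of the definition above, here is Swish next to ReLU in plain Python (standard library only). The function names `sigmoid`, `swish`, and `relu` are my own choices for illustration and are not taken from the paper's code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign so math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def relu(x: float) -> float:
    """ReLU: max(0, x), shown for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations side by side on a few sample inputs.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

For large positive inputs the two functions nearly coincide. For negative inputs ReLU is exactly zero, while Swish is slightly negative.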

upura.hatenablog.com

What makes it better than prior work?

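ReLU is currently the standard activation function
Many activation functions have been proposed to replace ReLU, but none has consistently outperformed it across multiple datasets
Swish outperformed ReLU with deep neural network models on multiple tasks, including image classification and machine translation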

What is the key to the technique or method?

unbounded above
- Unboundedness is desirable because it avoids saturation
bounded below
- Functions that approach zero in the limit induce even larger regularization effects because large negative inputs are “forgotten”
non-monotonic
- it produces negative outputs for small negative inputs
- The non-monotonicity of Swish increases expressivity and improves gradient flow
- This property may also provide some robustness to different initializations and learning rates
smooth
- smoothness plays a beneficial role in optimization and generalization
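
The four properties listed above can be checked numerically. The sketch below is my own illustration, not code from the paper. It uses the derivative $f'(x) = \sigma(x) + x\sigma(x)(1-\sigma(x))$ and a simple bisection to locate the single minimum, which sits at roughly $x \approx -1.28$ with $f(x) \approx -0.28$. The helper names `swish_grad` and `find_minimum` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign so math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def swish_grad(x: float) -> float:
    """Derivative: f'(x) = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def find_minimum(lo: float = -5.0, hi: float = 0.0, iters: int = 100) -> float:
    """Bisection on f'(x) = 0; f' is negative at lo and positive at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if swish_grad(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    x_min = find_minimum()
    print(f"minimum at x = {x_min:.4f}, f(x) = {swish(x_min):.4f}")
    # Non-monotonic: the output falls and then rises again on the negative side.
    print(swish(-5.0), swish(x_min), swish(0.0))
    # Bounded below (tends to 0 from below) and unbounded above (tends to x).
    print(swish(-50.0), swish(50.0))
```

The sign change of `swish_grad` on the negative side is what makes the function non-monotonic, and the fact that the derivative exists everywhere (unlike ReLU at 0) is the smoothness property.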

How was its effectiveness verified?

Compared against ReLU and other activation functions under a variety of conditions and settings (models and hyperparameters were left as tuned for ReLU)

Comparison with ReLU
- Training efficiency of DNNs
- Robustness across batch sizes
Comparison with other activation functions (including ReLU)
- CIFAR
- ImageNet
- Machine translation
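
The comparison protocol above keeps the model and hyperparameters fixed and swaps only the activation function. A toy sketch of that idea follows, assuming a small randomly initialized MLP of my own invention. The helpers `make_layer` and `forward` and the layer sizes are hypothetical and have nothing to do with the networks actually evaluated in the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign so math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x: float) -> float:
    return max(0.0, x)


def swish(x: float) -> float:
    return x * sigmoid(x)


def make_layer(n_in: int, n_out: int, rng: random.Random) -> list:
    """Random weight matrix for one dense layer (toy initializer, no bias)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def forward(x: list, layers: list, activation) -> list:
    """Forward pass through a toy MLP; only `activation` differs between runs."""
    h = x
    for weights in layers:
        h = [activation(sum(w * v for w, v in zip(row, h))) for row in weights]
    return h


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so both runs share identical weights
    layers = [make_layer(4, 8, rng), make_layer(8, 8, rng), make_layer(8, 2, rng)]
    x = [0.5, -1.0, 2.0, 0.1]
    for name, act in (("relu", relu), ("swish", swish)):
        print(name, forward(x, layers, act))
```

Passing the activation as an argument keeps everything else identical between runs, which is the point of the "same settings as ReLU" comparison.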

Any discussion?

Models and hyperparameters tailored to Swish remain to be explored

Which paper should be read next?

The ReLU paper
Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.