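A Google paper that came across my Twitter feed turned out to be an interesting extension of "Pseudo Labeling," a technique that is used frequently in recent Kaggle competitions. This post gives a brief introduction to the paper.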

Last week we released the checkpoints for SOTA ImageNet models trained by NoisyStudent. Due to popular demand, we’ve also opensourced an implementation of NoisyStudent. The code uses SVHN for demonstration purposes.

Link: https://t.co/t3YK6Aiu5Q

Paper: https://t.co/ZYDaef6sdp pic.twitter.com/Ol1s1XcP7k
— Quoc Le (@quocleix) February 17, 2020

Paper link

[1911.04252] Self-training with Noisy Student improves ImageNet classification
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le
Submitted on 11 Nov 2019 (v1), last revised 7 Jan 2020 (this version, v2)

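What is Pseudo Labeling?

Pseudo Labeling is a technique in which the predictions on the test data are treated as values of the target variable, added to the training data, and used to train the model again*1. For details, see footnote 1 or the article cited there*2.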
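The procedure described above can be sketched in a few lines. This is only a minimal illustration, not code from the paper or from the book: the nearest-centroid classifier, the helper names `fit_centroids` and `predict`, and the toy 1-D data are all assumptions I made up to keep the example self-contained.

```python
from statistics import mean

def fit_centroids(xs, ys):
    """Return {label: mean of the 1-D features that carry that label}."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Labeled train data (toy values, made up for illustration).
train_x = [0.0, 1.0, 9.0, 10.0]
train_y = [0, 0, 1, 1]
# Unlabeled test data.
test_x = [2.0, 3.0, 7.0, 8.0]

# 1. Train on the labeled data only.
model = fit_centroids(train_x, train_y)
# 2. Predict the test data and treat the predictions as labels.
pseudo_y = [predict(model, x) for x in test_x]
# 3. Retrain on the labeled data plus the pseudo-labeled test data.
model2 = fit_centroids(train_x + test_x, train_y + pseudo_y)
```

In practice the model would be whatever you are already using in the competition; only the three numbered steps matter here.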

Overview of the paper

The figure below gives an overview of the paper.

f:id:upura:20200218124212p:plain

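Viewed as an extension of Pseudo Labeling, the paper differs in two major ways.

Adding appropriate noise to the pseudo-labeled additional data to improve robustness
Quantitatively confirming the performance gains from repeated iterations

Adding appropriate noise to the pseudo-labeled additional data to improve robustness

This is discussed in "4.1. The Importance of Noise in Self-training." The authors examine how performance changes depending on whether each kind of noise they apply (augmentation, stochastic depth, and dropout) is present.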

f:id:upura:20200218142202p:plain

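Consider the case in Pseudo Labeling where the pseudo-labeled test data is added to the train data as is. When the model then predicts the test data again, exactly the same data is already in the train data, so the loss can become excessively small and training may fail to proceed properly.

The paper aims to avoid this situation by applying noise when training on the test data that is added as train data: augmentation on the inputs, plus stochastic depth and dropout in the model.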
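To make the three kinds of noise concrete, here is a rough sketch of each one on plain Python lists. None of this is the paper's implementation: Gaussian jitter stands in for image augmentation, and the function names, default probabilities, and the toy `forward` network are all assumptions for illustration. The point is only that the noise is switched on in training mode and off when predicting.

```python
import random

def augment(x, rng, scale=0.1):
    """Input noise: jitter each feature (a stand-in for image augmentation)."""
    return [v + rng.gauss(0.0, scale) for v in x]

def dropout(h, rng, p=0.5, training=True):
    """Model noise: zero each unit with probability p and rescale the rest."""
    if not training:
        return list(h)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in h]

def stochastic_depth(x, block, rng, survival_prob=0.8, training=True):
    """Model noise: randomly skip a residual block during training."""
    if training and rng.random() >= survival_prob:
        return list(x)  # block dropped, only the identity path remains
    # At prediction time the block output is scaled by its survival probability.
    scale = 1.0 if training else survival_prob
    return [a + scale * b for a, b in zip(x, block(x))]

def forward(x, rng, training):
    """Toy network: optional augmentation, one residual block, then dropout."""
    if training:
        x = augment(x, rng)
    h = stochastic_depth(x, lambda v: [2.0 * a for a in v], rng, training=training)
    return dropout(h, rng, training=training)

rng = random.Random(0)
x = [1.0, -1.0]
clean = forward(x, rng, training=False)  # deterministic, no noise applied
noisy = forward(x, rng, training=True)   # changes with the random state
```

As I understand the paper, the pseudo labels are generated without noise and the noise is applied only while the student trains, which corresponds to `training=False` and `training=True` here.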

Quantitatively confirming the performance gains from repeated iterations

This is discussed in "4.2. A Study of Iterative Training." The table below shows the top-1 accuracy rising with each additional iteration.

f:id:upura:20200218143308p:plain

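It had already been reported, as a rule of thumb among participants in Kaggle and other machine learning competitions, that repeating Pseudo Labeling can improve performance*3. The experimental results in this paper are also just one example, but I think it is unusual to see the effect written up in the form of a paper.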
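The iteration itself is a short loop in which the trained student becomes the next teacher. The sketch below is a toy under my own assumptions (a nearest-centroid model, made-up 1-D data, Gaussian jitter as the only noise, three rounds), so it shows the control flow rather than the paper's setup, and it makes no claim that the score goes up on this data.

```python
import random
from statistics import mean

def fit(xs, ys):
    """Nearest-centroid 'model': {label: mean feature value}."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, xs, ys):
    """Fraction of examples whose predicted label matches the true label."""
    return mean(1.0 if predict(model, x) == y else 0.0 for x, y in zip(xs, ys))

# Toy data, made up for illustration.
labeled_x, labeled_y = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]
unlabeled_x = [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
valid_x, valid_y = [1.5, 4.5, 5.5, 8.5], [0, 0, 1, 1]

rng = random.Random(0)
teacher = fit(labeled_x, labeled_y)
history = [accuracy(teacher, valid_x, valid_y)]

for _ in range(3):
    # The teacher labels the clean unlabeled data.
    pseudo_y = [predict(teacher, x) for x in unlabeled_x]
    # The student trains on labeled data plus noised pseudo-labeled data.
    noisy_x = [x + rng.gauss(0.0, 0.1) for x in unlabeled_x]
    student = fit(labeled_x + noisy_x, labeled_y + pseudo_y)
    history.append(accuracy(student, valid_x, valid_y))
    # The student becomes the teacher for the next round.
    teacher = student
```

Tracking a validation score in `history` after every round is the part worth copying: it is what lets you check, as the paper does, whether another iteration actually helps.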

Conclusion

This post introduced the Google paper "Self-training with Noisy Student improves ImageNet classification," which extends Pseudo Labeling. The core idea is, in a sense, obvious to many Kagglers and matches what they have experienced in practice.

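My impression is that the finer details of Pseudo Labeling have branched into many variants depending on the competition task and dataset. For example, the "Kaggle book" introduces techniques such as "adding only data with a high predicted probability in order to maintain quality" and "taking the grouping of the data into account when adding it"*4. This paper is worth keeping in your toolbox as one more such variant.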
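The first of those variants, adding only confidently predicted data, amounts to a simple filter. The function name `select_confident`, the 0.9 threshold, and the probabilities below are assumptions for illustration, not values from the book or the paper.

```python
def select_confident(probs, threshold=0.9):
    """Return (row index, predicted label) for rows whose top probability
    is at least `threshold`."""
    selected = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] >= threshold:
            selected.append((i, label))
    return selected

# Predicted class probabilities for four unlabeled rows (made-up values).
probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.05, 0.95],
    [0.30, 0.70],
]
confident = select_confident(probs, threshold=0.9)
```

Only the rows returned in `confident` would be added to the train data; the threshold is something to tune per task.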

"Appendix A.2." of the paper lists findings from eight perspectives. I think these can also serve as guidelines when trying Pseudo Labeling in machine learning competitions.

Finding #1: Using a large teacher model with better performance leads to better results.

Finding #2: A large amount of unlabeled data is necessary for better performance.

Finding #3: Soft pseudo labels work better than hard pseudo labels for out-of-domain data in certain cases.

Finding #4: A large student model is important to enable the student to learn a more powerful model.

Finding #5: Data balancing is useful for small models.

Finding #6: Joint training on labeled data and unlabeled data outperforms the pipeline that first pretrains with unlabeled data and then finetunes on labeled data.

Finding #7: Using a large ratio between unlabeled batch size and labeled batch size enables models to train longer on unlabeled data to achieve a higher accuracy.

Finding #8: Training the student from scratch is sometimes better than initializing the student with the teacher and the student initialized with the teacher still requires a large number of training epochs to perform well.
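For readers unfamiliar with the terms in Finding #3, a hard pseudo label keeps only the teacher's top class as a one-hot vector, while a soft pseudo label keeps the full probability distribution. The sketch below shows the difference in the student's cross-entropy target; the helper names and the probabilities are made up for illustration.

```python
import math

def hard_label(probs):
    """One-hot vector for the most probable class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

teacher_probs = [0.6, 0.3, 0.1]   # teacher output on one unlabeled example
student_probs = [0.5, 0.4, 0.1]   # student output on the same example

# Soft pseudo label: the student is trained against the full distribution.
soft_loss = cross_entropy(teacher_probs, student_probs)
# Hard pseudo label: the student is trained against the one-hot top class.
hard_loss = cross_entropy(hard_label(teacher_probs), student_probs)
```

With a soft label the teacher's uncertainty about the second and third classes stays in the training signal, whereas the hard label discards it.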

*1: Kadowaki et al., Kaggleで勝つデータ分析の技術, 技術評論社, p. 266, 2019.

*2:

*3:

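The more I do pseudo labeling, the better the score gets 🤔 I had actually only done it twice, but when I repeated it seven times as in the picture, the score kept improving every time. Still, using it repeatedly feels somehow scary... I'd like to properly understand why repeating pseudo labeling works even for regression. pic.twitter.com/3crGot2uux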
— いのいち (@inoichan) November 24, 2019