Downsampling imbalanced data for multiclass classification

Problem setup

For example, consider a 4-class classification problem where the class ratio is roughly [0.5, 0.25, 0.15, 0.10].

import collections

from sklearn.datasets import make_classification

args = {
    'n_samples': 100000,
    'n_features': 10,
    'n_informative': 3,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 4,
    'n_clusters_per_class': 1,
    'weights': [0.5, 0.25, 0.15, 0.10],
    'random_state': 42,
}

X, y = make_classification(**args)

c = collections.Counter(list(y))
print(c)

Counter({0: 49727, 1: 25021, 2: 15094, 3: 10158})
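As a quick sanity check, the counts above can be converted back into class ratios and compared with the requested weights. The sketch below only redoes the arithmetic on the printed counts. The small deviations from [0.5, 0.25, 0.15, 0.10] are, as far as I know, explained by the default label noise in make_classification (the flip_y parameter); that explanation is my assumption, not something the original article states.

```python
# Class counts copied from the Counter output above.
counts = {0: 49727, 1: 25021, 2: 15094, 3: 10158}

total = sum(counts.values())

# Observed share of each class, rounded to three decimals.
ratios = {label: round(n / total, 3) for label, n in counts.items()}
print(total)
print(ratios)
```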

RandomUnderSampler

To downsample imbalanced multiclass data like this, you can use RandomUnderSampler from imblearn.under_sampling, the same class I used to downsample a binary classification problem in the article below.

upura.hatenablog.com

from imblearn.under_sampling import RandomUnderSampler

sampler = RandomUnderSampler(random_state=42)
# Downsample every class to the size of the smallest class
X_resampled, y_resampled = sampler.fit_resample(X, y)
collections.Counter(list(y_resampled))

Counter({0: 10158, 1: 10158, 2: 10158, 3: 10158})

Every other label has been downsampled to 10158 samples, matching label 3, the smallest class.
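To make the behavior concrete, here is a minimal NumPy sketch of the same idea: pick as many rows from each class, without replacement, as the smallest class has. It is not imblearn's actual implementation. The function name undersample_to_minority and the toy data are hypothetical and exist only for illustration.

```python
import collections

import numpy as np


def undersample_to_minority(X, y, random_state=42):
    """Randomly keep (minority class size) rows from every class.

    Hypothetical helper for illustration; not part of imblearn.
    """
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()

    kept = []
    for label in classes:
        idx = np.flatnonzero(y == label)
        # Sample without replacement so that no row is duplicated.
        kept.append(rng.choice(idx, size=n_min, replace=False))

    # Sort so the surviving rows keep their original order.
    kept = np.sort(np.concatenate(kept))
    return X[kept], y[kept]


# Toy data: 4 classes with 50, 25, 15 and 10 samples (hypothetical sizes).
y_toy = np.repeat([0, 1, 2, 3], [50, 25, 15, 10])
X_toy = np.arange(len(y_toy) * 2).reshape(-1, 2)

X_small, y_small = undersample_to_minority(X_toy, y_toy)
print(collections.Counter(y_small.tolist()))
```

Each class ends up with 10 samples, the size of the smallest class in the toy data.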

make_imbalance

For finer-grained control over downsampling, you can use make_imbalance from imblearn.datasets.

imbalanced-learn.readthedocs.io

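You define the downsampling policy as a dictionary that maps each class label to the number of samples to keep, and the data is extracted according to that policy. You can also deliberately extract an imbalanced sample.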

from imblearn.datasets import make_imbalance

# Number of samples to keep for each class label
key = [0, 1, 2, 3]
val = [500, 1000, 1500, 2000]
strategy = dict(zip(key, val))

X_r, y_r = make_imbalance(
    X, y,
    sampling_strategy=strategy,
    random_state=123,
)

collections.Counter(list(y_r))

Counter({0: 500, 1: 1000, 2: 1500, 3: 2000})
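The strategy dictionary does not have to be written by hand. The sketch below derives it from the class counts by capping every class at a maximum size. capped_strategy and the cap of 12000 are hypothetical choices for illustration. Clamping with min() keeps each target at or below the number of samples actually available, which matters because make_imbalance only removes samples and, as far as I understand, rejects targets larger than the original class size.

```python
import collections


def capped_strategy(y, cap):
    """Build {label: target_count}, keeping at most `cap` samples per class.

    Hypothetical helper for illustration; not part of imblearn.
    """
    counts = collections.Counter(y)
    return {label: min(n, cap) for label, n in sorted(counts.items())}


# Labels with the same class counts as the make_classification output above.
y_example = [0] * 49727 + [1] * 25021 + [2] * 15094 + [3] * 10158

strategy_capped = capped_strategy(y_example, cap=12000)
print(strategy_capped)
```

The resulting dictionary can then be passed as sampling_strategy. With a NumPy label array, calling y.tolist() first keeps the keys as plain Python integers.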

Conclusion

In this article, I introduced two ways to downsample in Python, using imbalanced data for multiclass classification as the example.

The implementation is available on GitHub.

github.com