問題文

nlp100.github.io

問題の概要

言語処理100本ノック 2020「61. 単語の類似度」 - u++の備忘録と言語処理100本ノック 2020「64. アナロジーデータでの実験」 - u++の備忘録の組み合わせです。

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from tqdm import tqdm


def culcCosSim(row):
    global model
    return model(row['Word 1'], row['Word 2'])


tqdm.pandas()
model = KeyedVectors.load_word2vec_format('ch07/GoogleNews-vectors-negative300.bin', binary=True)
df = pd.read_csv('ch07/wordsim353/combined.csv')
df['cosSim'] = df.progress_apply(culcCosSim, axis=1)

print(df[['Human (mean)', 'cosSim']].corr(method='spearman'))