gensim Tutorial 1 – Corpora and Vector Spaces

This is my translation of gensim's Tutorial 1. The code samples for this tutorial are published on GitHub: gensim-learning.

Corpora and Vector Spaces

Setup

To see logging events, run the following code.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

From Strings to Vectors

First, let's create some documents.

from gensim import corpora
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

Next we tokenize the documents and remove common words as well as words that appear only once. (Common words are removed using a simple stop word list.)

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

To check the result, print it as follows.

from pprint import pprint  # pretty-printer
pprint(texts)

This produces the following output.

[['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]

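There are many ways to process documents. Here I only split on whitespace and lowercase every word, a simple but inefficient approach chosen to mimic the experiment in the original LSA article by Deerwester et al. The corpus used here is the same as the one in Table 2 of Deerwester et al. (1990), Indexing by Latent Semantic Analysis.

Because the ways to process documents are so varied and depend on the application and the language, I decided not to constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its surface string form. How you obtain those features is up to you. A common general-purpose approach (bag-of-words) is described below, but different application domains may call for different features. As always, it's garbage in, garbage out: without meaningful input you will not get meaningful output.

To convert documents to vectors, we use a document representation called bag-of-words. In this representation, each element of a vector corresponds to a question-answer pair like the following:

How many times does the word “system” appear in the document? Once.

It is convenient to refer to the questions only by their integer ids. The mapping between the questions and the ids is called a dictionary.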

dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference

You can check it as follows.

print(dictionary)
# Dictionary(12 unique tokens)

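Here we use the class gensim.corpora.dictionary.Dictionary to assign a unique integer id to every word that appears in the corpus. It does this by sweeping across all the texts and collecting word counts and related statistics. In the end, we see that there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (a 12-dimensional vector). To see the mapping between words and their ids: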

print(dictionary.token2id)
# {'minors': 11, 'graph': 10, 'system': 5, 'trees': 9,
# 'eps': 8, 'computer': 0, 'survey': 4, 'user': 7,
# 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
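
To make the idea of this mapping concrete, here is a minimal pure-Python sketch that assigns an integer id to each distinct token. The helper name `build_token2id` is hypothetical and not part of gensim, and the ids it assigns will generally differ from the ones gensim's Dictionary produces.

```python
def build_token2id(texts):
    """Assign consecutive integer ids to tokens in order of first appearance.

    Hypothetical helper for illustration only; gensim's Dictionary may
    number tokens differently.
    """
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


sample_texts = [
    ['human', 'interface', 'computer'],
    ['eps', 'user', 'interface', 'system'],
]
sample_token2id = build_token2id(sample_texts)
print(sample_token2id)
# {'human': 0, 'interface': 1, 'computer': 2, 'eps': 3, 'user': 4, 'system': 5}
```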

To convert a tokenized document to a vector:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
# [(0, 1), (1, 1)]

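The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer id, and returns the result as a sparse vector. A sparse vector is a vector in which most elements are zero, so only the non-zero elements are stored. The sparse vector [(0, 1), (1, 1)] therefore means that in the document “Human computer interaction” the word “computer” (id 0) appears once and the word “human” (id 1) appears once. All other dictionary words appear (implicitly) zero times. Words that are not in the dictionary are not represented in the vector; “interaction” in the example above is one such word.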
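
The counting step can be sketched in plain Python. This is only an approximation of the behaviour described here, not gensim's implementation; the function name `simple_doc2bow` and the two-entry mapping are assumptions made for the example, with the ids copied from the token2id output shown earlier.

```python
from collections import Counter


def simple_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (id, count) pairs.

    Tokens missing from token2id are silently dropped, mirroring how
    out-of-dictionary words are left out of the sparse vector.
    """
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


token2id_subset = {'computer': 0, 'human': 1}  # ids copied from the mapping shown earlier
bow = simple_doc2bow("Human computer interaction".lower().split(), token2id_subset)
print(bow)  # [(0, 1), (1, 1)]
```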

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
print(corpus)
# [(0, 1), (1, 1), (2, 1)]
# [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
# [(2, 1), (5, 1), (7, 1), (8, 1)]
# [(1, 1), (5, 2), (8, 1)]
# [(3, 1), (6, 1), (7, 1)]
# [(9, 1)]
# [(9, 1), (10, 1)]
# [(9, 1), (10, 1), (11, 1)]
# [(4, 1), (10, 1), (11, 1)]

The vector feature with id=10 stands for the question “How many times does the word graph appear in the document?”; the answer is zero for the first six documents and one for the remaining three.
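
As a quick check, the sketch below reads a single feature out of every document. It uses the token2id mapping shown earlier, where “graph” has id 10; the nine vectors are copied from the printed corpus, and the helper name `feature_count` is made up for this example.

```python
corpus_vectors = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def feature_count(doc, feature_id):
    """Return the count stored for feature_id, or 0 if the id is absent."""
    return dict(doc).get(feature_id, 0)


graph_counts = [feature_count(doc, 10) for doc in corpus_vectors]
print(graph_counts)  # [0, 0, 0, 0, 0, 0, 1, 1, 1]
```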

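We treat these vectors as our corpus and save it to a file for later use. A corpus is a large collection of texts or utterances compiled into a database as linguistic material.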

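Corpus Streaming – One Document at a Time

The corpus above resides fully in memory as a plain Python list. In this simple example that is not a problem, but suppose there were a million documents in the corpus. Storing all of them in RAM would not be feasible. Instead, assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time: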

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

The mycorpus.txt file used here has the following contents.

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

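The assumption that each document occupies one line in a single file is not important. Whatever the format, you can adapt the __iter__ function to fit it: walking directories, parsing XML, accessing the network, and so on. Just parse your input to retrieve a clean list of tokens for each document, then convert the tokens to ids via the dictionary and yield the resulting sparse vector inside __iter__.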
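
As one illustration of adapting __iter__ to another layout, the sketch below streams one vector per .txt file found under a directory. Everything in it is an assumption made for the example: the class name `DirectoryCorpus`, the one-document-per-file layout, and the plain `token2id` dict that stands in for a gensim dictionary (with gensim you would yield `dictionary.doc2bow(tokens)` instead of counting by hand).

```python
import os
import tempfile
from collections import Counter


class DirectoryCorpus:
    """Stream one sparse vector per .txt file found under `root`."""

    def __init__(self, root, token2id):
        self.root = root
        self.token2id = token2id  # plain dict standing in for a gensim dictionary

    def __iter__(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            dirnames.sort()  # keep the traversal order deterministic
            for name in sorted(filenames):
                if not name.endswith('.txt'):
                    continue
                with open(os.path.join(dirpath, name), encoding='utf-8') as handle:
                    tokens = handle.read().lower().split()
                counts = Counter(
                    self.token2id[token] for token in tokens if token in self.token2id
                )
                yield sorted(counts.items())


# Usage with two throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    samples = [('a.txt', 'Human computer interaction'), ('b.txt', 'Graph minors A survey')]
    for name, body in samples:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as handle:
            handle.write(body)
    demo_token2id = {'computer': 0, 'human': 1, 'survey': 4, 'graph': 10, 'minors': 11}
    directory_vectors = list(DirectoryCorpus(tmp, demo_token2id))

print(directory_vectors)
# [[(0, 1), (1, 1)], [(4, 1), (10, 1), (11, 1)]]
```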

corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)
# <__main__.MyCorpus object at 0x10d5690>

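The corpus is now an object. We didn't define any way to print it, so print just outputs the object's address in memory, which is not very useful. To see the constituent vectors, iterate over the corpus and print each document vector, one per line. Each iteration loads and prints a single vector.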

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
# [(0, 1), (1, 1), (2, 1)]
# [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
# [(2, 1), (5, 1), (7, 1), (8, 1)]
# [(1, 1), (5, 2), (8, 1)]
# [(3, 1), (6, 1), (7, 1)]
# [(9, 1)]
# [(9, 1), (10, 1)]
# [(9, 1), (10, 1), (11, 1)]
# [(4, 1), (10, 1), (11, 1)]

Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
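
One practical consequence of writing the corpus as a class with __iter__, rather than as a bare generator, is that it can be iterated more than once; a generator is exhausted after a single pass. The sketch below shows the difference. The names `LineCorpus` and `line_generator` and the in-memory list of lines are stand-ins for illustration only.

```python
class LineCorpus:
    """Re-iterable stream: every call to __iter__ starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()


def line_generator(lines):
    """One-shot stream: the returned generator can be consumed only once."""
    for line in lines:
        yield line.lower().split()


raw_lines = ["Graph minors A survey", "The EPS user interface management system"]
streamed = LineCorpus(raw_lines)
one_shot = line_generator(raw_lines)

class_first_pass = sum(1 for _ in streamed)
class_second_pass = sum(1 for _ in streamed)
generator_first_pass = sum(1 for _ in one_shot)
generator_second_pass = sum(1 for _ in one_shot)
print(class_first_pass, class_second_pass, generator_first_pass, generator_second_pass)
# 2 2 2 0
```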

Similarly, the dictionary can be built without loading all texts into memory:

# collect statistics about all tokens
dictionary = corpora.Dictionary(
    line.lower().split() for line in open('mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [
    dictionary.token2id[stopword] for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [
    tokenid for tokenid, docfreq in dictionary.dfs.items()
    if docfreq == 1
]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
# Dictionary(12 unique tokens)
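
Note that dictionary.dfs holds document frequencies (the number of documents that contain a token), whereas the in-memory version earlier counted total occurrences. The two filters are not equivalent in general, but they happen to keep the same twelve words for this corpus. The pure-Python sketch below checks this without gensim; the variable names are chosen for the example.

```python
from collections import Counter

docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stop_words = set('for a of the and to in'.split())
tokenized = [
    [word for word in doc.lower().split() if word not in stop_words]
    for doc in docs
]

# total occurrences across the corpus vs. number of documents containing the token
total_freq = Counter(token for text in tokenized for token in text)
doc_freq = Counter(token for text in tokenized for token in set(text))

kept_by_total = {token for token, n in total_freq.items() if n > 1}
kept_by_doc = {token for token, n in doc_freq.items() if n > 1}
print(len(kept_by_total), len(kept_by_doc), kept_by_total == kept_by_doc)
# 12 12 True
```

A word that occurs several times but only inside a single document would be kept by the first filter and dropped by the second.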

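And that is all there is to it, at least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is the next question. It is not yet clear how counting the frequency of each word could be useful. In any case, we first need to apply a transformation to this simple representation before we can compute meaningful document-to-document similarities. Transformations are covered in the next tutorial, but before that, let's turn our attention to corpus persistence.

Corpus Formats

There are several file formats for serializing a vector space corpus (a sequence of vectors) to disk. Gensim implements them via the streaming corpus interface described earlier: documents are read from disk one at a time, rather than the whole corpus being loaded into memory at once.

One of the more notable file formats is the Matrix Market format. To save a corpus in the Matrix Market format: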

# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
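
To give a feel for what such a file contains, here is a rough sketch that renders the toy corpus as Matrix Market coordinate text, with documents as rows, terms as columns, and 1-based indices. The function `to_matrix_market` is hypothetical and written only for this illustration; the file gensim actually writes may differ in details such as number formatting.

```python
def to_matrix_market(corpus, num_terms):
    """Render a list of sparse documents as Matrix Market coordinate text.

    Hypothetical helper for illustration; rows are documents, columns are
    terms, and both indices are 1-based as the format requires.
    """
    entries = []
    for doc_index, doc in enumerate(corpus, start=1):
        for term_id, weight in doc:
            entries.append(f"{doc_index} {term_id + 1} {weight}")
    header = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(corpus)} {num_terms} {len(entries)}"
    return "\n".join([header, size_line] + entries) + "\n"


toy_corpus = [[(1, 0.5)], []]
mm_text = to_matrix_market(toy_corpus, num_terms=2)
print(mm_text)
# %%MatrixMarket matrix coordinate real general
# 2 2 1
# 1 2 0.5
```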

Other formats include Joachims' SVMlight format, Blei's LDA-C format, and the GibbsLDA++ format.

corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

Conversely, to load a corpus from a Matrix Market file:

corpus = corpora.MmCorpus('/tmp/corpus.mm')

Corpus objects are streams, so typically you cannot print them directly:

print(corpus)
# MmCorpus(2 documents, 2 features, 1 non-zero entries)

Instead, to view the contents of a corpus:

# one way of printing a corpus: load it entirely into memory
print(list(corpus))  # calling list() will convert any sequence to a plain Python list
# [[(1, 0.5)], []]

or

# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)
# [(1, 0.5)]
# []

The second way is obviously more memory friendly, but for testing and development purposes nothing beats the simplicity of calling list(corpus).

To save the same Matrix Market document stream in Blei's LDA-C format:

corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

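In this way, gensim can also be used as a memory-efficient I/O format conversion tool: load a document stream in one format and immediately save it in another. Adding new formats is quite easy; see the code for the SVMlight corpus for an example.

Compatibility with NumPy and SciPy

Gensim also contains efficient utility functions to help convert to and from NumPy matrices: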

import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5,2])  # random matrix as an example
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)  # by default, each column is one document
number_of_corpus_features = numpy_matrix.shape[0]  # rows are features, so 5 here
numpy_matrix = gensim.matutils.corpus2dense(
    corpus, num_terms=number_of_corpus_features)
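
The relationship between the dense matrix and the corpus can be sketched with plain nested lists. The sketch assumes the column-per-document layout that Dense2Corpus uses by default; the helper names `dense_to_corpus` and `corpus_to_dense` are made up for this example and are not gensim functions.

```python
def dense_to_corpus(matrix):
    """Treat each column of a nested list (terms x documents) as one document."""
    num_terms = len(matrix)
    num_docs = len(matrix[0]) if matrix else 0
    return [
        [
            (term_id, matrix[term_id][doc_id])
            for term_id in range(num_terms)
            if matrix[term_id][doc_id] != 0  # zeros are left implicit
        ]
        for doc_id in range(num_docs)
    ]


def corpus_to_dense(corpus, num_terms):
    """Rebuild the terms x documents matrix; num_terms fixes the row count."""
    matrix = [[0] * len(corpus) for _ in range(num_terms)]
    for doc_id, doc in enumerate(corpus):
        for term_id, value in doc:
            matrix[term_id][doc_id] = value
    return matrix


dense = [[1, 0], [0, 2], [3, 0], [0, 0], [4, 5]]  # 5 terms x 2 documents
sparse_docs = dense_to_corpus(dense)
print(sparse_docs)  # [[(0, 1), (2, 3), (4, 4)], [(1, 2), (4, 5)]]
round_trip = corpus_to_dense(sparse_docs, num_terms=5)
print(round_trip == dense)  # True
```

The number of terms has to be passed explicitly on the way back because a sparse corpus does not record rows that are zero in every document.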

Conversion to and from scipy.sparse matrices is also possible:

import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5,2)  # random sparse matrix as example
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)

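For a complete reference, including how to prune the dictionary to a smaller size and how to make conversions between corpora and NumPy/SciPy arrays more efficient, see the API documentation. The next tutorial covers topics and transformations.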