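Trying Out Elasticsearch

Hello. This is Y from the Kumamoto office.

I had a chance to use Amazon Elasticsearch Service on AWS at work, and some of the configuration was not clear to me, so I would like to summarize it briefly here.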

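1. What is Elasticsearch?

An open-source full-text search engine developed by Elastic
Provides powerful search and sorting through a REST API
Can be used together with Kibana^※1, a tool for analyzing and visualizing log data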

2. Basic terminology

Elasticsearch is often explained by comparison with an RDB. Roughly speaking, the terms correspond like this.

RDB	Elasticsearch
Database	Index
Table	Type
Record	Document

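3. Registering documents

Kibana's Dev Tools lets you access Elasticsearch through the REST API.

As a test, I registered a document whose properties are the author and the verse of a poem from the Hyakunin Isshu.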

PUT test-index1/_doc/1
{
  "author": "天智天皇",
  "ku": "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
}

The response comes back with "result": "created" and "successful": 1. (Success!)

{
  "_index" : "test-index1",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Register two more in the same way.

PUT test-index1/_doc/2
{
  "author": "持統天皇",
  "ku": "春過ぎて 夏来にけらし 白妙の 衣ほすてふ 天の香具山"
}

PUT test-index1/_doc/3
{
  "author": "柿本人麿",
  "ku": "足引きの 山鳥の尾の しだり尾の ながながし夜を ひとりかもねむ"
}
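The same registration can also be issued from outside Kibana, since it is an ordinary HTTP request. The following is a minimal Python sketch that only builds the request and does not send it. The endpoint in BASE_URL and the helper name build_index_request are assumptions made for illustration, and the sketch assumes the cluster accepts unsigned requests, which may not be true of a given Amazon Elasticsearch Service domain.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with your own cluster or domain URL.
BASE_URL = "http://localhost:9200"


def build_index_request(base_url, index, doc_id, document):
    """Build (but do not send) the HTTP request for PUT <index>/_doc/<id>."""
    body = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    BASE_URL,
    "test-index1",
    1,
    {
        "author": "天智天皇",
        "ku": "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ",
    },
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it possible to inspect the method, URL, and body before anything reaches the cluster.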

4. Trying a search

Elasticsearch searches can be written in a query language called Query DSL.

For example, suppose I want to retrieve the first document I registered:

{
  "author": "天智天皇",
  "ku": "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
}

I try to pick it up by partial match using "match".

GET test-index1/_search
{
  "query": {
    "match": {
      "ku": "秋"
    }
  }
}

It was retrieved as expected. (Success!)

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test-index1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "author" : "天智天皇",
          "ku" : "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
        }
      }
    ]
  }
}

Searching for "秋の田の" should work the same way! However...

GET test-index1/_search
{
  "query": {
    "match": {
      "ku": "秋の田の"
    }
  }
}

For some reason there are many hits... and the last two do not even contain "秋の田の"...

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.5490575,
    "hits" : [
      {
        "_index" : "test-index1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.5490575,
        "_source" : {
          "author" : "天智天皇",
          "ku" : "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
        }
      },
      {
        "_index" : "test-index1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.97369325,
        "_source" : {
          "author" : "柿本人麿",
          "ku" : "足引きの 山鳥の尾の しだり尾の ながながし夜を ひとりかもねむ"
        }
      },
      {
        "_index" : "test-index1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.7911257,
        "_source" : {
          "author" : "持統天皇",
          "ku" : "春過ぎて 夏来にけらし 白妙の 衣ほすてふ 天の香具山"
        }
      }
    ]
  }
}

The official documentation says the following.

Analysis is the process of converting text, like the body of any email, into tokens or terms which are added to the inverted index for searching

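Paraphrase: "Analysis" is the process of converting text, such as the body of an email, into tokens or terms. These are then added to the inverted index used for searching.

So it seems the documents I registered are split into small pieces before they are indexed.

Seeing is believing!

Let's check how the text is indexed using the analyze API.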

GET test-index1/_analyze
{
  "field": "ku",
  "text": "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
}

{
  "tokens" : [
    {
      "token" : "秋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "の",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<HIRAGANA>",
      "position" : 1
    },
    {
      "token" : "田",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "の",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<HIRAGANA>",
      "position" : 3
    },
    {
      "token" : "か",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<HIRAGANA>",
      "position" : 4
    },
(continues...)
  ]
}

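It is split one character at a time... Since no mapping was defined for test-index1, the field uses the default standard analyzer, which emits each kanji and hiragana character as its own token.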
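To make the shape of that output concrete, here is a small Python sketch that reproduces the same offsets and positions for this particular verse. It is a deliberate simplification and not the analyzer Elasticsearch actually runs: it assumes every non-whitespace character becomes one token, which happens to hold for text made only of kanji and hiragana. The function name per_character_tokens is made up for this example.

```python
def per_character_tokens(text):
    """Emit one token per non-whitespace character, with offsets and position."""
    tokens = []
    position = 0
    for offset, char in enumerate(text):
        if char.isspace():
            continue  # whitespace separates tokens and is not emitted itself
        tokens.append(
            {
                "token": char,
                "start_offset": offset,
                "end_offset": offset + 1,
                "position": position,
            }
        )
        position += 1
    return tokens


tokens = per_character_tokens(
    "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
)
for token in tokens[:5]:
    print(token)
```

Note how the fifth token starts at offset 5 but has position 4: the space at offset 4 advances the character offset without producing a token.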

The official documentation also says:

Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index.

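Paraphrase: Usually the same analyzer is applied at index time and at search time, to ensure that the terms in the query have the same format as the terms in the inverted index.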

GET test-index1/_validate/query?explain
{
  "query": {
    "match": {
      "ku": "秋の田の"
    }
  }
}

Looking at "explanation", you can see that the query was split into "秋", "の", "田", and "の".

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test-index1",
      "valid" : true,
      "explanation" : "ku:秋 ku:の ku:田 ku:の"
    }
  ]
}

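I see: documents are split into terms when they are indexed, and at search time the search words are split in the same way before being matched!!

A match query combines its terms with OR by default, so any document that contains even one of them is a hit. All three verses contain "の", which is why all three were returned.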
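The behaviour seen so far can be imitated with a toy inverted index in Python. This is only a sketch under stated assumptions: the analyzer is reduced to one term per non-whitespace character, scoring is ignored entirely, and the function names are invented for the example. It shows why "秋" hits one document while "秋の田の" hits all three when terms are combined with OR, and how requiring every term (which the match query supports through "operator": "and") narrows the result.

```python
from collections import defaultdict

DOCS = {
    "1": "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ",
    "2": "春過ぎて 夏来にけらし 白妙の 衣ほすてふ 天の香具山",
    "3": "足引きの 山鳥の尾の しだり尾の ながながし夜を ひとりかもねむ",
}


def analyze(text):
    """Simplified analyzer: one term per non-whitespace character."""
    return [char for char in text if not char.isspace()]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def match(index, query, operator="or"):
    """Analyze the query, then combine the per-term postings with OR or AND."""
    postings = [index.get(term, set()) for term in analyze(query)]
    if not postings:
        return set()
    if operator == "and":
        return set.intersection(*postings)
    return set.union(*postings)


INDEX = build_inverted_index(DOCS)
print(sorted(match(INDEX, "秋")))                        # ['1']
print(sorted(match(INDEX, "秋の田の")))                  # ['1', '2', '3']
print(sorted(match(INDEX, "秋の田の", operator="and")))  # ['1']
```

When the order and adjacency of the terms matter as well, Elasticsearch also provides a match_phrase query; the toy index above does not store positions, so it cannot model that case.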

5. Changing how documents are split

It turns out that how documents and search words are split can be configured in advance.

According to the official documentation:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed

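Paraphrase: Mapping is the process of defining how a document and the fields it contains are stored and indexed.

As a test, I created a new index, test-index2, and set kuromoji_tokenizer, a morphological analyzer for Japanese, on the ku field.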

PUT test-index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "author": {
        "type": "text"
      },
      "ku": {
        "type": "text",
        "analyzer": "my_kuromoji_analyzer"
      }
    }
  }
}

As before, check how the text is split using the analyze API.

GET test-index2/_analyze
{
  "field": "ku",
  "text": "秋の田の かりほの庵の とまをあらみ わが衣手は 露にぬれつつ"
}

Looking closely, the way the text is split has changed!!

{
  "tokens" : [
    {
      "token" : "秋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
(omitted...)
    {
      "token" : "わが",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "衣手",
      "start_offset" : 21,
      "end_offset" : 23,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "は",
      "start_offset" : 23,
      "end_offset" : 24,
      "type" : "word",
      "position" : 16
    },
    {
      "token" : "露",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "word",
      "position" : 17
    },
    {
      "token" : "に",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "word",
      "position" : 18
    },
    {
      "token" : "ぬれ",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "word",
      "position" : 19
    },
    {
      "token" : "つつ",
      "start_offset" : 29,
      "end_offset" : 31,
      "type" : "word",
      "position" : 20
    }
  ]
}

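6. How should the mapping be defined?

The example above used a morphological analyzer for Japanese, but in real work a setting like that alone still leaves problems.

In English, text can be split on the spaces between words, as below.

We / can / do / it

Japanese, however, has no spaces; words are joined by particles, auxiliary verbs, and so on. As a result, depending on the context and background of the document, the text may be split in a way that looks unnatural to a human reader.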

For example, you may want to search for "京都" (Kyoto) and get a hit on "東京都" (Tokyo Metropolis) instead.
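One way such a false hit can arise is when text is cut into overlapping character pairs (bigrams) rather than words. The Python sketch below contrasts that with a greedy dictionary-based split. It is an illustration only: the three-word dictionary is hypothetical, the function names are invented, and neither function reproduces what kuromoji or any Elasticsearch tokenizer actually outputs.

```python
def bigrams(text):
    """Split text into overlapping two-character terms (character bigrams)."""
    return [text[i : i + 2] for i in range(len(text) - 1)]


def dictionary_split(text, dictionary):
    """Greedy longest-match segmentation against a tiny hypothetical dictionary."""
    terms = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            candidate = text[i : i + length]
            # Fall back to a single character when no dictionary word fits.
            if candidate in dictionary or length == 1:
                terms.append(candidate)
                i += length
                break
    return terms


def hits(query_terms, doc_terms):
    """A document hits when every query term appears among its terms."""
    return all(term in doc_terms for term in query_terms)


DICTIONARY = {"東京", "京都", "都"}  # hypothetical, for illustration only

print(bigrams("東京都"))                       # ['東京', '京都']
print(dictionary_split("東京都", DICTIONARY))  # ['東京', '都']
print(hits(bigrams("京都"), bigrams("東京都")))  # True: the unwanted hit
print(
    hits(
        dictionary_split("京都", DICTIONARY),
        dictionary_split("東京都", DICTIONARY),
    )
)  # False
```

The trade-off runs in both directions: a word-based split avoids this false hit but depends on the dictionary segmenting the text the way the searcher expects.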

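As an extreme example, there is also the question of how to segment "うらにわにはにわとりがいる", which can be read as "there are chickens in the back garden" or as "there are two birds in the back garden".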
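The ambiguity can be made visible by enumerating every way the sentence splits into known words. The Python sketch below does this against a hypothetical mini-dictionary chosen for the example; a real morphological analyzer works from a far larger dictionary and uses costs to pick one candidate, which this sketch does not attempt.

```python
def segmentations(text, dictionary):
    """Return every way to split text into a sequence of dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Recurse on the remainder and prepend the word just consumed.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Hypothetical mini-dictionary, for illustration only.
DICTIONARY = {"うらにわ", "に", "は", "には", "にわ", "にわとり", "とり", "が", "いる"}

candidates = segmentations("うらにわにはにわとりがいる", DICTIONARY)
for words in candidates:
    print(" / ".join(words))
```

Even with nine dictionary entries the sentence has four valid segmentations, and nothing in the text itself says which one the writer meant.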

I have not yet found the right way to deal with these problems, so I would like to keep investigating.

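7. Summary

Elasticsearch splits documents into terms when they are registered, and it also splits the search words at search time. How the text is split can be defined through the mapping, but the design has to take the difficulties peculiar to Japanese into account.

8. References

※1 Kibana https://www.elastic.co/jp/products/kibana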
