Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data and helps you discover the expected and uncover the unexpected.
IK Analysis for Elasticsearch downloads and documentation; the first link is a mirror node hosted in China:
https://gitcode.net/mirrors/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-ik
IK tokenizer extension words and stop words (article in Chinese):
http://www.javacui.com/tool/631.html
The IK Analysis plugin integrates the Lucene IK analyzer (http://code.google.com/p/ik-analyzer/) into Elasticsearch and supports a customized dictionary.
It provides two analyzers and matching tokenizers: ik_smart and ik_max_word. ik_max_word splits text at the finest granularity, while ik_smart produces the coarsest-grained split.
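To see the difference once the plugin is installed (installation is covered next), run the same text through both analyzers in the Kibana Dev Tools console; 中华人民共和国 here is just an arbitrary sample string:

POST /_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_max_word"
}

POST /_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_smart"
}

ik_max_word typically returns overlapping sub-words (中华人民共和国, 中华人民, 中华, 华人, and so on), which favors recall, while ik_smart returns only the longest match.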
For this walkthrough I downloaded elasticsearch-analysis-ik-8.2.0, created a new folder named ik under elasticsearch-8.2.2\plugins, and unzipped the downloaded file into it.
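The plugins directory should then look roughly like this (a sketch only; the exact file list depends on the release zip you downloaded):

elasticsearch-8.2.2\
    plugins\
        ik\
            plugin-descriptor.properties
            elasticsearch-analysis-ik-8.2.0.jar
            (remaining jars and dictionary config from the zip)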
Note: since the ES version used throughout this series is 8.2.2 while the plugin build targets 8.2.0, the IK configuration has to be adapted. Edit the config file plugin-descriptor.properties:
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=8.2.2
and change elasticsearch.version to the version of ES you actually have installed.
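After saving the change and restarting Elasticsearch, you can confirm the plugin was actually loaded with the standard cat API:

GET /_cat/plugins?v

The response should list the IK plugin along with the version you set. If elasticsearch.version still mismatches the running ES, the node refuses to start, exactly as the comment in the descriptor warns.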
With the node back up, test the analyzer by entering the following in the Kibana Dev Tools console:
POST /_analyze
{
  "text": "你好,我知道了,中国是一个伟大的祖国,我爱你",
  "analyzer": "ik_smart"
}
The response:
{ "tokens" : [ { "token" : "你好", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 }, { "token" : "我", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 1 }, { "token" : "知道了", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 2 }, { "token" : "中", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 3 }, { "token" : "国是", "start_offset" : 9, "end_offset" : 11, "type" : "CN_WORD", "position" : 4 }, { "token" : "一个", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 5 }, { "token" : "伟大", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 6 }, { "token" : "的", "start_offset" : 15, "end_offset" : 16, "type" : "CN_CHAR", "position" : 7 }, { "token" : "祖国", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 8 }, { "token" : "我爱你", "start_offset" : 19, "end_offset" : 22, "type" : "CN_WORD", "position" : 9 } ] }
It is working, but there is a problem: the Chinese word segmentation is not particularly friendly or smart. Notice above that 中国 was split into 中 and 国是 instead of being kept together. To handle cases like this you need to extend the vocabulary with your own dictionary.
The "IK tokenizer extension words and stop words" article linked above explains, near the end, how to import Sogou's official lexicon as your own dictionary; refer to it for details.
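After extending the dictionary and restarting, verify new entries the same way as above; 自定义新词 below is a hypothetical placeholder for whatever word you actually added:

POST /_analyze
{
  "text": "自定义新词",
  "analyzer": "ik_smart"
}

If the entry was loaded, it comes back as a single CN_WORD token rather than being split character by character.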
END