Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data and helps you discover the expected and uncover the unexpected.
IK Analysis for Elasticsearch downloads and documentation; the first link is a mirror node hosted in China:
https://gitcode.net/mirrors/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-ik
IK tokenizer extension words and stop words (article in Chinese):
http://www.javacui.com/tool/631.html
The IK Analysis plugin integrates the Lucene IK analyzer (http://code.google.com/p/ik-analyzer/) into Elasticsearch and supports a customized dictionary.
It provides two analyzers and matching tokenizers: ik_smart and ik_max_word. ik_max_word splits text at the finest granularity, while ik_smart produces the coarsest-grained split.
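To see the difference once the plugin is installed (installation is covered next), run the same text through both analyzers in the Kibana Dev Tools console; 中华人民共和国 here is just an arbitrary sample string:

POST /_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_max_word"
}

POST /_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_smart"
}

ik_max_word typically returns overlapping sub-words (中华人民共和国, 中华人民, 中华, 华人, and so on), which favors recall, while ik_smart returns only the longest match.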
For this walkthrough I downloaded elasticsearch-analysis-ik-8.2.0, created a new folder named ik under elasticsearch-8.2.2\plugins, and unzipped the downloaded file into it.
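The plugins directory should then look roughly like this (a sketch only; the exact file list depends on the release zip you downloaded):

elasticsearch-8.2.2\
    plugins\
        ik\
            plugin-descriptor.properties
            elasticsearch-analysis-ik-8.2.0.jar
            (remaining jars and dictionary config from the zip)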
Note: since the ES version used throughout this series is 8.2.2 while the plugin build targets 8.2.0, the IK configuration has to be adapted. Edit the config file plugin-descriptor.properties:
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=8.2.2
and change elasticsearch.version to the version of ES you actually have installed.
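After saving the change and restarting Elasticsearch, you can confirm the plugin was actually loaded with the standard cat API:

GET /_cat/plugins?v

The response should list the IK plugin along with the version you set. If elasticsearch.version still mismatches the running ES, the node refuses to start, exactly as the comment in the descriptor warns.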
With the node back up, test the analyzer by entering the following in the Kibana Dev Tools console:
POST /_analyze
{
  "text": "你好,我知道了,中国是一个伟大的祖国,我爱你",
  "analyzer": "ik_smart"
}
The response:
{ "tokens" : [ { "token" : "你好", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 }, { "token" : "我", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 1 }, { "token" : "知道了", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 2 }, { "token" : "中", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 3 }, { "token" : "国是", "start_offset" : 9, "end_offset" : 11, "type" : "CN_WORD", "position" : 4 }, { "token" : "一个", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 5 }, { "token" : "伟大", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 6 }, { "token" : "的", "start_offset" : 15, "end_offset" : 16, "type" : "CN_CHAR", "position" : 7 }, { "token" : "祖国", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 8 }, { "token" : "我爱你", "start_offset" : 19, "end_offset" : 22, "type" : "CN_WORD", "position" : 9 } ] }
It is working, but there is a problem: the Chinese word segmentation is not particularly friendly or smart. Notice above that 中国 was split into 中 and 国是 instead of being kept together. To handle cases like this you need to extend the vocabulary with your own dictionary.
The "IK tokenizer extension words and stop words" article linked above explains, near the end, how to import Sogou's official lexicon as your own dictionary; refer to it for details.
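After extending the dictionary and restarting, verify new entries the same way as above; 自定义新词 below is a hypothetical placeholder for whatever word you actually added:

POST /_analyze
{
  "text": "自定义新词",
  "analyzer": "ik_smart"
}

If the entry was loaded, it comes back as a single CN_WORD token rather than being split character by character.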
END