<!-- # [Elasticsearch 5.5] Testing the smartcn analyzer -->
<!-- elasticsearch-55-test-smartcn-analyzer -->

You can use the *curl* tool to test how the *smartcn* analyzer tokenizes text (for how to install the *smartcn* analyzer, see [this blog post][3]).

Download *curl* for Windows here: [https://curl.haxx.se/download.html][2]

The command below works on Linux. The Windows command prompt does not treat single quotes as quoting characters, and it does not allow an argument to span multiple lines.

```sh
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer" : "smartcn",
  "text" : ["学然后知不足", "教然后知困"]
}
'
```

On Windows, switch to double quotes, remove the line breaks, and escape the double quotes inside the string with `\`.

```bash
curl -XGET "localhost:9200/_analyze?pretty" -H "Content-Type: application/json" -d"{ \"analyzer\" : \"smartcn\", \"text\" : [\"学然后知不足\", \"教然后知困\"] }"
```

Running it produces the following error:

```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Failed to parse request body"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Failed to parse request body",
    "caused_by" : {
      "type" : "json_parse_exception",
      "reason" : "Invalid UTF-8 start byte 0xba\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@53b7c81e; line: 1, column: 43]"
    }
  },
  "status" : 400
}
```

The command format itself is correct: if the *text* values are English, it runs without error. The failure occurs because the Chinese characters are not sent as *UTF-8*.

You can check the console's encoding by right-clicking the command prompt window's title bar and choosing *Properties*.

Running *chcp 65001* switches the console to *UTF-8*, but the request still failed to parse.

The fix is to save the *json* content of the *-d* parameter to a file named *smartcn-test.json*. The file must be encoded as *UTF-8* (you can confirm this in Notepad via *Save As -> Encoding*).

Change the command to the following form and run it:

```bash
curl -XGET "localhost:9200/_analyze?pretty" -H "Content-Type: application/json" -d@smartcn-test.json > smartcn-test-result.json
```

Contents of *smartcn-test.json*:

```json
{
    "analyzer": "smartcn",
    "text": [
        "学然后知不足",
        "教然后知困"
    ]
}
```

Contents of *smartcn-test-result.json* after running the command:

```json
{
  "tokens" : [
    {
      "token" : "学",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "然后",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "知",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "不足",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "教",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "然后",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "知",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "困",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 7
    }
  ]
}
```

For comparison, test the default analyzer as well. Opening the following link returns the default analyzer's tokenization result:

[http://localhost:9200/_analyze?text=学然后知不足%20教然后知困](http://localhost:9200/_analyze?text=%E5%AD%A6%E7%84%B6%E5%90%8E%E7%9F%A5%E4%B8%8D%E8%B6%B3%20%E6%95%99%E7%84%B6%E5%90%8E%E7%9F%A5%E5%9B%B0)

```json
{
    "tokens": [
        {
            "token": "学",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "然",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "后",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "知",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "不",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "足",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "教",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "然",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "后",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "知",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "困",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        }
    ]
}
```

<!-- References -->

[1]: https://elasticsearch.cn/question/1451 (你好,请教es5.2 如何设置默认分词器为ik?谢谢)
[2]: https://curl.haxx.se/download.html (curl / Download)
[3]: https://www.liujiajia.me/2019/10/8/elasticsearch-55-chinese-analyzer-smartcn ([ElasticSearch 5.5] 使用 analysis-smartcn 插件实现中文分词)
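Copyright notice: This is an original article by the blogger 「佳佳」, released under the CC 4.0 BY-NC-SA license. When reposting, please include a link to the original source and this notice. Original link: https://www.liujiajia.me/2019/10/9/elasticsearch-55-test-smartcn-analyzer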