Chinese Word Segmentation in Elasticsearch 5.5 with the analysis-smartcn Plugin

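Elasticsearch's default analyzer is designed for English-style text. Applied to Chinese, it splits the text into individual characters rather than into words.

To segment Chinese text into words, you need a Chinese analyzer plugin. This article uses the analysis-smartcn plugin.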

Install the analysis-smartcn plugin:

bash

bin\elasticsearch-plugin install analysis-smartcn

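Note: all commands in this article were run on Windows. On Linux they differ slightly (for example, the path uses forward slashes: bin/elasticsearch-plugin). Restart the node after installing or removing a plugin so that the change takes effect.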

Remove the analysis-smartcn plugin:

bash

bin\elasticsearch-plugin remove analysis-smartcn

A field that uses the smartcn analyzer is mapped as follows:

json

"description": {
    "analyzer": "smartcn",
    "type": "text"
},
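To show where that fragment sits, here is a minimal sketch of a full create-index request body for Elasticsearch 5.x. It is written in Python with only the standard library so it can be run without a .NET project. The type name person and the index name people are taken from the appendices; the helper name build_create_index_body is made up for this illustration.

```python
import json


def build_create_index_body(type_name="person"):
    """Return a create-index body whose description field uses smartcn."""
    return {
        "mappings": {
            # Elasticsearch 5.x still nests properties under a mapping type.
            type_name: {
                "properties": {
                    "description": {"type": "text", "analyzer": "smartcn"}
                }
            }
        }
    }


# Body for a request such as: PUT http://localhost:9200/people
print(json.dumps(build_create_index_body(), indent=4))
```

Fields without an explicit analyzer, such as firstName and lastName in the appendix, keep the default analyzer.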

For NEST sample code and the complete index structure, see Appendix 1. Creating the Index with NEST (.NET Core) and Appendix 2. Complete Index Structure below.

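Take 学然后知不足 as an example. With smartcn it is automatically split into the following four terms (the default analyzer would instead produce one term per Chinese character):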

学
然后
知
不足
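One way to check this tokenization on your own node is the _analyze API. The sketch below uses Python's standard library instead of NEST. It assumes a node at http://localhost:9200 (the address used in Appendix 1) with the plugin installed; the helper names build_analyze_request and fetch_tokens are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, as in Appendix 1


def build_analyze_request(text, analyzer="smartcn"):
    """Return (url, body_bytes) for a call to the _analyze API."""
    body = {"analyzer": analyzer, "text": text}
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return ES_URL + "/_analyze", data


def fetch_tokens(text, analyzer="smartcn"):
    """Send the request and return the token strings.

    Needs a running node, so it is only defined here, not called.
    """
    url, data = build_analyze_request(text, analyzer)
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    # The response lists each token with its text and offsets.
    return [t["token"] for t in payload["tokens"]]
```

Against a node with the plugin installed, fetch_tokens("学然后知不足") should return the four terms listed above, while passing analyzer="standard" should return one token per character.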

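With match, term, and wildcard queries you must search for a whole term (such as 然后). Searching for 然 or 后 alone returns no data, because only the whole terms are stored in the index.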
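The behavior described above comes from lookups running against the terms stored in the index, not against the original text. Here is a toy model in Python: a plain dictionary stands in for the inverted index, and the smartcn output is hard-coded from the example above. It illustrates the idea only and is not how Elasticsearch is implemented.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def term_lookup(index, term):
    """Exact lookup of one term, returning matching document ids."""
    return sorted(index.get(term, set()))


# Terms for document 1 as produced by smartcn in the example above.
smartcn_index = build_index({1: ["学", "然后", "知", "不足"]})
# Default behaviour: one term per Chinese character.
per_char_index = build_index({1: list("学然后知不足")})

print(term_lookup(smartcn_index, "然后"))  # [1]
print(term_lookup(smartcn_index, "然"))    # []
print(term_lookup(per_char_index, "然"))   # [1]
```

The single character 然 is found only in the per-character index, which is why it matches with the default analyzer but not with smartcn.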

Word segmentation may reduce the number of hits, but the results are more precise.

Appendix 1. Creating the Index with NEST (.NET Core)

Install the NEST package

powershell

Install-Package NEST -Version 5.5.0

Program.cs

csharp

using Nest;
using System;

namespace NestSample
{
    class Program
    {
        static void Main(string[] args)
        {
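            // Connect to the local node and use "people" as the default index.
            var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
                .DefaultIndex("people");

            var client = new ElasticClient(settings);

            // Sample document; Description is the field analyzed with smartcn.
            var person = new Person
            {
                Id = 1,
                FirstName = "佳佳",
                LastName = "刘",
                Description = "学然后知不足",
            };

            // Create the index, deriving the mapping from the attributes on Person.
            var createIndexResponse = client.CreateIndex("people", c => c
                .Mappings(ms => ms
                    .Map<Person>(m => m.AutoMap())
                )
            );

            // Index the document into the default index.
            var indexResponse = client.Index(person);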
        }
    }
}

Person.cs

csharp

using Nest;

namespace NestSample
{
    [ElasticsearchType(Name = "person")]
    public class Person
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        // Analyze this field with smartcn instead of the default analyzer.
        [Text(Analyzer = "smartcn")]
        public string Description { get; set; }
    }
}

Appendix 2. Complete Index Structure

json

{
    "state": "open",
    "settings": {
        "index": {
            "creation_date": "1570518070722",
            "number_of_shards": "5",
            "number_of_replicas": "1",
            "uuid": "bt-ghp4UTPqb1b6mrfEkoQ",
            "version": {
                "created": "5050099"
            },
            "provided_name": "people"
        }
    },
    "mappings": {
        "person": {
            "properties": {
                "firstName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "lastName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "description": {
                    "analyzer": "smartcn",
                    "type": "text"
                },
                "id": {
                    "type": "long"
                }
            }
        }
    },
    "aliases": [],
    "primary_terms": {
        "0": 1,
        "1": 1,
        "2": 1,
        "3": 1,
        "4": 1
    },
    "in_sync_allocations": {
        "0": [
            "c_GjRliYSuyqunST3L2fLw"
        ],
        "1": [
            "lPnj8fxhSyy97oSDEnlapA"
        ],
        "2": [
            "FAeXWNHdRjWoAeAbUQJeCA"
        ],
        "3": [
            "PAfwYeabTSKf06EWYfWqtA"
        ],
        "4": [
            "UbCDcsqTQC-gUInZQacMyg"
        ]
    }
}
