
Chinese Word Segmentation in Elasticsearch 5.5 with the analysis-smartcn Plugin

🏷️ Elasticsearch

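Elasticsearch's default analyzer is designed for English. Applied to Chinese text, it emits one token per character instead of grouping characters into words.

Proper Chinese word segmentation requires a Chinese analyzer plugin. This post uses the analysis-smartcn plugin.

Install the analysis-smartcn plugin: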

```bat
bin\elasticsearch-plugin install analysis-smartcn
```

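Note: every command in this post was run on Windows. On Linux the commands differ slightly (for example, the path uses forward slashes: bin/elasticsearch-plugin).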

Remove the analysis-smartcn plugin:

```bat
bin\elasticsearch-plugin remove analysis-smartcn
```

A field that uses the smartcn analyzer is mapped as follows:

```json
"description": {
    "analyzer": "smartcn",
    "type": "text"
},
```

For a NEST code sample and the complete index structure, see Appendix 1 and Appendix 2 at the end of this post.
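The same field definition can also be sent straight to the REST API when creating an index. The following is a minimal sketch, not the method used in this post: the helper names `mapping_body` and `create_index` and the `ES_URL` variable are hypothetical, and it assumes `curl` is installed and a node with the plugin is listening on `http://localhost:9200`.

```shell
# Hypothetical helpers; ES_URL defaults to the address used in the NEST sample.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print a create-index body that maps one text field to an analyzer.
# $1 = mapping type, $2 = field name, $3 = analyzer name.
mapping_body() {
    printf '{"mappings":{"%s":{"properties":{"%s":{"type":"text","analyzer":"%s"}}}}}' "$1" "$2" "$3"
}

# Create the index named in $1 using the body built above.
# $2..$4 are passed through to mapping_body.
create_index() {
    curl -s -X PUT "$ES_URL/$1" \
        -H 'Content-Type: application/json' \
        -d "$(mapping_body "$2" "$3" "$4")"
}
```

With a node running, `create_index people person description smartcn` would request an index shaped like the one in this post, except that only the description field is mapped explicitly.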

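Take 学然后知不足 as an example. With smartcn it is automatically split into the following four tokens (the default analyzer would instead emit each Chinese character as its own token):

  • 学
  • 然后
  • 知
  • 不足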
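One way to inspect the tokens an analyzer produces is the `_analyze` API. The sketch below is illustrative only: the helper names `analyze_body` and `analyze_text` and the `ES_URL` variable are hypothetical, and it assumes `curl` is installed and a node with the plugin is listening on `http://localhost:9200`.

```shell
# Hypothetical helpers; ES_URL defaults to the address used in the NEST sample.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the JSON body for an _analyze request.
# $1 = analyzer name, $2 = text (assumed to contain no double quotes or backslashes).
analyze_body() {
    printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the body to the _analyze endpoint; the response lists the tokens produced.
analyze_text() {
    curl -s -X POST "$ES_URL/_analyze" \
        -H 'Content-Type: application/json' \
        -d "$(analyze_body "$1" "$2")"
}
```

Running `analyze_text smartcn '学然后知不足'` and then `analyze_text standard '学然后知不足'` against a live node lets you compare the two token lists side by side.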

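With match, term, or wildcard queries, the search text must be a whole token (for example 然后). Searching for only one character of that token returns no data.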

Word-level segmentation may return fewer hits, but the hits are more precise.
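The whole-token behaviour can be checked with a match query against the description field. This is a rough sketch under assumptions: the helper names `match_body` and `search_index` and the `ES_URL` variable are hypothetical, and it assumes `curl` is installed and that the people index from this post already holds the sample document.

```shell
# Hypothetical helpers; ES_URL defaults to the address used in the NEST sample.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print a search body containing a single match query.
# $1 = field name, $2 = query text (assumed to contain no double quotes or backslashes).
match_body() {
    printf '{"query":{"match":{"%s":"%s"}}}' "$1" "$2"
}

# Run the match query against the index named in $1.
search_index() {
    curl -s -X POST "$ES_URL/$1/_search" \
        -H 'Content-Type: application/json' \
        -d "$(match_body "$2" "$3")"
}
```

Per the behaviour described above, `search_index people description 然后` should find the sample document, while a query containing only one character of that token should come back empty.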

Appendix 1. Creating the Index with NEST (.NET Core)

Install the NEST package:

```powershell
Install-Package NEST -Version 5.5.0
```

Program.cs

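```csharp
using Nest;
using System;

namespace NestSample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Connect to the local node and use "people" as the default index.
            var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
                .DefaultIndex("people");

            var client = new ElasticClient(settings);

            var person = new Person
            {
                Id = 1,
                FirstName = "佳佳",
                LastName = "刘",
                Description = "学然后知不足",
            };

            // Create the index; AutoMap() derives the mapping from the Person class,
            // including the smartcn analyzer declared on Description.
            var createIndexResponse = client.CreateIndex("people", c => c
                .Mappings(ms => ms
                    .Map<Person>(m => m.AutoMap())
                )
            );

            // Index the sample document into the default index.
            var indexResponse = client.Index(person);

            // NEST does not throw on failed requests by default, so report problems explicitly.
            if (!createIndexResponse.IsValid || !indexResponse.IsValid)
            {
                Console.WriteLine("A request failed; inspect DebugInformation on the responses.");
            }
        }
    }
}
```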

Person.cs

```csharp
using Nest;

namespace NestSample
{
    // Maps this class to the "person" mapping type.
    [ElasticsearchType(Name = "person")]
    public class Person
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }

        // Analyze this text field with the smartcn analyzer from the plugin.
        [Text(Analyzer = "smartcn")]
        public string Description { get; set; }
    }
}
```

Appendix 2. Complete Index Structure

```json
{
    "state": "open",
    "settings": {
        "index": {
            "creation_date": "1570518070722",
            "number_of_shards": "5",
            "number_of_replicas": "1",
            "uuid": "bt-ghp4UTPqb1b6mrfEkoQ",
            "version": {
                "created": "5050099"
            },
            "provided_name": "people"
        }
    },
    "mappings": {
        "person": {
            "properties": {
                "firstName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "lastName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "description": {
                    "analyzer": "smartcn",
                    "type": "text"
                },
                "id": {
                    "type": "long"
                }
            }
        }
    },
    "aliases": [],
    "primary_terms": {
        "0": 1,
        "1": 1,
        "2": 1,
        "3": 1,
        "4": 1
    },
    "in_sync_allocations": {
        "0": [
            "c_GjRliYSuyqunST3L2fLw"
        ],
        "1": [
            "lPnj8fxhSyy97oSDEnlapA"
        ],
        "2": [
            "FAeXWNHdRjWoAeAbUQJeCA"
        ],
        "3": [
            "PAfwYeabTSKf06EWYfWqtA"
        ],
        "4": [
            "UbCDcsqTQC-gUInZQacMyg"
        ]
    }
}
```