
Chinese Word Segmentation in Elasticsearch 5.5 with the analysis-smartcn Plugin

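Elasticsearch's default analyzer is geared toward English-style text, so Chinese text is split into single characters rather than into words.

To segment Chinese text into words, you need a Chinese analyzer plugin. This article uses the analysis-smartcn plugin.

Install the analysis-smartcn plugin: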

bash
bin\elasticsearch-plugin install analysis-smartcn

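Note: all commands in this article were run on Windows. On Linux the commands differ slightly, for example the path uses forward slashes (bin/elasticsearch-plugin).

Remove the analysis-smartcn plugin: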

bash
bin\elasticsearch-plugin remove analysis-smartcn

A field that uses the smartcn analyzer is mapped as follows:

json
"description": {
    "analyzer": "smartcn",
    "type": "text"
},
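The fragment above belongs inside an index's mappings. As a minimal sketch, assuming the index name people and the type name person used in the appendix, a create-index request body could look like the following (sent as PUT http://localhost:9200/people; the host and port are taken from the NEST sample and may differ in your setup):

```json
{
    "mappings": {
        "person": {
            "properties": {
                "description": {
                    "type": "text",
                    "analyzer": "smartcn"
                }
            }
        }
    }
}
```

Only the description field is declared here; the other fields in the appendix are omitted to keep the sketch short.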

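For a NEST code sample and the complete index structure, see "Appendix 1. Creating the Index with NEST (.NET Core)" and "Appendix 2. Complete Index Structure" below.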

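Take 学然后知不足 as an example. With smartcn it is automatically segmented into the following four tokens (with the default analyzer, each Chinese character becomes its own token):

  • 学
  • 然后
  • 知
  • 不足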
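One way to check this segmentation yourself is the _analyze API. The body below is a minimal sketch, assuming the plugin is installed and the node is reachable at http://localhost:9200 as in the NEST sample; it would be sent as POST http://localhost:9200/_analyze:

```json
{
    "analyzer": "smartcn",
    "text": "学然后知不足"
}
```

The response contains a tokens array with one entry per token, which can be compared against the list above.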

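When querying with match, term, or wildcard, use the whole word (for example 然后); searching with only a single character of that word returns no data.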
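As an illustration, a match query on the whole word could be written as follows. This is a sketch that assumes the people index and description field from the appendix, sent as POST http://localhost:9200/people/_search:

```json
{
    "query": {
        "match": {
            "description": "然后"
        }
    }
}
```

Replacing 然后 with just one of its characters is the case that, per the note above, is expected to return no hits.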

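Word segmentation may reduce the number of results returned, but the results will be more precise.

Appendix 1. Creating the Index with NEST (.NET Core)

Install the NEST package: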

powershell
Install-Package NEST -Version 5.5.0

Program.cs

csharp
using Nest;
using System;

namespace NestSample
{
    class Program
    {
        static void Main(string[] args)
        {
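            // Connect to the local node and use "people" as the default index.
            var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
                .DefaultIndex("people");

            var client = new ElasticClient(settings);

            // Sample document; Description is the field analyzed with smartcn.
            var person = new Person
            {
                Id = 1,
                FirstName = "佳佳",
                LastName = "刘",
                Description = "学然后知不足",
            };

            // Create the index; AutoMap() builds the mapping from the Person
            // class, including the [Text(Analyzer = "smartcn")] attribute.
            var createIndexResponse = client.CreateIndex("people", c => c
                .Mappings(ms => ms
                    .Map<Person>(m => m.AutoMap())
                )
            );

            // Index the sample document into the default index.
            var indexResponse = client.Index(person);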
        }
    }
}

Person.cs

csharp
using Nest;

namespace NestSample
{
    [ElasticsearchType(Name = "person")]
    public class Person
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        [Text(Analyzer = "smartcn")]
        public string Description { get; set; }
    }
}

Appendix 2. Complete Index Structure

json
{
    "state": "open",
    "settings": {
        "index": {
            "creation_date": "1570518070722",
            "number_of_shards": "5",
            "number_of_replicas": "1",
            "uuid": "bt-ghp4UTPqb1b6mrfEkoQ",
            "version": {
                "created": "5050099"
            },
            "provided_name": "people"
        }
    },
    "mappings": {
        "person": {
            "properties": {
                "firstName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "lastName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "description": {
                    "analyzer": "smartcn",
                    "type": "text"
                },
                "id": {
                    "type": "long"
                }
            }
        }
    },
    "aliases": [],
    "primary_terms": {
        "0": 1,
        "1": 1,
        "2": 1,
        "3": 1,
        "4": 1
    },
    "in_sync_allocations": {
        "0": [
            "c_GjRliYSuyqunST3L2fLw"
        ],
        "1": [
            "lPnj8fxhSyy97oSDEnlapA"
        ],
        "2": [
            "FAeXWNHdRjWoAeAbUQJeCA"
        ],
        "3": [
            "PAfwYeabTSKf06EWYfWqtA"
        ],
        "4": [
            "UbCDcsqTQC-gUInZQacMyg"
        ]
    }
}
