    Installing and configuring the elasticsearch ik Chinese analyzer


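    elasticsearch ships with Chinese word segmentation, but it is very crude (a comparison follows in section 7). The analysis ik plugin is recommended here: by the author's estimate, 80-90% of people who use es for full-text search use it for Chinese text, and it is still actively maintained.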

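    1. elasticsearch analyzers

    elasticsearch has many built-in analyzers, for example standard (the standard analyzer), english (English analysis), and chinese (Chinese analysis).

    standard simply splits Chinese text one character at a time, so it applies to almost any text but is not precise;

    english is smarter about English: it handles singular and plural forms and letter case, filters stopwords (such as "the"), and so on.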
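
    To make the difference concrete, here is a rough Python imitation of the two behaviours described above. It is only a sketch: the function names and the stopword list are made up for illustration, and the real english analyzer uses a proper stemmer rather than stripping a trailing "s".

```python
import re

# Illustrative stopword list (an assumption); the real `english` analyzer ships its own.
STOPWORDS = {"the", "a", "an", "and", "of"}


def standard_like(text):
    """Imitate `standard` on mixed text: one token per Han character,
    while runs of ASCII letters/digits stay together and are lowercased."""
    pattern = r"[\u4e00-\u9fff]|[A-Za-z0-9]+"
    return [m.group(0).lower() for m in re.finditer(pattern, text)]


def english_like(text):
    """Imitate `english` very loosely: lowercase, drop stopwords,
    and strip a plural "s" (a naive stand-in for real stemming)."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        tokens.append(word)
    return tokens


print(standard_like("感叹号"))       # ['感', '叹', '号']
print(english_like("The Servers"))  # ['server']
```

    The per-character output for 感叹号 is the same shape as what the standard analyzer returns in section 7.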

    2. Install maven

    $ brew install maven     # macOS
    # apt-get install maven  # Ubuntu
    # yum install maven      # CentOS or Red Hat
    
    $ mvn -v
    Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-04T03:39:06+08:00)
    Maven home: /usr/local/Cellar/maven/3.5.0/libexec
    Java version: 1.8.0_112, vendor: Oracle Corporation
    Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_112.jdk/Contents/Home/jre
    Default locale: zh_CN, platform encoding: UTF-8
    OS name: "mac os x", version: "10.12.6", arch: "x86_64", family: "mac"

    3. Download the analysis ik plugin

    $ git clone https://github.com/medcl/elasticsearch-analysis-ik.git
    $ cd elasticsearch-analysis-ik
    $ git branch -a    # git checkout the branch that matches your es version
    * master           # the master branch targets es 6.2.3
     remotes/origin/2.x
     remotes/origin/5.3.x
     remotes/origin/5.x
     remotes/origin/6.1.x
     remotes/origin/HEAD -> origin/master
     remotes/origin/arkxu-master
     remotes/origin/master
     remotes/origin/revert-80-patch-1
     remotes/origin/rm
     remotes/origin/wyhw-ik_lucene4
    
    $ mvn package  # build the plugin package
    
    $ ll target/releases/
    total 4400
    drwxr-xr-x 3 zhangying staff 102 4 24 13:46 ./
    drwxr-xr-x 11 zhangying staff 374 4 24 13:32 ../
    -rw-r--r-- 1 zhangying staff 4501993 4 24 13:32 elasticsearch-analysis-ik-6.2.3.zip
    
    # A zip file is generated in the releases directory; unzip it
    $ cd target/releases/ && unzip elasticsearch-analysis-ik-6.2.3.zip
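
    Because the plugin has to be built for the same es version you run, it can help to check the unpacked plugin before installing it. The sketch below assumes the unpacked directory contains a plugin-descriptor.properties file with an elasticsearch.version key; the sample contents and the helper names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def plugin_matches(descriptor_text, es_version):
    """True when the descriptor declares exactly the given es version."""
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version


# Hypothetical descriptor contents, for illustration only.
SAMPLE_DESCRIPTOR = """
# plugin descriptor
name=analysis-ik
version=6.2.3
elasticsearch.version=6.2.3
"""

print(plugin_matches(SAMPLE_DESCRIPTOR, "6.2.3"))  # True
print(plugin_matches(SAMPLE_DESCRIPTOR, "6.1.0"))  # False
```

    In practice you would read the file from the unpacked directory and compare it with the version reported by brew info elasticsearch in the next step.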

    4. Install the analysis ik plugin

    $ brew info elasticsearch
    elasticsearch: stable 6.2.3, HEAD
    Distributed search & analytics engine
    
    https://www.elastic.co/products/elasticsearch
    
    /usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
     Built from source on 2018-04-24 at 14:17:01
    From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
    ==> Requirements
    Required: java = 1.8 ✔
    ==> Options
    --HEAD
     Install HEAD version
    ==> Caveats
    Data: /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
    Logs: /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
    Plugins: /usr/local/var/elasticsearch/plugins/  # plugin directory
    Config: /usr/local/etc/elasticsearch/
    
    To have launchd start elasticsearch now and restart at login:
     brew services start elasticsearch
    Or, if you don't want/need a background service you can just run:
     elasticsearch
    
    # Move the directory you just unzipped into the plugins directory
    $ mv elasticsearch /usr/local/var/elasticsearch/plugins/ik

    Note: do not add index:analysis:analyzer: settings to elasticsearch.yml. Older versions supported this, but on es 6.x every variant I tried failed with the following error:

    node settings must not contain any index level settings
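
    The error says that index-level settings do not belong in the node configuration, so one possible alternative is to pass the analysis settings when an index is created. The sketch below only builds the request body; it was not part of the steps tested in this article. The "default" analyzer entry is my assumption about how to make ik_max_word the index-wide default, and the function name is hypothetical.

```python
import json


def index_body(default_analyzer="ik_max_word"):
    """Request body for index creation that sets a default analyzer
    for this index only (assumed layout, see the note above)."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "default": {"type": default_analyzer}
                    }
                }
            }
        }
    }


body = json.dumps(index_body(), indent=2)
print(body)
```

    The printed JSON could be sent with curl -XPUT when creating an index, in the same way as the requests in section 6. Setting the analyzer per field in the mapping, as section 6 does, avoids the question entirely.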

    5. Start elasticsearch

    $ elasticsearch  # start

    If output like the following appears, the plugin was loaded successfully:

    [Screenshot: analysis-ik Chinese word segmentation]

    6. Test Chinese word segmentation

    # Create the index
    $ curl -XPUT "http://127.0.0.1:9200/tank?pretty" 
    
    # Create the mapping. _all is disabled (no catch-all full-text field), and
    # username and description use ik_max_word, the finest-grained ik mode.
    $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_mapping?pretty" -H "Content-Type: application/json" -d '
    {
        "chinese": {
                "_all":{
                  "enabled":false
                },
                "properties": {
                    "id": {
                        "type": "integer"
                    },
                    "username": {
                        "type": "text",
                        "analyzer": "ik_max_word"
                    },
                    "description": {
                        "type": "text",
                        "analyzer": "ik_max_word"
                    }
                }
            }
      }
    '
    # Insert two documents
    $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty"  -H "Content-Type: application/json" -d '
    {
        "id" : 1,
        "username" :  "中国高铁速度很快",
        "description" :  "如果要修改一个字段的类型"
    }'
    
    $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty"  -H "Content-Type: application/json" -d '
    {
        "id" : 2,
        "username" :  "动车和复兴号,都属于高铁",
        "description" :  "现在想要修改为string类型"
    }'
    
    # Search
    $ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_search?pretty"  -H "Content-Type: application/json"  -d '
    > {
    >     "query": {
    >         "match": {
    >             "username": "中国高铁"
    >         }
    >     }
    > }
    > '
    {
      "took" : 188,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 2,
        "max_score" : 0.8630463,
        "hits" : [
          {
            "_index" : "tank",
            "_type" : "chinese",
            "_id" : "oJfx_2IBVvjz0l6TkJ6K",
            "_score" : 0.8630463,  // the higher the score, the better the match
            "_source" : {
              "id" : 1,
              "username" : "中国高铁速度很快",
              "description" : "如果要修改一个字段的类型"
            }
          },
          {
            "_index" : "tank",
            "_type" : "chinese",
            "_id" : "oZfx_2IBVvjz0l6Tpp64",
            "_score" : 0.5753642,
            "_source" : {
              "id" : 2,
              "username" : "动车和复兴号,都属于高铁",
              "description" : "现在想要修改为string类型"
            }
          }
        ]
      }
    }
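
    The same search can be issued from code instead of curl. This is a minimal sketch using only the Python standard library; build_search_request is a hypothetical helper, and the host, index, and type are the ones used in the curl commands above. It builds the request without sending it, so it runs even when no node is listening.

```python
import json
import urllib.request


def build_search_request(field, text,
                         base_url="http://127.0.0.1:9200",
                         index="tank", doc_type="chinese"):
    """Build (but do not send) the POST request for a match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_search?pretty",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("username", "中国高铁")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run the search against a live node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```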

    7. Comparing elasticsearch's built-in segmentation with ik

    # ik_smart: the coarsest-grained ik mode
    $ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
    > {
    > "analyzer":"ik_smart",
    > "text":"感叹号"
    > }'
    {
     "tokens" : [
     {
     "token" : "感叹号",
     "start_offset" : 0,
     "end_offset" : 3,
     "type" : "CN_WORD",
     "position" : 0
     }
     ]
    }
    
    # standard: the analyzer built into es
    $ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
    > {
    > "analyzer":"standard",
    > "text":"感叹号"
    > }'
    {
     "tokens" : [
     {
     "token" : "感",
     "start_offset" : 0,
     "end_offset" : 1,
     "type" : "<IDEOGRAPHIC>",
     "position" : 0
     },
     {
     "token" : "叹",
     "start_offset" : 1,
     "end_offset" : 2,
     "type" : "<IDEOGRAPHIC>",
     "position" : 1
     },
     {
     "token" : "号",
     "start_offset" : 2,
     "end_offset" : 3,
     "type" : "<IDEOGRAPHIC>",
     "position" : 2
     }
     ]
    }
    
    # ik_max_word: the finest-grained ik mode
    $ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
    > {
    > "analyzer":"ik_max_word",
    > "text":"感叹号"
    > }'
    {
     "tokens" : [
     {
     "token" : "感叹号",
     "start_offset" : 0,
     "end_offset" : 3,
     "type" : "CN_WORD",
     "position" : 0
     },
     {
     "token" : "感叹",
     "start_offset" : 0,
     "end_offset" : 2,
     "type" : "CN_WORD",
     "position" : 1
     },
     {
     "token" : "叹号",
     "start_offset" : 1,
     "end_offset" : 3,
     "type" : "CN_WORD",
     "position" : 2
     }
     ]
    }
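
    The three outputs above can be reproduced with a toy dictionary-based segmenter. This is only an illustration of the idea, not how ik is implemented: the three-word dictionary is chosen to match this example, and the real plugin uses a much larger dictionary plus its own disambiguation rules.

```python
# Toy dictionary (an assumption), just large enough for this example.
DICTIONARY = {"感叹号", "感叹", "叹号"}


def per_character(text):
    """standard-like behaviour on Chinese: one token per character."""
    return list(text)


def max_word(text, dictionary=DICTIONARY):
    """ik_max_word-like: every dictionary word at every start offset,
    longest first. Characters covered by no word produce no token here."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def smart(text, dictionary=DICTIONARY):
    """ik_smart-like: greedy longest match without overlaps;
    characters not in the dictionary become single-character tokens."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1
        tokens.append(text[start:end])
        start = end
    return tokens


print(per_character("感叹号"))  # ['感', '叹', '号']
print(smart("感叹号"))          # ['感叹号']
print(max_word("感叹号"))       # ['感叹号', '感叹', '叹号']
```

    The trade-off is visible even in this toy: the exhaustive mode indexes more tokens, so a search for 感叹 or 叹号 can still hit a document containing 感叹号, while the greedy mode keeps only the single longest word.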

