MySQL Chinese word segmentation and search: Chinese word segmentation techniques

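This article describes how to implement Chinese word segmentation in a MySQL environment and build a full-text index with Sphinx. It first walks through compiling and installing Sphinx, including the mmseg Chinese word segmentation plugin. It then configures the Sphinx data source and index, generates the index files with `indexer`, resolves a shared-library dependency problem, and runs a test query from the command line.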

# Chinese Word Segmentation with MySQL

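Full-text indexing broadly consists of two phases:

* Index creation (indexer): extracting information from structured and unstructured data and building an index from it

* Index search (search): taking the user's query, searching the index that was built, and returning the results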
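
To make the two phases concrete, here is a toy sketch in shell and awk. It assumes a made-up input format of one document per line (`<doc id> <text>`) with space-separated words; `build_index` and `search_index` are hypothetical helper names, and this is not how Sphinx stores its index internally.

```shell
# Toy inverted index; input lines look like "<doc_id> <word> <word> ...".
# Assumption: words are separated by spaces, which does not hold for Chinese text.

# Index creation: emit one "term<TAB>doc_id" posting per distinct term per document.
build_index() {
    awk '{
        for (i = 2; i <= NF; i++) {
            term = tolower($i)
            if (!seen[term, $1]++) print term "\t" $1
        }
    }'
}

# Index search: read postings on stdin and print the ids of documents containing $1.
search_index() {
    awk -F '\t' -v term="$1" '$1 == term { print $2 }'
}
```

For example, `printf '1 mysql full text search\n2 sphinx search server\n' | build_index | search_index search` prints the ids `1` and `2`. Chinese text has no spaces between words, so this whitespace split would not work there; that gap is what the mmseg segmenter installed below fills.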

## Compiling and installing Sphinx + mmseg

### 0. Install the build dependencies

```

yum install make gcc gcc-c++ libtool autoconf automake imake mysql-devel libxml2-devel expat-devel

```

### Download and unpack the stable source package

```

[root@localhost.localdomain /usr/local/src]

# wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz

[root@localhost.localdomain /usr/local/src]

# tar xf coreseek-3.2.14.tar.gz

[root@localhost.localdomain /usr/local/src]

# cd coreseek-3.2.14

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14]

# ls

csft-3.2.14(sphinx) mmseg-3.2.14 README.txt testpack

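```

Where:

* csft-3.2.14 is Sphinx, modified to work with Chinese text

* mmseg-3.2.14 is the Chinese word segmentation plugin

* testpack is a package of test material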

### [Install mmseg](http://www.coreseek.cn/products/products-install/install_on_bsd_linux/)

#### cd mmseg

```

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14]

# cd mmseg-3.2.14/

```

#### Run the bootstrap script

```

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14/mmseg-3.2.14]

# ./bootstrap

```

#### ./configure --prefix=/usr/local/mmseg

```

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14/mmseg-3.2.14]

# ./configure --prefix=/usr/local/mmseg

```

#### make && make install

```

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14/mmseg-3.2.14]

# make && make install

```

### Install Coreseek

```

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14/csft-3.2.14]

# ./buildconf.sh

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14/csft-3.2.14]

# ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg/lib/ --with-mysql

[root@localhost.localdomain /usr/local/src/coreseek-3.2.14/csft-3.2.14]

# make && make install

```
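
Before moving on, it can help to confirm that the build actually produced the tools used later in this article. A minimal sketch, assuming the `/usr/local/coreseek` prefix from the `configure` line above; `check_coreseek_bins` is a hypothetical helper name, not part of Coreseek.

```shell
# Hypothetical helper: report whether the tools used later in this article
# exist and are executable under the given install prefix.
check_coreseek_bins() {
    prefix=${1:-/usr/local/coreseek}   # assumed default: the --prefix used above
    rc=0
    for bin in indexer searchd search; do
        if [ -x "$prefix/bin/$bin" ]; then
            echo "ok: $prefix/bin/$bin"
        else
            echo "missing: $prefix/bin/$bin"
            rc=1
        fi
    done
    return $rc
}
```

Calling `check_coreseek_bins` with no argument checks the default prefix; a non-zero return status means at least one binary is missing and the `make && make install` output is worth rereading.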

## Using Sphinx

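> 1. Data source: tells Sphinx which data to query, i.e. which data to build the index from (multiple sources can be defined)

> 2. Index configuration: which source to index, which directory the index files go in, and so on

> 3. Search server: Sphinx can listen on a port (9312 by default) and talk to external programs using its own protocol.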
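
For point 3, here is a minimal sketch of starting and stopping the search daemon once the configuration and index from the following steps exist. It assumes the `/usr/local/coreseek` prefix used during installation and a `searchd` section in `sphinx.conf` (the `sphinx.conf.dist` template should already contain one); `start_searchd` and `stop_searchd` are hypothetical wrapper names.

```shell
# Hypothetical wrappers around searchd; the default prefix is an assumption
# taken from the --prefix used when building Coreseek.

# Start the daemon with the given (or default) installation's config file.
start_searchd() {
    prefix=${1:-/usr/local/coreseek}
    "$prefix/bin/searchd" -c "$prefix/etc/sphinx.conf"
}

# Ask the running daemon that uses the same config file to shut down.
stop_searchd() {
    prefix=${1:-/usr/local/coreseek}
    "$prefix/bin/searchd" -c "$prefix/etc/sphinx.conf" --stop
}
```

The `search` command-line tool used at the end of this article reads the index files directly, so the daemon is only needed when external programs query Sphinx over the network.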

**Configure the data source**

```

[root@localhost.localdomain /usr/local/coreseek/etc]

# cp sphinx.conf.dist sphinx.conf

[root@localhost.localdomain /usr/local/coreseek/etc]

# vim sphinx.conf

```

Configure it as follows:

source src1 {

type = mysql

sql_host = localhost

sql_user = root

sql_pass = aaaaaa

sql_db = test

sql_query_pre = set names utf8

sql_query_pre = set session query_cache_type=off

sql_query = `select a_id as id,cat_id,title,simtitle,seotitle,tags,source,description,content,dateline,editdateline from article`

sql_attr_uint = cat_id

sql_attr_timestamp = dateline

sql_attr_timestamp = editdateline

sql_query_info = `SELECT * FROM article WHERE a_id=$id`

}

**Typical index configuration**

> index test1 {

> source = src1

> path = /usr/local/sphinx/var/data/test1 # where the generated index files are stored

> # stopwords = G:\data\stopwords.txt

> # wordforms = G:\data\wordforms.txt

> # exceptions = /data/exceptions.txt

> charset_dictpath = /usr/local/mmseg/etc/

> charset_type = zh_cn.utf-8

> }
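
One thing worth checking before running `indexer`: the `path` value is a file-name prefix (Sphinx appends extensions such as `.spi` and `.spd`), and as far as I know `indexer` does not create missing directories. A minimal sketch that creates the parent directory; `ensure_index_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: create the parent directory of an index `path` value
# and print it. `path` is a file-name prefix (e.g. .../data/test1), not a
# directory itself, so only its dirname is created.
ensure_index_dir() {
    index_path=$1
    dir=$(dirname "$index_path")
    mkdir -p "$dir" && echo "$dir"
}
```

With the configuration above this would be `ensure_index_dir /usr/local/sphinx/var/data/test1`, run as a user allowed to write under `/usr/local`. That path sits outside the `/usr/local/coreseek` install prefix, so it presumably was not created by `make install`.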

**Generate the index files**

```

[root@localhost.localdomain /usr/local/coreseek/etc]

# /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf test1   # test1 is the index name

Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]

Copyright (c) 2007-2011,

Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file '/usr/local/coreseek/etc/sphinx.conf'...

indexing index 'test1'...

collected 8122 docs, 47.6 MB

sorted 8.7 Mhits, 100.0% done

total 8122 docs, 47596333 bytes

total 17.782 sec, 2676636 bytes/sec, 456.75 docs/sec

total 5 reads, 0.011 sec, 4559.8 kb/call avg, 2.3 msec/call avg

total 58 writes, 0.429 sec, 903.8 kb/call avg, 7.3 msec/call avg

```

> **Error note:**

> /usr/local/coreseek/bin/indexer: error while loading shared libraries: **libmysqlclient.so.18**: cannot open shared object file: No such file or directory

> The **sphinx** `indexer` binary cannot find the shared library `libmysqlclient.so.18` that it depends on. Fix this by editing `/etc/ld.so.conf`:

> `vi /etc/ld.so.conf`

> Append the following line to the end of the file and save it:

> `/usr/local/mysql/lib`

> Then run the following command:

> `ldconfig`
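
To see which libraries are affected before and after the fix, the output of `ldd` can be filtered. A small sketch; `missing_libs` is a hypothetical helper name, and it assumes the usual glibc `ldd` output format in which unresolved entries read `name => not found`.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the names of
# shared libraries that the dynamic linker could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Typical use would be `ldd /usr/local/coreseek/bin/indexer | missing_libs`. If it prints `libmysqlclient.so.18`, something like `find / -name 'libmysqlclient.so.18*' 2>/dev/null` shows which directory to add to `/etc/ld.so.conf`; `/usr/local/mysql/lib` above is simply where the library lived on this machine.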

**Test a query from the command line**

```

[root@localhost.localdomain /usr/local/coreseek]

# ./bin/search -c etc/sphinx.conf 留学

```
