字符串根据词性拆分分词
在工作中有的时候需要对人工填写的地址与数据库的地址进行匹配,有的时候需要对地址进行拆分,将地址规整。但多数情况下,人工填写是不规范的,因此一个根据词性拆分的方法,以便在拆分时能起到辅助作用。 如:南塘浜路11-15号 (虚拟)
序号 分词 词性 位置 1 南塘浜路 cn 1 2 11 num 5 3 - symbol 7 4 15 num 8 5 号 cn 10
通过拆分,可以获得“-”前后的数字和文本,这样也便于将字符串规整为连续型地址,如:南塘浜路11号,南塘浜路12号…南塘浜路15号等。注:本文仅介绍拆分成分词的方法。
1/思路
(拿中文文本举例) 如下表所示,对每个字符的词性进行判断,如果该字符不是中文(则不算其位置),用字符原来的位置-判断字符为中文词性后的位置,得出来的结果就可以将字符串归类并拼接成分词。
字符 原顺序 中文字符再次排序 分词归类 南 1 1 1(=1-1+1) 塘 2 2 1(=2-2+1) 浜 3 3 1(=3-3+1) 路 4 4 1(=4-4+1) 1 5 - - 1 6 - - - 7 - - 1 8 - - 5 9 - - 号 10 5 6(=10-5+1)
2/创建表值函数-生成序列表
create function [ dbo] . [ fn_serial_numer]
( @n int )
returns table as
return
with t1 as ( select 1 n union all select 1 )
, t2 as ( select 1 n from t1, t1 a, t1 b, t1 c)
, t3 as ( select 1 n from t2, t2 a, t2 b, t2 c)
, t4 as ( select 1 n from t3, t3 a)
select top ( @n ) row_number( ) over ( order by ( select 1 ) ) n from t4
order by n
select substring( '南塘浜路11-15号' , n, 1 ) ch, n
from fn_serial_numer( len ( '南塘浜路11-15号' ) )
3/创建表值函数-字符串根据词性拆分成分词
create function [ dbo] . [ fn_split_by_property]
(
@str varchar ( max)
)
returns @table table (
pkid int
, keys varchar ( max)
, property varchar ( 32 )
, indexs int )
as
begin
declare @tmptable table ( pkid int , keys varchar ( max) , property varchar ( 32 ) , indexs int )
; with t0 as (
select substring( @str , n, 1 ) ch, n
from fn_serial_numer( len ( @str ) )
)
, t1 as (
select ch
, n
, id= row_number( ) over ( order by n)
, rid= n- row_number( ) over ( order by n) + 1
, 'cn' property
from t0
where unicode( ch) between 19968 and 40869
)
, t2 as (
select ch, n
, id= row_number( ) over ( order by n)
, rid= n- row_number( ) over ( order by n) + 1
, 'num' property
from t0
where ch like '%[0-9]%'
)
, t3 as (
select ch, n
, id= row_number( ) over ( order by n)
, rid= n- row_number( ) over ( order by n) + 1
, 'en' property
from t0
where ch like '%[a-zA-Z]%'
)
, t4 as (
select ch, n
, id= row_number( ) over ( order by n)
, rid= n- row_number( ) over ( order by n) + 1
, 'symbol' property
from t0 a
where not exists ( select * from t1 where a. ch= ch)
and not exists ( select * from t2 where a. ch= ch)
and not exists ( select * from t3 where a. ch= ch)
)
, tmp as (
select * from t1 union
select * from t2 union
select * from t3 union
select * from t4
)
insert into @tmptable
( pkid, keys , property, indexs )
select pkid, keys , property, indexs
from (
select row_number( ) over ( order by ( select 0 ) ) pkid
, rid
, keys = ( select '' + ch from tmp b where b. rid= a. rid and b. property= a. property order by b. n for xml path( '' ) )
, a. property
, 0 indexs
from tmp a
group by rid, a. property
) tmp
declare @i int , @str_tmp varchar ( max) , @keys varchar ( max) , @index int
set @i = 1
set @str_tmp = @str
set @keys = ''
set @index = 1
while @i <= ( select max ( pkid) from @tmptable )
begin
set @keys = ( select keys from @tmptable where pkid= @i )
set @index = charindex( @keys , @str_tmp )
update @tmptable
set indexs= @index
where pkid= @i
set @str_tmp = stuff ( @str_tmp , @index , len ( @keys ) , {fn repeat ( ' ' , len ( @keys ) ) })
set @i = @i + 1
end
insert @table
( pkid, keys , property, indexs )
select dense_rank( ) over ( order by indexs) pkid
, keys , property, indexs
from @tmptable
return
end
select * from dbo. fn_split_by_property( '南塘浜路9009号,南塘浜路9999号' )
序号 分词 词性 位置 1 南塘浜路 cn 1 2 9009 num 5 3 号 cn 9 4 , symbol 10 5 南塘浜路 cn 11 6 9999 num 15 7 号 cn 19