ICTCLAS Word Segmentation System Study (6): Obtaining the Initial Segmentation Result

princes_fan

Original article: http://blog.csdn.net/sinboy/article/details/1637327

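We continue with the example sentence “他说的确实在理” (roughly, "what he said is indeed reasonable"). After NshortPath has processed it, we obtain the N shortest paths through the bigram segmentation graph, as follows: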

Initial word segmentation graph (始##始 and 末##末 are the sentence-start and sentence-end markers):

	1	2	3	4	5	6	7	8	9
0	始##始
1		他
2			说
3				的	的确
4					确	确实
5						实	实在
6							在	在理
7								理
8									末##末

Initial bigram segmentation graph:

	1	2	3	4	5	6	7	8	9	10	11	12
0	始##始@他
1		他@说
2			说@的	说@的确
3					的@确	的@确实
4							的确@实	的确@实在
5							确@实	确@实在
6									确实@在	确实@在理
7									实@在	实@在理
8											实在@理
9											在@理
10												在理@末##末
11												理@末##末
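
The relationship between the two tables can be sketched in a few lines of C++. This is only an illustration under assumptions, not the ICTCLAS code: the `Word` struct and the helpers `exampleWords` and `buildBigramEdges` are hypothetical names. It assumes the cells of the word graph are numbered in row-major order and that the bigram graph has an entry `A@B` whenever word `A` ends at the vertex where word `B` starts.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One cell of the word graph: a candidate word spanning vertices [start, end).
struct Word {
    int start;         // row of the cell in the word graph
    int end;           // column of the cell in the word graph
    std::string text;  // the candidate word itself
};

// Hypothetical helper: the cells of the example word graph in row-major order.
// The position in this vector is the word's row/column number in the bigram graph.
std::vector<Word> exampleWords() {
    return {
        {0, 1, "始##始"}, {1, 2, "他"},   {2, 3, "说"},
        {3, 4, "的"},     {3, 5, "的确"}, {4, 5, "确"},
        {4, 6, "确实"},   {5, 6, "实"},   {5, 7, "实在"},
        {6, 7, "在"},     {6, 8, "在理"}, {7, 8, "理"},
        {8, 9, "末##末"},
    };
}

// Returns the (row, column) positions of the non-empty bigram cells:
// word i can be followed by word j when i ends where j starts.
std::vector<std::pair<int, int>> buildBigramEdges(const std::vector<Word>& words) {
    std::vector<std::pair<int, int>> edges;
    for (int i = 0; i < static_cast<int>(words.size()); ++i) {
        assert(words[i].start < words[i].end);  // every word covers at least one atom
        for (int j = 0; j < static_cast<int>(words.size()); ++j) {
            if (words[i].end == words[j].start) {
                edges.emplace_back(i, j);
            }
        }
    }
    return edges;
}
```

Calling `buildBigramEdges(exampleWords())` yields one pair per filled cell of the bigram table, for example (3, 5) for 的@确 and (3, 6) for 的@确实.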

Initial bigram segmentation paths:

No.	Bigram segmentation path
0	0 1 2 3 6 9 11 12
1	0 1 2 4 7 9 11 12
2	0 1 2 3 5 7 9 11 12

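The numbers 0 1 2 3 6 9 11 12 are column indices in the bigram segmentation graph above. In that graph, each column corresponds to the word after the @, and each row corresponds to the position, in the word graph, of the word before the @. Once we have a bigram segmentation path, we can recover the actual segmentation path with a simple conversion based on the correspondence between the word graph and the bigram graph.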
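
The conversion described above can be sketched as follows. This is a minimal illustration, not the actual `BiPath2UniPath` from ICTCLAS: the `Span` struct, `kExampleSpans`, and `biPathToUniPath` are hypothetical names, and the sketch assumes each bigram index identifies a word-graph cell in row-major order. Each word contributes its start vertex, and the last word also contributes its end vertex.

```cpp
#include <cassert>
#include <vector>

// Start and end vertex of one word-graph cell (its row and column).
struct Span {
    int start;
    int end;
};

// Cells of the example word graph in row-major order; the position in this
// vector is the index used in the bigram segmentation paths.
const std::vector<Span> kExampleSpans = {
    {0, 1}, {1, 2}, {2, 3}, {3, 4}, {3, 5}, {4, 5}, {4, 6},
    {5, 6}, {5, 7}, {6, 7}, {6, 8}, {7, 8}, {8, 9}};

// Converts a path of word indices (bigram path) into a path of vertex
// indices in the word graph (segmentation path).
std::vector<int> biPathToUniPath(const std::vector<Span>& spans,
                                 const std::vector<int>& biPath) {
    std::vector<int> uniPath;
    for (int wordIndex : biPath) {
        assert(wordIndex >= 0 && wordIndex < static_cast<int>(spans.size()));
        uniPath.push_back(spans[wordIndex].start);
    }
    if (!biPath.empty()) {
        // Close the path with the end vertex of the final word.
        uniPath.push_back(spans[biPath.back()].end);
    }
    return uniPath;
}
```

For path 0 (0 1 2 3 6 9 11 12) this gives the vertex sequence 0 1 2 3 4 6 7 8 9: the jump from 4 to 6 is the two-character word 确实.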

In the source code, this is done by the following snippet:

    while (i < m_nSegmentCount)
    {
        // Convert the bigram segmentation path into a segmentation path
        BiPath2UniPath(nSegRoute[i]);
        // Generate the segmentation result from the segmentation path
        GenerateWord(nSegRoute, i);
        i++;
    }

Initial segmentation results:

No.	Segmentation result
0	他/ 说/ 的/ 确实/ 在/ 理/
1	他/ 说/ 的确/d 实/ 在/ 理/
2	他/ 说/ 的/ 确/ 实/ 在/ 理/
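
The core step behind these results, joining the atoms that lie between two consecutive vertices of a segmentation path, can be sketched as below. It is a simplified illustration, not the ICTCLAS implementation: `wordsFromUniPath` is a hypothetical name, the atoms are held in a `std::vector<std::string>` rather than in `m_graphSeg.m_sAtom`, and the number, time, and date handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds one word per pair of consecutive vertices by concatenating the
// atoms in the half-open range [uniPath[i], uniPath[i + 1]).
std::vector<std::string> wordsFromUniPath(const std::vector<std::string>& atoms,
                                          const std::vector<int>& uniPath) {
    std::vector<std::string> words;
    for (std::size_t i = 0; i + 1 < uniPath.size(); ++i) {
        assert(uniPath[i] < uniPath[i + 1]);  // vertices must strictly increase
        std::string word;
        for (int j = uniPath[i]; j < uniPath[i + 1]; ++j) {
            word += atoms.at(j);  // at() throws if a vertex is out of range
        }
        words.push_back(word);
    }
    return words;
}
```

With the atoms 始##始, 他, 说, 的, 确, 实, 在, 理, 末##末 and the vertex path 0 1 2 3 4 6 7 8 9, the function returns the words of result 0, with the sentence-start and sentence-end markers at either end.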

Note that GenerateWord() handles some special cases before it produces the segmentation result, mainly merging and splitting results that involve numbers, times, and dates:

    // Generate words according to the segmentation route
    bool CSegment::GenerateWord(int **nSegRoute, int nIndex)
    {
        unsigned int i=0,k=0;
        int j,nStartVertex,nEndVertex,nPOS;
        char sAtom[WORD_MAXLENGTH],sNumCandidate[100],sCurWord[100];
        ELEMENT_TYPE fValue;
        while(nSegRoute[nIndex][i]!=-1&&nSegRoute[nIndex][i+1]!=-1&&nSegRoute[nIndex][i]<nSegRoute[nIndex][i+1])
        {
            nStartVertex=nSegRoute[nIndex][i];
            j=nStartVertex;//Set the start vertex
            nEndVertex=nSegRoute[nIndex][i+1];//Set the end vertex
            nPOS=0;
            m_graphSeg.m_segGraph.GetElement(nStartVertex,nEndVertex,&fValue,&nPOS);
            sAtom[0]=0;
            while(j<nEndVertex)
            {//Generate the word according to the segmentation route
                strcat(sAtom,m_graphSeg.m_sAtom[j]);
                j++;
            }
            m_pWordSeg[nIndex][k].sWord[0]=0;//Init the result ending
            strcpy(sNumCandidate,sAtom);
            //Find runs of consecutive numbers
            while(sAtom[0]!=0&&(IsAllNum((unsigned char *)sNumCandidate)||IsAllChineseNum(sNumCandidate)))
            {//Merge all separate consecutive numbers into one number
                //sAtom[0]!=0: added on 2002-5-9
                strcpy(m_pWordSeg[nIndex][k].sWord,sNumCandidate);
                //Save them in the result segmentation
                i++;//Skip to next atom now
                sAtom[0]=0;
                while(j<nSegRoute[nIndex][i+1])
                {//Generate the word according to the segmentation route
                    strcat(sAtom,m_graphSeg.m_sAtom[j]);
                    j++;
                }
                strcat(sNumCandidate,sAtom);
            }
            unsigned int nLen=strlen(m_pWordSeg[nIndex][k].sWord);
            if((nLen==4&&CC_Find("第上成±—＋∶·．／",m_pWordSeg[nIndex][k].sWord))||(nLen==1&&strchr("+-./",m_pWordSeg[nIndex][k].sWord[0])))
            {//Only one word
                strcpy(sCurWord,m_pWordSeg[nIndex][k].sWord);//Record current word
                i--;
            }
            else if(m_pWordSeg[nIndex][k].sWord[0]==0)//The while loop above was never entered
            {
                strcpy(m_pWordSeg[nIndex][k].sWord,sAtom);
                //Save them in the result segmentation
                strcpy(sCurWord,sAtom);//Record current word
            }
            else
            {//It is a number
                if(strcmp("－－",m_pWordSeg[nIndex][k].sWord)==0||strcmp("—",m_pWordSeg[nIndex][k].sWord)==0||(m_pWordSeg[nIndex][k].sWord[0]=='-'&&m_pWordSeg[nIndex][k].sWord[1]==0))//The delimiter "－－"
                {
                    nPOS=30464;//'w'*256; set the POS to 'w'
                    i--;//Not a number, back to previous word
                }
                else
                {//Adding time suffix
                    char sInitChar[3];
                    unsigned int nCharIndex=0;//Get first char
                    sInitChar[nCharIndex]=m_pWordSeg[nIndex][k].sWord[nCharIndex];
                    if(sInitChar[nCharIndex]<0)
                    {
                        nCharIndex+=1;
                        sInitChar[nCharIndex]=m_pWordSeg[nIndex][k].sWord[nCharIndex];
                    }
                    nCharIndex+=1;
                    sInitChar[nCharIndex]='
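
The idea behind the number-merging loop in GenerateWord() can be shown in isolation. The sketch below is a deliberately simplified stand-in, not the ICTCLAS logic: `isAsciiNumber` and `mergeNumberRuns` are hypothetical names, only ASCII digits are recognized (the original uses `IsAllNum` and `IsAllChineseNum`, which also cover full-width and Chinese numerals), and it works on an already split token list instead of on vertex paths.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Simplified stand-in for IsAllNum: true when the token is a non-empty
// string of ASCII digits.
bool isAsciiNumber(const std::string& token) {
    if (token.empty()) {
        return false;
    }
    for (unsigned char c : token) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    return true;
}

// Merges every run of adjacent numeric tokens into a single token and
// leaves all other tokens unchanged.
std::vector<std::string> mergeNumberRuns(const std::vector<std::string>& tokens) {
    std::vector<std::string> merged;
    for (const std::string& token : tokens) {
        const bool extendsRun = isAsciiNumber(token) && !merged.empty() &&
                                isAsciiNumber(merged.back());
        if (extendsRun) {
            merged.back() += token;  // grow the current number
        } else {
            merged.push_back(token);
        }
    }
    return merged;
}
```

For example, the tokens 在, 20, 02, 年 become 在, 2002, 年, which is the kind of merge the loop above performs before the time-suffix handling starts.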
