PIES(Personal information Extraction System) -(Developing)

Summery:

 

Part 1: try to integrate ICTCLAS30 part-of-speech toolkit and CRFPP analysis toolkit.

Part 2: Named Entities relation extraction.

Part 1:

Input and output:

The input: text-free paragraph.

The output: Named Entities

 

Pipeline

 Training:

 

It relates to two open-sources Toolkit: ICTCLAS and CRF++.

ICTCLAS is an Chinese lexical analysis system ( Institute of Computing Technology , Chinese Lexical Analysis System) .It is based on multi-layer HMM. Part-Of-Speech tagging’ precision is showed much higher than other. Chinese person names achieve nearly 98%.

 

Another Toolkit is CRF++. An open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.

Processing:

 

 

 

Example:

Input : 陈文骥毕业于北京大学数学系。现为犹科公司副行政总裁

Output:

 

We can see person name (陈文骥) and organization(北京大学数学系) are extracted.

Both of them will be the candidates of the next task (Named entities relation extraction) source.

 

Because of limitation of the training Dataset and some new patterns, 犹科公司 (organization)

and 副行政总裁(Position) do not be recognized.

       O

           O

公司       O

The expected output is:

      B_ORG                              

           I_ORG                               

公司       I_ORG    

 

B_ORG  :     Begin of the organization term 

I_ORG  :      Begin of the organization term 

O  :              not relative to term 

 

Now we can output the n-best result (probability of the all possible output tags)

 

 

  

We can see the correct tag ranks after the “O” and much bigger than the third one.

Though having limitation of Training Dataset, we can improve the system using n-best candidate.

Some heuristic method will smooth the result.

 

Accuracy of Person name in ICTCLAS (HMM-base) is high, but as for the location and organization are poor .So we take advantage of ICTCLAS and CRF and CRF only forces on organization name and address. 

 

Part 2:

 

For Named Entities relation extraction, I think there two three ways to dead with it.

 

 

First, Keyword .Detect the keyword between the each named entities pair.

 

Second, using the previous named entity output tags as input for CRF(need another training Dataset)     

 

Third, Levenshtein distance. Using this approach to calculate similarity of sentence ( Temple is needed).

 

“In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.”

-           http://en.wikipedia.org/wiki/Levenshtein_distance

 

 

上面好像很多语法错误 -_-!

其实觉得精度一定计较差,因为不是只对一个方面,这是个系统,技术含量不高

 

人家单单研究怎样提高识别机构名就可以发表论文了,所以了.....不要BS这个PIES 它只是个整合型系统罢了

    

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值