Summery:
Part 1: try to integrate ICTCLAS30 part-of-speech toolkit and CRFPP analysis toolkit.
Part 2: Named Entities relation extraction.
Part 1:
Input and output:
The input: text-free paragraph.
The output: Named Entities
Pipeline:
Training:
It relates to two open-sources Toolkit: ICTCLAS and CRF++.
ICTCLAS is an Chinese lexical analysis system ( Institute of Computing Technology , Chinese Lexical Analysis System) .It is based on multi-layer HMM. Part-Of-Speech tagging’ precision is showed much higher than other. Chinese person names achieve nearly 98%.
Another Toolkit is CRF++. An open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
Processing:
Example:
Input : 陈文骥毕业于北京大学数学系。现为犹科公司副行政总裁
Output:
We can see person name (陈文骥) and organization(北京大学数学系) are extracted.
Both of them will be the candidates of the next task (Named entities relation extraction) source.
Because of limitation of the training Dataset and some new patterns, 犹科公司 (organization)
and 副行政总裁(Position) do not be recognized.
犹 O
科 O
公司 O
The expected output is:
犹 B_ORG
科 I_ORG
公司 I_ORG
B_ORG : Begin of the organization term
I_ORG : Begin of the organization term
O : not relative to term
Now we can output the n-best result (probability of the all possible output tags)
We can see the correct tag ranks after the “O” and much bigger than the third one.
Though having limitation of Training Dataset, we can improve the system using n-best candidate.
Some heuristic method will smooth the result.
Accuracy of Person name in ICTCLAS (HMM-base) is high, but as for the location and organization are poor .So we take advantage of ICTCLAS and CRF and CRF only forces on organization name and address.
Part 2:
For Named Entities relation extraction, I think there two three ways to dead with it.
First, Keyword .Detect the keyword between the each named entities pair.
Second, using the previous named entity output tags as input for CRF(need another training Dataset)
Third, Levenshtein distance. Using this approach to calculate similarity of sentence ( Temple is needed).
“In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.”
- http://en.wikipedia.org/wiki/Levenshtein_distance
上面好像很多语法错误 -_-!
其实觉得精度一定计较差,因为不是只对一个方面,这是个系统,技术含量不高
人家单单研究怎样提高识别机构名就可以发表论文了,所以了.....不要BS这个PIES 它只是个整合型系统罢了