Spear Parser (2): EdgeLexer, the Treebank Token Reader Class


weixin_34067102


Original link: http://blog.51cto.com/snowteng17/541517


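Penn Treebank annotation example

The most basic step in training a syntactic model is extracting rules from a treebank. Rules are built from nonterminals, words, and similar information, so the first step of training is being able to extract that information. A tree in the Penn Treebank WSJ .mrg annotation style looks like this.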

 
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
   

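Clearly, this tree notation is made up of three kinds of symbols: the left parenthesis (, the right parenthesis ), and strings such as S, NP-SBJ, and director.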
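To make the three symbol kinds concrete, here is a minimal sketch that splits an in-memory bracketed tree into tokens. It is not Spear code: the names `Kind`, `Token`, and `tokenize` are hypothetical and chosen only for this illustration, and it assumes plain `char` input with the usual ASCII whitespace.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds for this sketch (not Spear's own constants).
enum class Kind { LeftParen, RightParen, String };

using Token = std::pair<Kind, std::string>;

// Split a bracketed tree into tokens: "(", ")", or a maximal run of
// characters that are neither whitespace nor parentheses.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const char c = input[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // whitespace only separates tokens
        } else if (c == '(') {
            tokens.emplace_back(Kind::LeftParen, "(");
            ++i;
        } else if (c == ')') {
            tokens.emplace_back(Kind::RightParen, ")");
            ++i;
        } else {
            const std::size_t start = i;
            while (i < input.size() && input[i] != '(' && input[i] != ')' &&
                   !std::isspace(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.emplace_back(Kind::String, input.substr(start, i - start));
        }
    }
    return tokens;
}
```

Calling `tokenize("(NP (DT the) (NN board) )")` yields a left parenthesis, the string NP, and so on. Punctuation such as the comma in `(, ,)` is an ordinary string token, appearing once as a tag and once as a word.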

EdgeLexer, the treebank token reader class

Spear provides a class, EdgeLexer, to read these three kinds of tokens. Because the input comes from a file, it also adds a fourth token that marks the end of input. During model training this class is responsible for reading these four token types, and when the token is a string it also returns the text that was read.

The declaration of EdgeLexer is shown below.

    class EdgeLexer
    {
    public:
        /* Token types */
        static const int TOKEN_EOF    = 0; /* end of input */
        static const int TOKEN_STRING = 1; /* string */
        static const int TOKEN_LP     = 2; /* left parenthesis */
        static const int TOKEN_RP     = 3; /* right parenthesis */

        EdgeLexer(IStream &);

        /* Core function: read the next token and return its type */
        int lexem(String &);

        int getLineCount() const { return _lineCount; }

    private:
        /** The stream */
        IStream & _stream;

        /** Line count */
        int _lineCount;

        /** Advance over white spaces */
        void skipWhiteSpaces();

        bool isSpace(Char c) const;
    };

As the declaration shows, EdgeLexer does three things: it counts lines, tests for whitespace, and reads tokens. Its implementation is shown below.

    /** Constructor */
    EdgeLexer::EdgeLexer(IStream & stream)
        : _stream(stream), _lineCount(1) {}

    /**
     * Test whether a character is whitespace.
     * @param c the character to test
     */
    bool EdgeLexer::isSpace(Char c) const
    {
        return c == W(' ') || c == W('\t') || c == W('\n') || c == W('\r');
    }

    /** Skip whitespace, counting newlines along the way */
    void EdgeLexer::skipWhiteSpaces()
    {
        Char c;
        while((c = _stream.get()) != EOF && isSpace(c)){
            if(c == W('\n')){
                _lineCount ++;
            }
        }
        _stream.unget();
    }

    /** Macro: true if c can be part of a string, i.e. it is neither
     *  whitespace nor a left or right parenthesis */
    #define STRING_CHAR(c) ( \
        (c) != W('(') && \
        (c) != W(')') && \
        ! isSpace(c) \
    )

    /**
     * Read one token and return its type, storing its content in text.
     * For parentheses and EOF only the type is returned; text is left untouched.
     * @param text receives the string that was read
     * @return TOKEN_EOF at end of input, TOKEN_STRING for a string,
     *         TOKEN_LP or TOKEN_RP for a parenthesis
     */
    int EdgeLexer::lexem(String & text)
    {
        skipWhiteSpaces();

        Char c = _stream.get();
        if(c == EOF){
            return TOKEN_EOF;
        } else if(c == W('(')){
            return TOKEN_LP;
        } else if(c == W(')')){
            return TOKEN_RP;
        } else if(STRING_CHAR(c)){
            OStringStream buffer;
            buffer << c;
            // Compare against EOF here. The original compared against
            // TOKEN_EOF (which is 0), so a string that ended exactly at the
            // end of the input was never terminated.
            while((c = _stream.get()) != EOF && STRING_CHAR(c)){
                buffer << c;
            }
            text = buffer.str();
            _stream.unget();
            return TOKEN_STRING;
        }

        // should never get here
        return TOKEN_EOF;
    }
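EdgeLexer depends on Spear's own `IStream`, `String`, `Char`, and `W` definitions, so it cannot be compiled in isolation. The following is a minimal sketch of the same logic using only the standard library, assuming narrow `char` streams. `MiniLexer` and `stringTokens` are hypothetical names introduced for this illustration and are not part of Spear. The sketch uses `peek()` instead of `get()` followed by `unget()`, and keeps each character as an `int_type` until it is known not to be end of input, so EOF cannot be confused with a real character.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// MiniLexer is a hypothetical, standard-library-only stand-in for EdgeLexer.
class MiniLexer {
public:
    enum Token { TOKEN_EOF = 0, TOKEN_STRING = 1, TOKEN_LP = 2, TOKEN_RP = 3 };

    explicit MiniLexer(std::istream& stream) : stream_(stream), lineCount_(1) {}

    // Read the next token; text is only written for TOKEN_STRING.
    int lexem(std::string& text) {
        using Traits = std::istream::traits_type;
        skipWhiteSpaces();
        const Traits::int_type c = stream_.get();
        if (Traits::eq_int_type(c, Traits::eof())) return TOKEN_EOF;
        if (c == '(') return TOKEN_LP;
        if (c == ')') return TOKEN_RP;
        text.assign(1, Traits::to_char_type(c));
        // peek() lets us stop before a delimiter without needing unget().
        while (isStringChar(stream_.peek())) {
            text.push_back(Traits::to_char_type(stream_.get()));
        }
        return TOKEN_STRING;
    }

    int getLineCount() const { return lineCount_; }

private:
    static bool isSpace(int c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }
    static bool isStringChar(int c) {
        return c != std::istream::traits_type::eof() &&
               c != '(' && c != ')' && !isSpace(c);
    }
    // Consume whitespace and count the newlines that were skipped.
    void skipWhiteSpaces() {
        while (isSpace(stream_.peek())) {
            if (stream_.get() == '\n') ++lineCount_;
        }
    }

    std::istream& stream_;
    int lineCount_;
};

// Usage sketch: collect every string token from an in-memory fragment.
std::vector<std::string> stringTokens(const std::string& source) {
    std::istringstream in(source);
    MiniLexer lexer(in);
    std::vector<std::string> out;
    std::string text;
    int kind;
    while ((kind = lexer.lexem(text)) != MiniLexer::TOKEN_EOF) {
        if (kind == MiniLexer::TOKEN_STRING) out.push_back(text);
    }
    return out;
}
```

A string token that ends exactly at the end of the input, as in `stringTokens("(NN director")`, terminates normally, because the loop condition tests for end of input before anything else.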

EdgeLexer usage example

Read a treebank file and print every nonterminal and word it contains.

    #include <cstdio>
    #include <fstream>
    #include <iostream>
    #include <string>
    // Also include the Spear header that declares EdgeLexer.
    using namespace std;

    int main(int argc, char **argv)
    {
        if(argc != 2){
            printf("[Usage]:%s [treebank]\n", argv[0]);
            return 1;
        }
        ifstream is(argv[1]);
        EdgeLexer lex(is);
        string text;
        int l;
        while((l = lex.lexem(text)) != EdgeLexer::TOKEN_EOF){
            if(l == EdgeLexer::TOKEN_STRING){ // print string tokens only
                cout << " " << text << endl;
            }
        }
        return 0;
    }
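The example prints nonterminals and words mixed together. One way to tell them apart, sketched here under the assumption that the input follows the bracketing shown earlier, is to treat a string that comes right after a left parenthesis as a label and every other string as a word. `Vocabulary` and `collectVocabulary` are hypothetical names for this illustration, and the function scans characters directly rather than using EdgeLexer so that it stays self-contained.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical container for the two kinds of strings in a treebank.
struct Vocabulary {
    std::set<std::string> labels;  // first string after a '(' (tags, phrases)
    std::set<std::string> words;   // all other strings (leaf words)
};

Vocabulary collectVocabulary(std::istream& in) {
    Vocabulary vocab;
    std::string current;         // string token being accumulated
    bool labelPosition = false;  // true after '(' until the next string ends

    // Store the finished string token, if any, in the matching set.
    auto flush = [&]() {
        if (current.empty()) return;
        (labelPosition ? vocab.labels : vocab.words).insert(current);
        current.clear();
        labelPosition = false;
    };

    char c;
    while (in.get(c)) {
        if (c == '(') {
            flush();
            labelPosition = true;   // the next string names this node
        } else if (c == ')') {
            flush();
            labelPosition = false;
        } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            flush();                // whitespace ends the current string
        } else {
            current.push_back(c);
        }
    }
    flush();  // a string may end exactly at end of input
    return vocab;
}
```

To try it without a file, wrap a fragment in a `std::istringstream` and pass it to `collectVocabulary`. Note that in `(, ,)` the first comma lands in `labels` (it is the tag) and the second in `words`.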

Reposted from: https://blog.51cto.com/snowteng17/541517
