Chinese Part-of-Speech Tagging

pku_zzy

Article link: https://blog.csdn.net/pku_zzy/article/details/56678445

I recently wanted to practice Chinese part-of-speech tagging, so I found a dataset to work with: the People's Daily PKU corpus.

Dataset

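The dataset comes from the Institute of Computational Linguistics at Peking University, which annotated sentences from the January 1998 issues of People's Daily with part-of-speech tags. Each token is written as word/tag, and tokens are separated by spaces: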

19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/q ）/w

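The dataset provides 19,484 sentences in total for training and testing. I preprocessed the corpus by removing irregular tokens that do not contain a '/', so that the annotations are more consistent.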
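As a rough sketch of what this format and the cleanup step imply, the snippet below splits one corpus line into (word, tag) pairs and skips any token without a '/'. This is not the author's preprocessing code; `parse_line` is a hypothetical helper name introduced here for illustration, and it assumes tokens are whitespace-separated with the tag following the last '/'.

```python
def parse_line(line):
    """Split one corpus line into (word, tag) pairs.

    Assumption: tokens are separated by whitespace and each valid token
    has the form word/tag. Tokens without '/' are skipped, mirroring the
    cleanup described above.
    """
    pairs = []
    for token in line.split():
        if '/' not in token:
            continue  # irregular token without a tag: drop it
        # Split on the last '/' so the tag is always the final field.
        word, tag = token.rsplit('/', 1)
        pairs.append((word, tag))
    return pairs


sample = "19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n"
print(parse_line(sample)[:3])
# [('19980101-01-001-001', 'm'), ('迈向', 'v'), ('充满', 'v')]
```

Note that the first token of each line is the sentence identifier, tagged /m; whether to keep it as a training token is a design choice left open here.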

Tool script

Since the dataset does not ship with any scripts, I implemented a tool script myself, tool.py.

tool.py

'''
	tool.py
	This is a tool for scoring a POS tagging algorithm.
	Score an algorithm by calling:
		> driver(trainLines, testLines, trainFunction, posTagFunction)
	The tool then prints an accuracy report.
	Note: There are 19484 sentences in data set 'data.txt'.
	
	Zhang Zhiyuan, EECS, Peking Univ. 2017/02/23
'''

class dataType:
	def __init__(self):
		self.__fd = open('data.txt'