ScorePP: An Automatic Word Segmentation Evaluation Program in Standard C++

Article link: https://blog.csdn.net/Harry_lyc/article/details/7458515


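Chinese word segmentation is a fundamental and key problem in natural language processing, and I have been working on it for the past year. At first I used Score, the segmentation scoring program written as a Perl script and provided by the SIGHAN Bakeoff. To integrate the evaluation program seamlessly with my segmenter, which is written in C++, and to analyze the causes of segmentation errors automatically, I rewrote the evaluation program in C++ using Score as a reference. I call it ScorePP.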

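I have gone a long time without a break because of segmentation work. At noon, walking out of the canteen at the Institute of Automation, I suddenly thought to check the calendar and found that today is my birthday. Looking back on the past half year, I have spent every day dealing with text and racking my brains to improve segmentation quality, so much so that I lost track of time. This birthday, like the others, I am spending alone. To mark the day, I am releasing the segmentation evaluation program I rewrote. If it helps anyone, I will be very glad. It is also a birthday present to myself.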

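A few points to note about this evaluation program:

1 It checks the segmentation result: the total amount of segmented data must match the gold standard, with nothing extra and nothing missing.

2 It exposes a complete set of evaluation metrics: precision, recall, F-measure, out-of-vocabulary (OOV) word rate, OOV recall, and in-vocabulary (IV) recall.

3 The segmentation delimiter can be either a space or a slash.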

The program source code follows:

Score.h

/********************************************************************
* Copyright (C) 2012 Li Yachao
* Contact: liyc7711@gmail.com or harry_lyc@foxmail.com
*
* Permission to use, copy, modify, and distribute this software for
* any non-commercial purpose is hereby granted without fee, provided
* that the above copyright notice appear in all copies and that both
* that copyright notice. 
* It is provided "as is" without express or implied warranty.
*
* Version: 0.1
* Last update: 2012-4-13
*********************************************************************/
/*********************************************************************
Used to analyze word segmentation results
*********************************************************************/
#ifndef SCORE_H
#define SCORE_H

#include <iomanip>
#include <iostream>
#include <fstream>
#include <vector>
#include <set>
#include <string>

namespace MyUtility
{
	struct ScoreItem
	{
		int GoldTotal;
		int TestTotal;
		int TestCorrect;
	};
	class Score
	{
	public:
		/*If the report or dictionary file path is empty, that file is not used*/
		Score(const std::string& gold_file,const std::string& test_file,
			  const std::string& dic_file ,const std::string& report_file="");
		Score();
		~Score();
		void Clear();
		double GetRecall();
		double GetPrecise();
		double GetFMeasure();
		int GetTrueWords();
		int GetTestWords();
		double GetTestOOVRate();/*proportion of out-of-vocabulary (OOV) words in the test corpus*/
		double GetOOVRecallRate();/*recall of OOV words*/
		double GetIVRecallRate();/*recall of in-vocabulary (IV) words*/
	private:
		std::ofstream fout;/*output file stream*/
		//std::ofstream fout1;/*output file stream*/
		std::string reportFile;/*report file path*/
		std::string goldFile;/*gold-standard file path*/
		std::string testFile;/*test file path*/
		std::string dictionaryFile;/*dictionary file path*/
		int totalOOVTokens ;/*number of OOV tokens*/
		int totalOOVCorrectTokens;/*number of correctly segmented OOV tokens*/
		int totalIVCorrectTokens;/*number of correctly segmented IV tokens*/
		std::vector<std::string>goldLines;/*text lines of the gold-standard file*/
		std::vector<std::string>testLines;/*text lines of the test file*/
		std::vector<struct ScoreItem> listScore;/*evaluation result for each line*/
		std::set<std::string> dictionaryList;/*dictionary data structure*/
		/*************************************************/
		bool IsPrefix(const std::string &src, const std::string &prefix);
		bool Postfix(const std::string &src, const std::string &postfix);
		bool Init();
		bool InitDict(const std::string& filePath);/*initialize the dictionary*/
		bool IsEntryExist(const char * entry);
		bool Parse(const std::vector<std::string>& gold_tokens,const std::vector<std::string>& test_tokens,struct ScoreItem& score);
		bool FileReader(const std::string& path,std::vector<std::string>& lines);
		void SplitByTokens(std::vector<std::string> &vecstr, const std::string &str, const std::string tokens[],const  int tokensnumber, bool withtoken);
		
	};
}
#endif
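
The header declares `SplitByTokens` and `Parse`, but their bodies are not shown in this part of the listing. As a rough illustration only, and not the author's implementation, the sketch below shows one common way to score a single line: split both lines on spaces or slashes, turn each word into a byte span over the undelimited text, and count the test spans that also appear in the gold line. The names `SplitWords`, `ToSpans`, `LineScore`, and `CompareLine` are hypothetical. The `sameText` flag corresponds to the consistency check described earlier: both lines must contain the same text once delimiters are removed.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Span = std::pair<std::size_t, std::size_t>;  // [begin, end) byte offsets

// Split a segmented line on any delimiter character, dropping empty tokens.
// Assumption: '/' never occurs as real text, since it is treated as a delimiter.
std::vector<std::string> SplitWords(const std::string& line,
                                    const std::string& delims = " /") {
    std::vector<std::string> words;
    std::size_t start = line.find_first_not_of(delims);
    while (start != std::string::npos) {
        const std::size_t end = line.find_first_of(delims, start);
        words.push_back(line.substr(start, end - start));  // npos end takes the rest
        start = line.find_first_not_of(delims, end);
    }
    return words;
}

// Map each word to its span in the concatenated text, which is returned via `text`.
std::set<Span> ToSpans(const std::vector<std::string>& words, std::string& text) {
    std::set<Span> spans;
    text.clear();
    for (const std::string& w : words) {
        spans.insert(Span(text.size(), text.size() + w.size()));
        text += w;
    }
    return spans;
}

struct LineScore {
    int goldTotal;    // words in the gold line
    int testTotal;    // words in the test line
    int testCorrect;  // test words whose span matches a gold word
    bool sameText;    // true if both lines hold the same undelimited text
};

LineScore CompareLine(const std::string& goldLine, const std::string& testLine) {
    std::string goldText, testText;
    const std::set<Span> gold = ToSpans(SplitWords(goldLine), goldText);
    const std::set<Span> test = ToSpans(SplitWords(testLine), testText);

    LineScore s;
    s.goldTotal = static_cast<int>(gold.size());
    s.testTotal = static_cast<int>(test.size());
    s.sameText = (goldText == testText);
    s.testCorrect = 0;
    for (const Span& sp : test) {
        if (gold.count(sp) != 0) ++s.testCorrect;
    }
    return s;
}
```

When `sameText` is false the spans refer to different texts, so `testCorrect` should not be trusted for that line; a caller would normally report the mismatch instead of scoring it.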

Score.cpp

#include "Score.h"
namespace MyUtility
{
	Score::Score()
	{
		
	}
	Score::~Score()
	{