[cpp,algorithm,fpTree] FPTree_1: Reading a Dataset from a File and Parsing It

Aimer1027

Categories: Algorithms, fpTree, C++. Tags: C++, algorithm

Article link: https://blog.csdn.net/inuyasha1027/article/details/41605023

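This article covers the first step of implementing an FP-tree in C++: reading the itemsets from a data file and storing them in in-memory data structures. The data file holds a number of itemsets, each made up of several items. The article covers the design of the data structures (item, itemset, and transaction_database) and the steps for reading the data. It also points out the limits of using a vector as the transaction object, suggests a set or a custom data structure to support sorting and pruning, and closes with the experiment's shortcomings and possible improvements.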

Experiment goal:

Implement an FP-tree in C++.

Experiment 1:

Read the data from the file that stores it into the program, and keep it in a fixed data structure.

Data file format:

"item1":"apple,banana,beef" (followed by a CR/LF line break)

"item2":"bread,coke,beer,coffee" (followed by a CR/LF line break)

...

"itemi":"food1,food2...foodk" (no line break after the last line)

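Design notes:

An FP-tree involves three kinds of objects: 1. items, 2. transactions, 3. the transaction database (transaction_database).

An item is a single entry in the data file above, such as apple, coke, or beer. A transaction is the group of items (an itemset) that follows a name such as item1 or item2.

The transaction database corresponds to the whole data file. Note that the goal is to find the frequent itemsets of the file as a whole, and the transaction names play no part in that, so the names are not recorded when the file is read.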
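To make the file format and the dropping of transaction names concrete, here is a minimal sketch of splitting one line into its items. It is only an illustration: the helper name split_items is hypothetical and not part of the article's code, and it uses std::istringstream with std::getline rather than the character-by-character scan shown later.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the items of one line such as
// "item2":"bread,coke,beer,coffee". The transaction name is discarded.
std::vector<std::string> split_items(const std::string &line)
{
    std::vector<std::string> items;

    // Locate the quoted list that follows the ':' separator.
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos)
        return items;                      // no separator: nothing to parse
    const std::string::size_type open = line.find('"', colon + 1);
    const std::string::size_type close = line.rfind('"');
    if (open == std::string::npos || close <= open)
        return items;                      // list is not properly quoted

    // Split the text between the two quotes on commas.
    std::istringstream list(line.substr(open + 1, close - open - 1));
    std::string item;
    while (std::getline(list, item, ','))
        items.push_back(item);
    return items;
}
```

Searching for the closing quote from the end of the line means a stray carriage return left over from a CR/LF line break is ignored rather than glued onto the last item.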

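Before an FP-tree can be built, the data file has to be loaded into the database, and that is the main subject of this article.

Keep in mind that the data loaded from the file into the database (transaction_database) should be recorded one transaction at a time.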

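Building an FP-tree scans the database twice.

The first scan reads every item in the database, records the overall count total (in standard FP-growth this is the number of transactions), and counts how many times each item occurs. Multiplying total * min_threshold (the minimum threshold) gives the minimum support count, and every item that occurs fewer times than that is removed by "pruning". The surviving items are placed in an array, which is sorted in descending order of how often each item occurs in the database (transaction_database; because it is made up of itemsets, it is also called an itemset database).

The second scan walks through the database (transaction_database) one transaction at a time and builds the FP-tree from the data in each transaction.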
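The first scan described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the article's implementation: the function name frequent_items and the item_count alias are hypothetical, total is taken to be the number of transactions, and min_threshold is assumed to be a ratio between 0 and 1. The two typedefs mirror the ones used later in the article.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::string> transaction;
typedef std::vector<transaction> transaction_database;
typedef std::pair<std::string, std::size_t> item_count;   // hypothetical alias

// First scan: count every item, prune the infrequent ones, and return the
// survivors sorted by descending count.
// Assumption: "total" is the number of transactions in the database.
std::vector<item_count> frequent_items(const transaction_database &db,
                                       double min_threshold)
{
    std::map<std::string, std::size_t> counts;
    for (const transaction &t : db)
        for (const std::string &item : t)
            ++counts[item];

    const double min_support = static_cast<double>(db.size()) * min_threshold;

    std::vector<item_count> kept;
    for (const auto &entry : counts)
        if (static_cast<double>(entry.second) >= min_support)   // pruning step
            kept.push_back(item_count(entry.first, entry.second));

    // Descending by count; ties are broken by name so the order is stable.
    std::sort(kept.begin(), kept.end(),
              [](const item_count &a, const item_count &b) {
                  if (a.second != b.second)
                      return a.second > b.second;
                  return a.first < b.first;
              });
    return kept;
}
```

With four transactions and a threshold of 0.5, for instance, the minimum support count is 2, so any item seen only once is dropped before sorting.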

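Given the above, once the data file has been read into memory it should be stored in database form (transaction_database), and the unit stored in the database should be the transaction.

This suggests the following data structures:

item: string

itemset -> transaction

transaction-set -> transaction_database

In C++ this can be written as:

item : string

transaction : vector<string>

transaction_database : vector< vector<string> >

These definitions are provisional. An item will also need a count, and a transaction will also need to be sorted, so they will be revised later. The point here is to show how a data file in the format above is read into the program's data structures. Whether a map or a set is used afterwards, converting takes only a simple traversal, so the choice made here does not constrain later changes to the data structures.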
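As an illustration of how little is needed to switch containers later, the sketch below converts the vector-based database into one where each transaction is a std::set. The transaction_set alias and the to_sets function are hypothetical names introduced only for this example.

```cpp
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::string> transaction;
typedef std::vector<transaction> transaction_database;

// Hypothetical alternative layout: each transaction as an ordered set.
typedef std::set<std::string> transaction_set;

// One traversal converts the whole database.
std::vector<transaction_set> to_sets(const transaction_database &db)
{
    std::vector<transaction_set> result;
    result.reserve(db.size());
    for (const transaction &t : db)
        result.push_back(transaction_set(t.begin(), t.end()));  // sorts, drops duplicates
    return result;
}
```

Note that a std::set orders its elements by the key itself (alphabetically for strings), so ordering by frequency would still need a custom comparator or a separate sort.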

Environment:

VS 2012, C++

Within the whole project this module acts as a parser for the data file, so it is named parserFile.

//parserFile.h

#ifndef FILEPARSER_H
#define FILEPARSER_H

#include <exception>
#include <string>
#include <vector>

namespace  parser
{

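	// One transaction is the list of items on one line of the data file.
	typedef std::vector<std::string> transaction;
	// The database is the list of all transactions in the file.
	typedef std::vector< transaction > transaction_database ;


	// Reads the data file and returns one transaction per line.
	// Throws file_open_exception if the file cannot be opened.
	transaction_database parseFile ( const std::string &filename );

	class file_open_exception : public std::exception
	{
	private :
		const std::string message ;
	public :
		file_open_exception ( const std::string &filename ) :
		   message( std::string("failed to open file: ") + filename )
		   {}
		   // throw () matches VS 2012; newer compilers would use noexcept
		   virtual ~file_open_exception () throw () {}
		   virtual const char *what () const throw ()
		   {
			   return message.c_str() ;
		   }

	};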
}

#endif //FILEPARSER_H

//parserFile.cpp

#include <fstream>
#include <string>
#include <utility>

#include "parserFile.h"

using namespace std ;

parser::transaction_database parser::parseFile (const string &filename )
{
	parser::transaction_database db ;
	string line ;

	ifstream file (filename) ;

	if ( !file.is_open ())
		throw file_open_exception (filename) ;

	while (getline (file , line ))
	{
		// Each line looks like "item1":"a,b,c,e" ; the name before ':' is ignored.
		string::size_type colon = line.find(':') ;
		if ( colon == string::npos )
			continue ;	// blank or malformed line: skip it

		// The item list sits between the quote after ':' and the last quote
		// (searching from the end also tolerates a trailing '\r').
		string::size_type from = line.find('"' , colon + 1) ;
		string::size_type last = line.rfind('"') ;
		if ( from == string::npos || last <= from )
			continue ;
		from += 1 ;	// step past the opening quote

		parser::transaction trans ;	// a fresh transaction for every line

		while ( from < last )
		{
			// an item ends at the next ',' or at the closing quote
			string::size_type end = line.find(',' , from) ;
			if ( end == string::npos || end > last )
				end = last ;

			trans.push_back(line.substr(from , end - from)) ;
			from = end + 1 ;
		}

		db.push_back (move(trans)) ;
	}

	return db ;
}

//Main.cpp

#include <cstdlib>
#include <iostream>
#include <string>

#include "parserFile.h"


using namespace std ;

int main ( void )
{
	string filename ("d:\\test\\a.dat") ;
	parser::transaction_database database ;

	try
	{
		database = parser::parseFile(filename) ;
	}
	catch ( const parser::file_open_exception &e )
	{
		// report the problem instead of terminating on an uncaught exception
		cerr<<e.what()<<endl ;
		return 1 ;
	}

	cout<<endl<<"here are all the items in transaction database "<<endl;

	for ( const auto &i : database )	// i is one transaction : vector<string>
		for ( const auto &j : i )		// j is one item : string , printed directly
			cout<<j<<endl ;

	system("pause") ;	// Windows-only : keeps the console window open
	return 0 ;
}

//A simple test data file: a.dat

"Item1":"assert,beer,c"
"item2":"bread,deer,eraser"
"item3":"water,e-books,books"

If it runs without errors, the program lists every item of the three transactions, one per line.

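Shortcomings and improvements:

One shortcoming is that, when the FP-tree itself is implemented, a plain vector is a poor fit for the transaction data structure, because the items in a transaction have to be kept in sorted order, and a set is the usual choice for that. The author's plan, though, is to derive a class from an STL container and redefine some of its operations (insert, iterator, traverse, erase, ...); later experiments will cover this.

Likewise, reading the data from the file (on disk) into memory (into the program's data structures) only completes the preprocessing step. The first real operation counts every item across the transactions in the database and also involves pruning, so an item needs an extra count field. This can be done by creating a struct that wraps the current item object together with its count.

The pruning itself can be implemented with an STL container plus class inheritance.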
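As a rough illustration of the count field and the pruning step mentioned above, the sketch below pairs each item with its count in a struct, then filters and reorders one transaction. It is an assumption-laden sketch rather than the follow-up implementation: counted_item and order_transaction are hypothetical names, the frequency map is assumed to hold only the items that survived pruning, and it uses a plain vector with std::sort instead of the set or derived-container approach the text proposes.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record pairing an item with its occurrence count.
struct counted_item
{
    std::string name;
    std::size_t count;
};

// Keeps only the items found in "frequency" (assumed to contain the items
// that survived pruning) and orders them by descending count, then by name.
std::vector<counted_item> order_transaction(
    const std::vector<std::string> &trans,
    const std::map<std::string, std::size_t> &frequency)
{
    std::vector<counted_item> ordered;
    for (const std::string &name : trans)
    {
        const auto found = frequency.find(name);
        if (found != frequency.end())          // pruned items are skipped
        {
            counted_item ci = { name, found->second };
            ordered.push_back(ci);
        }
    }
    std::sort(ordered.begin(), ordered.end(),
              [](const counted_item &a, const counted_item &b) {
                  if (a.count != b.count)
                      return a.count > b.count;
                  return a.name < b.name;
              });
    return ordered;
}
```

A transaction reordered this way is in the form the second scan needs before it is inserted into the tree.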
