[cpp,algorithm] FpTree_2_First scan of the database to extract the item set_and pruning by support

Link to this article: https://blog.csdn.net/inuyasha1027/article/details/41621445

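Goal:

This continues the previous experiment, 1_Reading the data set from a file and parsing it.

In experiment 1, the data was loaded from the data file into the program's transaction database (transaction_database) object.

The goal of this experiment is to scan that database, pick out every item (item) that occurs in it, and collect the items into an item set (itemset).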

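Approach:

To be more specific, the feature to implement is the first database scan of the fp-tree algorithm: extract every item (item) from the database (transaction_database),

store the items in an item set (itemset) object, and record the number of distinct items, itemset.size.

From that we can compute the minimum occurrence count required of a frequent item: min_frequency = min_sup * itemset.size ,

(where min_sup is the minimum support of a frequent item, entered as a fraction below 1)

(Support is conventionally measured against the number of transactions, i.e. min_sup * transaction_database.size ; this experiment uses itemset.size , and the code in this article follows that.)

This minimum occurrence count (minimum frequency) is the threshold that decides whether an item counts as frequent:

any item in itemset whose frequency < the minimum frequency threshold is removed by pruning (prune).

Finally, the items (item) that remain in the item set (itemset) are sorted by occurrence count in descending order.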
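
The threshold and pruning steps above can be sketched as follows. This is a minimal sketch rather than the article's implementation: item_counts, min_frequency and prune are illustrative names, and the total that min_sup is multiplied by is passed in as a parameter, so either the number of distinct items (as in this article) or the number of transactions can be used.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the itemset described above: item -> occurrence count.
typedef std::unordered_map<std::string, std::size_t> item_counts;

// min_frequency = min_sup * base, truncated to an integer.
// `base` is whatever total the support is measured against
// (this article uses the number of distinct items).
std::size_t min_frequency(double min_sup, std::size_t base)
{
    assert(min_sup >= 0.0 && min_sup <= 1.0);
    return static_cast<std::size_t>(min_sup * base);
}

// Remove every item whose count is below the threshold.
// erase() returns the iterator following the removed element,
// so the loop stays valid while elements are being deleted.
void prune(item_counts &counts, std::size_t threshold)
{
    for (auto it = counts.begin(); it != counts.end(); )
    {
        if (it->second < threshold)
            it = counts.erase(it);
        else
            ++it;
    }
}
```

For example, with four counted items and min_sup = 0.5 the threshold is 2, so an item seen only once is removed while the others keep their counts.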

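Method:

Design the data structures based on the above.

1) . Item set itemset

itemset counts how many times each item occurs,

and its entries are later sorted and removed according to those counts.

So itemset is built on unordered_map<item , size_t >,

and the following struct is defined alongside it: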

struct item_with_counter

{

item _item ;

size_t counter ;

} ;

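On top of this struct, define an STL type that keeps the items in order:

typedef set<item_with_counter> ordered_set ;

For set to order the items (item) by occurrence count from largest to smallest,

the < operator (less-than) has to be overloaded for item_with_counter: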

bool operator< (const item_with_counter & a, const item_with_counter &b )

{

if (a.counter != b.counter )

return a.counter > b.counter ; //if not equal , descending by counter

else

return a._item < b._item ; //if the counters are equal , ascending by the item itself

//if item is a char * , this comparison needs strcmp(a._item , b._item ) < 0 instead of '<'

}
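
To see the effect of this operator<, the sketch below puts a few item_with_counter values into a std::set and reads them back. It assumes std::string as the item type; names_in_order is a hypothetical helper added only for the illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct item_with_counter
{
    std::string _item;
    std::size_t counter;
};

// Higher counter first; equal counters fall back to ascending item name.
bool operator<(const item_with_counter &a, const item_with_counter &b)
{
    if (a.counter != b.counter)
        return a.counter > b.counter;
    return a._item < b._item;
}

// Hypothetical helper: returns the item names in the order std::set iterates them.
std::vector<std::string> names_in_order(const std::set<item_with_counter> &s)
{
    std::vector<std::string> names;
    for (const auto &e : s)
        names.push_back(e._item);
    return names;
}
```

Inserting (water, 2), (beer, 5), (c, 2) and (books, 4) in that order iterates as beer, books, c, water: counters descending, with the tie between c and water broken by name.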

2). Transaction object transaction

A single transaction may contain several items, i.e. one transaction maps to many items,

so unordered_set< item > is used as the type of transaction:

typedef unordered_set <item> transaction ;

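3). Transaction database transaction_database

In a database, data is stored row by row as records (tuples),

and here a transaction is treated as the counterpart of a record in a relational database.

( The two differ in important ways, though: neither the number nor the kind of items (item) in a transaction is fixed,

whereas every record (record) or tuple (tuple) in a relational database has a fixed number of attribute fields (field), each with a fixed type. )

So the transaction database is designed the way a relational table (table) organizes its tuples (tuple),

and is defined as: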

typedef vector<transaction> transaction_database ;

With the definitions above we get the following structure:

|_transaction_database_| :

|_transaction_|________items__________|

transaction[0] : { apple, banana }

transaction[1] : {water , paper , book , e-book}

....

transaction [n] : {book , beer, food1 ,coffee, food2 ,food3}

together with a data structure that counts how many times each item occurs in the transaction_database:

|_itemset_| :

|_item_|_counter_|

...
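
The two structures above can be put together in a few lines. The sketch below assumes std::string as the item type and uses a free function extract_itemset for the counting pass; it only illustrates the layout, and the names are chosen for the example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

typedef std::string item;
typedef std::unordered_set<item> transaction;
typedef std::vector<transaction> transaction_database;
typedef std::unordered_map<item, std::size_t> itemset;

// One pass over the database: every occurrence of an item bumps its counter.
// operator[] value-initializes a missing counter to 0 before the increment.
itemset extract_itemset(const transaction_database &db)
{
    itemset counts;
    for (const transaction &t : db)
        for (const item &i : t)
            ++counts[i];
    return counts;
}
```

Because a transaction is an unordered_set, an item that is listed twice in the same transaction is still counted once for that transaction.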

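Code:

The code is a modification of fileParser.h and fileParser.cpp, written yesterday (strictly speaking, in the early hours of today).

Because the data structures changed, the structures and variables that hold the data parsed from the file changed accordingly.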

//item.h

#ifndef ITEM_H
#define ITEM_H

#include <string>

namespace fptSpace
{
	    typedef  std::string  item_type ;
		 
}
 
#endif //ITEM_H

//itemset.h

#ifndef ITEMSET_H
#define ITEMSET_H

#include <unordered_map>
#include <set>
#include <cstddef>
#include <cassert>

#include "item.h"

namespace fptSpace
{
	struct item_with_counter
	{
		item_type _item ;
		std::size_t counter ;
		 
		item_with_counter() : counter( 0 )	//do not leave counter uninitialized
		{}
		item_with_counter( item_type i , std::size_t c  ):
			_item(std::string (i ) ) , counter( c )  
		{}
	} ;


	//operator< overload that defines the ordering used by std::set
	bool operator< ( const item_with_counter & , const item_with_counter&) ;

	typedef std::set<item_with_counter>  ordered_itemset ;

	//class itemset is used for analyzing the frequency of
	//each item stored in transaction_database 

	class itemset : private std::unordered_map< fptSpace::item_type , std::size_t >
	{
		//container_type is an alias for the base class ,
		//so that the base class's members can be named below
		typedef  std::unordered_map<item_type, std::size_t > container_type ;

	public :
		void prune (std::size_t min_sup ) ;
		void insert ( fptSpace::item_type i )
		{
			++(*this)[i] ;
		}
	 
		ordered_itemset get_ordered()  const ;
		using container_type::iterator ;
		using container_type::const_iterator ;
		using container_type::begin ;
		using container_type::end ;
		using container_type::size ; 
	} ;
	
}

//the operator overload is implemented here 

inline bool fptSpace::operator< ( const item_with_counter &a , const item_with_counter &b )
{
	if ( a.counter != b.counter)
		return a.counter > b.counter ;
	else
		return a._item< b._item ;
}


#endif //ITEMSET_H

//itemset.cpp

#include "itemset.h"

using fptSpace::itemset ;


void itemset::prune ( std::size_t min_sup )
{
	for ( auto i = begin () ; i != end () ; )
		if( i->second < min_sup)
			i = erase (i) ;
		else
			i++ ;
}

fptSpace::ordered_itemset itemset::get_ordered() const
{
	ordered_itemset order_set ;

	//insert temporaries directly ; allocating each element with new would leak it
	for ( const auto &i : (*this ) )
		order_set.insert( item_with_counter(i.first , i.second) ) ;

	return order_set ;
}

//transaction.h

#ifndef	TRANSACTION_H
#define TRANSACTION_H

#include <unordered_set>
#include <vector>
#include "item.h"
#include "itemset.h"

namespace fptSpace
{
	typedef std::unordered_set<item_type> transaction ;

	//here is the transaction_database which is parsed from the data file 
	class transaction_database : private std::vector<transaction>
	{
	public :
		void insert ( fptSpace::transaction trans ) ;
		fptSpace::itemset  extract_itemset () const ;

		//members re-exported unchanged from vector<transaction>
		using std::vector<transaction>::iterator ;
		using std::vector<transaction>::const_iterator;
		using std::vector<transaction>::begin ;
		using std::vector<transaction>::end ;
		using std::vector<transaction>::size ;
	};

}

#endif //TRANSACTION_H

//transaction.cpp

#include <utility>
#include "transaction.h"

using fptSpace::transaction_database ;

void transaction_database::insert ( fptSpace::transaction trans )
{
	push_back( std::move(trans)) ;	
}

fptSpace::itemset transaction_database::extract_itemset() const
{
	fptSpace::itemset items ;

	for ( const transaction &i : (*this ) )
		for ( const item_type &j : i )
			items.insert( j ) ;

	return items ;
}

//parser.h

#ifndef PARSER_H
#define PARSER_H

#include <exception>
#include <string>
#include <vector>
#include "transaction.h"

 
namespace fptSpace
{
	fptSpace::transaction_database parseFromFile ( const std::string &filename ) ;

	class file_open_exception : public std::exception 
	{
	private :
		const std::string message ;
	public :
		file_open_exception ( const std::string &filename ) :
			message( std::string("fail to open file : ")+filename  )
		{}
		virtual ~file_open_exception () noexcept {}
		virtual const char *what () const noexcept 
		{
			return message.c_str () ;
		}
	};
} 

#endif //PARSER_H

//parser.cpp

#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <utility>	//std::move

#include "parser.h"

using namespace std ;

fptSpace::transaction_database fptSpace::parseFromFile ( const string &filename )
{
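	fptSpace::transaction_database transaction_db ;
	string line ;

	ifstream file ( filename ) ;

	if ( !file.is_open() )
		throw file_open_exception( filename ) ;

	while ( getline (file , line ) )
	{
		cout <<line <<endl ;

		//a line looks like "key":"item1,item2,..." ; the items start after the ':' and the opening quote
		string::size_type from = line.find(':') ;
		if ( from == string::npos || from + 1 >= line.size() )
			continue ;	//skip empty or malformed lines instead of reading past the end
		from += 2 ;

		//a fresh transaction for every line : a moved-from set is left in an unspecified state
		fptSpace::transaction trans ;

		while ( from < line.size() )
		{
			//the next item ends at ',' or at the closing quote (or at the end of the line)
			string::size_type end = line.find_first_of(",\"" , from ) ;
			if ( end == string::npos )
				end = line.size() ;
			string substr = line.substr(from , end - from ) ;

			cout <<substr <<endl;

			trans.insert(substr) ;

			if ( end < line.size() && line[end] == ',' )
				from = end+1 ;
			else
				break ;
		}//while
		transaction_db.insert(move(trans)) ;
	}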

	return transaction_db ;
}
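
As a point of comparison for the manual scanning above, a single line of the form "key":"item1,item2,..." can also be split with std::istringstream and std::getline. split_items is a hypothetical helper, not part of the article's parser, and it assumes that items contain neither commas nor double quotes.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the article's parser): extracts the items
// from one line of the form "key":"item1,item2,...".
// Returns an empty vector when the line does not match that shape.
std::vector<std::string> split_items(const std::string &line)
{
    std::vector<std::string> items;

    // The item list sits between the quote that follows ':' and the last quote.
    std::string::size_type open = line.find(":\"");
    std::string::size_type close = line.rfind('"');
    if (open == std::string::npos || close == std::string::npos || close < open + 2)
        return items;

    // Split the comma-separated list with getline and ',' as the delimiter.
    std::istringstream fields(line.substr(open + 2, close - (open + 2)));
    std::string field;
    while (std::getline(fields, field, ','))
        items.push_back(field);
    return items;
}
```

Returning a vector keeps the items in file order; the caller can still insert them into an unordered_set to build a transaction.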

//Main.cpp

#include <cstdio>
#include <cstdlib>
#include <iostream>

//#include "fileParser.h"
#include "parser.h"

using namespace std ;

int main ( void )
{
	string filename ("d:\\test\\a.dat") ;

	/*parser::transaction_database database ;
	database = parser::parseFile(filename) ;

	cout<<endl<<"here are all the items in transaction database "<<endl;

	for ( const auto &i : database ) 	 //traverse vector < vector <string> > ; i is of type vector<string>
		for (const auto &j : i )	//traverse vector <string > ; j is of type string , so it can be output directly
			cout<<j<<endl ;
	*/
	fptSpace::transaction_database trans_db = fptSpace::parseFromFile(filename) ;

	//trans_db is a vector of transactions ; i is one transaction in trans_db 
	//a transaction is an unordered_set<string> , 
	//so j is one string element of that transaction 
	for ( const auto &i : trans_db ) 
		for ( const auto &j : i )
			cout<<"name "<<j<<endl ;

	fptSpace::itemset itemSet = trans_db.extract_itemset() ;

	//itemset wraps an unordered_map<item_type , size_t >
	//item_type is std::string 

	cout<<"itemset unordered " <<endl ;
	for ( const auto &i : itemSet)
		cout<<"name   "<<i.first<<"   frequency:   "<<i.second<<endl ;
		
	int total_item_counter = itemSet.size() ;
	double min_support ;
	int sup_threshold ;
	cout<<"input min support (min_sup < 1 )"<<endl;
	cin >> min_support ;

	sup_threshold = (int) (min_support*total_item_counter );
	cout<<"after calculation we got the threshold   "<<sup_threshold<<endl ;
	cout<<"itemset after  pruning" <<endl ;
	itemSet.prune(sup_threshold) ;

	for ( const auto &i : itemSet)
		cout<<"name   "<<i.first<<"   frequency:   "<<i.second<<endl ;

	cout<<"itemset after sorting "<<endl ;
	fptSpace::ordered_itemset ordered_item_set = itemSet.get_ordered() ;

	for ( const auto &i : ordered_item_set )
		cout<<"name :  "<<i._item<<" frequency :"<<i.counter<<endl ;

	system("pause") ;	//Windows only : keeps the console window open
	return 0 ;
}

//a.dat

"Item1":"assert,beer,c"
"item2":"bread,deer,eraser,c"
"item3":"water,e-books,books,beer"
"item4":"water,e-books,books,beer"
"item5":"books"
"item6":"beer"
"item7":"beer,books"

Results:

If the whole program runs without errors, it displays the following result: