GSP算法详解与实现-CSDN博客

参考资料：http://blog.csdn.net/zone_programming/article/details/42032309

更多数据挖掘代码：https://github.com/linyiqun/DataMiningAlgorithm

介绍

GSP算法是序列模式挖掘算法的一种，他是一种类Apriori的一种，整个过程与Apriori算法比较类似，不过在细节上会略有不同，在下面的描述中，将会有所描述。GSP在原有的频繁模式定义的概念下，增加了3个的概念。

1、加入时间约束min_gap，max_gap，要求原来的连续变为只要满足在规定的min_gap到max_gap之间即可。

2、加入time_windows_size，只要在windows_size内的item，都可以被认为是同一ItemSet。

3、加入分类标准。

以上3点新的中的第一条特征将会在后面的算法中着重展现。

算法原理

1、根据所输入的序列，找出所有的单项集，即1频繁模式，这里会经过最小支持度阈值的判断。

2、根据1频繁模式进行连接运算，产生2频繁模式，这里会有进行最小阈值的判断。

3、根据2频繁模式连接产生3频繁模式，会经过最小支持度判断和剪枝操作，剪枝操作的原理在于判断他的所有子集是否也全是频繁模式。

4、3频繁模式不断的挖掘知道不能够产生出候选集为止。

连接操作的原理

2个序列，全部变为item列表的形式，如果a序列去掉第1个元素后，b序列去掉最后1个序列，2个序列的item完全一致，则代表可以连接，由b的最后一个元素加入到a中，至于是以独立项集的身份加入还是加入到a中最后1个项集中取决于b中的最后一个元素所属项集是否为单项项集。

时间约束计算

这个是用在支持度计数使用的，GSP算法的支持度计算不是那么简单，比如序列判断<2, <3, 4>>是否在序列<(1,5), 2 , <3, 4>, 2>，这就不能仅仅判断序列中是否只包含2,<3, 4>就行了，还要满足时间间隔约束，这就要把2，和<3,4>的所有出现时间都找出来，然后再里面找出一条满足时间约束的路径就算包含。时间的定义是从左往右起1.2,3...继续，以1个项集为单位，所有2的时间有2个分别为t=2和t=4，然后同理，因为<3,4>在序列中只有1次，所以时间为t=3，所以问题就变为了下面一个数组的问题

2 4

从时间数组的上往下，通过对多个时间的组合，找出1条满足时间约束的方案，这里的方案只有2-3,4-3,然后判断时间间隔，如果存在这样的方式，则代表此序列支持所给定序列，支持度值加1,这个算法在程序的实现中是比较复杂的。

算法的代码实现

测试数据输入(格式：事务ID item数 item1 item2.....):

[java]view plaincopyprint? 
   



















最后组成的序列为：

<(1,5) 2 3 4>

<1 3 4 (3,5)>

<1 2 3 4 5>

<1 3 5>

<4 5>

也就是说同一序列都是同事务的。下面是关键的类

Sequence.java:

[java]view plaincopyprint? 
   
 packageDataMining_GSP; 
  
 importjava.util.ArrayList; 
  
 /** 
 *序列，每个序列内部包含多组ItemSet项集 
 * 
 *@authorlyq 
 * 
 */ 
 publicclassSequenceimplementsComparable<Sequence>,Cloneable{ 
 //序列所属事务ID 
 privateinttrsanctionID; 
 //项集列表 
 privateArrayList<ItemSet>itemSetList; 
  
 publicSequence(inttrsanctionID){ 
 this.trsanctionID=trsanctionID; 
 this.itemSetList=newArrayList<>(); 
 } 
  
 publicSequence(){ 
 this.itemSetList=newArrayList<>(); 
 } 
  
 publicintgetTrsanctionID(){ 
 returntrsanctionID; 
 } 
  
 publicvoidsetTrsanctionID(inttrsanctionID){ 
 this.trsanctionID=trsanctionID; 
 } 
  
 publicArrayList<ItemSet>getItemSetList(){ 
 returnitemSetList; 
 } 
  
 publicvoidsetItemSetList(ArrayList<ItemSet>itemSetList){ 
 this.itemSetList=itemSetList; 
 } 
  
 /** 
 *取出序列中第一个项集的第一个元素 
 * 
 *@return 
 */ 
 publicIntegergetFirstItemSetNum(){ 
 returnthis.getItemSetList().get(0).getItems().get(0); 
 } 
  
 /** 
 *获取序列中最后一个项集 
 * 
 *@return 
 */ 
 publicItemSetgetLastItemSet(){ 
 returngetItemSetList().get(getItemSetList().size()-1); 
 } 
  
 /** 
 *获取序列中最后一个项集的最后一个一个元素 
 * 
 *@return 
 */ 
 publicIntegergetLastItemSetNum(){ 
 ItemSetlastItemSet=getItemSetList().get(getItemSetList().size()-1); 
 intlastItemNum=lastItemSet.getItems().get( 
 lastItemSet.getItems().size()-1); 
  
 returnlastItemNum; 
 } 
  
 /** 
 *判断序列中最后一个项集是否为单一的值 
 * 
 *@return 
 */ 
 publicbooleanisLastItemSetSingleNum(){ 
 ItemSetlastItemSet=getItemSetList().get(getItemSetList().size()-1); 
 intsize=lastItemSet.getItems().size(); 
  
 returnsize==1?true:false; 
 } 
  
 @Override 
 publicintcompareTo(Sequenceo){ 
 //TODOAuto-generatedmethodstub 
 returnthis.getFirstItemSetNum().compareTo(o.getFirstItemSetNum()); 
 } 
  
 @Override 
 protectedObjectclone()throwsCloneNotSupportedException{ 
 //TODOAuto-generatedmethodstub 
 returnsuper.clone(); 
 } 
  
 /** 
 *拷贝一份一模一样的序列 
 */ 
 publicSequencecopySeqence(){ 
 SequencecopySeq=newSequence(); 
 for(ItemSetitemSet:this.itemSetList){ 
 copySeq.getItemSetList().add(newItemSet(itemSet.copyItems())); 
 } 
  
 returncopySeq; 
 } 
  
 /** 
 *比较2个序列是否相等，需要判断内部的每个项集是否完全一致 
 * 
 *@paramseq 
 *比较的序列对象 
 *@return 
 */ 
 publicbooleancompareIsSame(Sequenceseq){ 
 booleanresult=true; 
 ArrayList<ItemSet>itemSetList2=seq.getItemSetList(); 
 ItemSettempItemSet1; 
 ItemSettempItemSet2; 
  
 if(itemSetList2.size()!=this.itemSetList.size()){ 
 returnfalse; 
 } 
 for(inti=0;i<itemSetList2.size();i++){ 
 tempItemSet1=this.itemSetList.get(i); 
 tempItemSet2=itemSetList2.get(i); 
  
 if(!tempItemSet1.compareIsSame(tempItemSet2)){ 
 //只要不相等，直接退出函数 
 result=false; 
 break; 
 } 
 } 
  
 returnresult; 
 } 
  
 /** 
 *生成此序列的所有子序列 
 * 
 *@return 
 */ 
 publicArrayList<Sequence>createChildSeqs(){ 
 ArrayList<Sequence>childSeqs=newArrayList<>(); 
 ArrayList<Integer>tempItems; 
 SequencetempSeq=null; 
 ItemSettempItemSet; 
  
 for(inti=0;i<this.itemSetList.size();i++){ 
 tempItemSet=itemSetList.get(i); 
 if(tempItemSet.getItems().size()==1){ 
 tempSeq=this.copySeqence(); 
  
 //如果只有项集中只有1个元素，则直接移除 
 tempSeq.itemSetList.remove(i); 
 childSeqs.add(tempSeq); 
 }else{ 
 tempItems=tempItemSet.getItems(); 
 for(intj=0;j<tempItems.size();j++){ 
 tempSeq=this.copySeqence(); 
  
 //在拷贝的序列中移除一个数字 
 tempSeq.getItemSetList().get(i).getItems().remove(j); 
 childSeqs.add(tempSeq); 
 } 
 } 
 } 
  
 returnchildSeqs; 
 } 
  
 } 

ItemSet.java:

[java]view plaincopyprint? 
   
 packageDataMining_GSP; 
  
 importjava.util.ArrayList; 
  
 /** 
 *序列中的子项集 
 * 
 *@authorlyq 
 * 
 */ 
 publicclassItemSet{ 
 /** 
 *项集中保存的是数字项数组 
 */ 
 privateArrayList<Integer>items; 
  
 publicItemSet(String[]itemStr){ 
 items=newArrayList<>(); 
 for(Strings:itemStr){ 
 items.add(Integer.parseInt(s)); 
 } 
 } 
  
 publicItemSet(int[]itemNum){ 
 items=newArrayList<>(); 
 for(intnum:itemNum){ 
 items.add(num); 
 } 
 } 
  
 publicItemSet(ArrayList<Integer>itemNum){ 
 this.items=itemNum; 
 } 
  
 publicArrayList<Integer>getItems(){ 
 returnitems; 
 } 
  
 publicvoidsetItems(ArrayList<Integer>items){ 
 this.items=items; 
 } 
  
 /** 
 *判断2个项集是否相等 
 * 
 *@paramitemSet 
 *比较对象 
 *@return 
 */ 
 publicbooleancompareIsSame(ItemSetitemSet){ 
 booleanresult=true; 
  
 if(this.items.size()!=itemSet.items.size()){ 
 returnfalse; 
 } 
  
 for(inti=0;i<itemSet.items.size();i++){ 
 if(this.items.get(i)!=itemSet.items.get(i)){ 
 //只要有值不相等，直接算作不相等 
 result=false; 
 break; 
 } 
 } 
  
 returnresult; 
 } 
  
 /** 
 *拷贝项集中同样的数据一份 
 * 
 *@return 
 */ 
 publicArrayList<Integer>copyItems(){ 
 ArrayList<Integer>copyItems=newArrayList<>(); 
  
 for(intnum:this.items){ 
 copyItems.add(num); 
 } 
  
 returncopyItems; 
 } 
 } 

GSPTool.java(算法工具类)：

[java]view plaincopyprint? 
   
 packageDataMining_GSP; 
  
 importjava.io.BufferedReader; 
 importjava.io.File; 
 importjava.io.FileReader; 
 importjava.io.IOException; 
 importjava.util.ArrayList; 
 importjava.util.Collections; 
 importjava.util.HashMap; 
 importjava.util.Map; 
  
 /** 
 *GSP序列模式分析算法 
 * 
 *@authorlyq 
 * 
 */ 
 publicclassGSPTool{ 
 //测试数据文件地址 
 privateStringfilePath; 
 //最小支持度阈值 
 privateintminSupportCount; 
 //时间最小间隔 
 privateintmin_gap; 
 //时间最大间隔 
 privateintmax_gap; 
 //原始数据序列 
 privateArrayList<Sequence>totalSequences; 
 //GSP算法中产生的所有的频繁项集序列 
 privateArrayList<Sequence>totalFrequencySeqs; 
 //序列项数字对时间的映射图容器 
 privateArrayList<ArrayList<HashMap<Integer,Integer>>>itemNum2Time; 
  
 publicGSPTool(StringfilePath,intminSupportCount,intmin_gap, 
 intmax_gap){ 
 this.filePath=filePath; 
 this.minSupportCount=minSupportCount; 
 this.min_gap=min_gap; 
 this.max_gap=max_gap; 
 totalFrequencySeqs=newArrayList<>(); 
 readDataFile(); 
 } 
  
 /** 
 *从文件中读取数据 
 */ 
 privatevoidreadDataFile(){ 
 Filefile=newFile(filePath); 
 ArrayList<String[]>dataArray=newArrayList<String[]>(); 
  
 try{ 
 BufferedReaderin=newBufferedReader(newFileReader(file)); 
 Stringstr; 
 String[]tempArray; 
 while((str=in.readLine())!=null){ 
 tempArray=str.split(""); 
 dataArray.add(tempArray); 
 } 
 in.close(); 
 }catch(IOExceptione){ 
 e.getStackTrace(); 
 } 
  
 HashMap<Integer,Sequence>mapSeq=newHashMap<>(); 
 Sequenceseq; 
 ItemSetitemSet; 
 inttID; 
 String[]itemStr; 
 for(String[]str:dataArray){ 
 tID=Integer.parseInt(str[0]); 
 itemStr=newString[Integer.parseInt(str[1])]; 
 System.arraycopy(str,2,itemStr,0,itemStr.length); 
 itemSet=newItemSet(itemStr); 
  
 if(mapSeq.containsKey(tID)){ 
 seq=mapSeq.get(tID); 
 }else{ 
 seq=newSequence(tID); 
 } 
 seq.getItemSetList().add(itemSet); 
 mapSeq.put(tID,seq); 
 } 
  
 //将序列图加入到序列List中 
 totalSequences=newArrayList<>(); 
 for(Map.Entryentry:mapSeq.entrySet()){ 
 totalSequences.add((Sequence)entry.getValue()); 
 } 
 } 
  
 /** 
 *生成1频繁项集 
 * 
 *@return 
 */ 
 privateArrayList<Sequence>generateOneFrequencyItem(){ 
 intcount=0; 
 intcurrentTransanctionID=0; 
 SequencetempSeq; 
 ItemSettempItemSet; 
 HashMap<Integer,Integer>itemNumMap=newHashMap<>(); 
 ArrayList<Sequence>seqList=newArrayList<>(); 
  
 for(Sequenceseq:totalSequences){ 
 for(ItemSetitemSet:seq.getItemSetList()){ 
 for(intnum:itemSet.getItems()){ 
 //如果没有此种类型项，则进行添加操作 
 if(!itemNumMap.containsKey(num)){ 
 itemNumMap.put(num,1); 
 } 
 } 
 } 
 } 
  
 booleanisContain=false; 
 intnumber=0; 
 for(Map.Entryentry:itemNumMap.entrySet()){ 
 count=0; 
 number=(int)entry.getKey(); 
 for(Sequenceseq:totalSequences){ 
 isContain=false; 
  
 for(ItemSetitemSet:seq.getItemSetList()){ 
 for(intnum:itemSet.getItems()){ 
 if(num==number){ 
 isContain=true; 
 break; 
 } 
 } 
  
 if(isContain){ 
 break; 
 } 
 } 
  
 if(isContain){ 
 count++; 
 } 
 } 
  
 itemNumMap.put(number,count); 
 } 
  
  
 for(Map.Entryentry:itemNumMap.entrySet()){ 
 count=(int)entry.getValue(); 
 if(count>=minSupportCount){ 
 tempSeq=newSequence(); 
 tempItemSet=newItemSet(newint[]{(int)entry.getKey()}); 
  
 tempSeq.getItemSetList().add(tempItemSet); 
 seqList.add(tempSeq); 
 } 
  
 } 
 //将序列升序排列 
 Collections.sort(seqList); 
 //将频繁1项集加入总频繁项集列表中 
 totalFrequencySeqs.addAll(seqList); 
  
 returnseqList; 
 } 
  
 /** 
 *通过1频繁项集连接产生2频繁项集 
 * 
 *@paramoneSeq 
 *1频繁项集序列 
 *@return 
 */ 
 privateArrayList<Sequence>generateTwoFrequencyItem( 
 ArrayList<Sequence>oneSeq){ 
 SequencetempSeq; 
 ArrayList<Sequence>resultSeq=newArrayList<>(); 
 ItemSettempItemSet; 
 intnum1; 
 intnum2; 
  
 //假如将<a>,<b>2个1频繁项集做连接组合，可以分为<aa>，<ab>，<ba>,<bb>4个序列模式 
 //注意此时的每个序列中包含2个独立项集 
 for(inti=0;i<oneSeq.size();i++){ 
 num1=oneSeq.get(i).getFirstItemSetNum(); 
 for(intj=0;j<oneSeq.size();j++){ 
 num2=oneSeq.get(j).getFirstItemSetNum(); 
  
 tempSeq=newSequence(); 
 tempItemSet=newItemSet(newint[]{num1}); 
 tempSeq.getItemSetList().add(tempItemSet); 
 tempItemSet=newItemSet(newint[]{num2}); 
 tempSeq.getItemSetList().add(tempItemSet); 
  
 if(countSupport(tempSeq)>=minSupportCount){ 
 resultSeq.add(tempSeq); 
 } 
 } 
 } 
  
 //上面连接还有1种情况是每个序列中只包含有一个项集的情况，此时a,b的划分则是<(a,a)><(a,b)><(b,b)> 
 for(inti=0;i<oneSeq.size();i++){ 
 num1=oneSeq.get(i).getFirstItemSetNum(); 
 for(intj=i;j<oneSeq.size();j++){ 
 num2=oneSeq.get(j).getFirstItemSetNum(); 
  
 tempSeq=newSequence(); 
 tempItemSet=newItemSet(newint[]{num1,num2}); 
 tempSeq.getItemSetList().add(tempItemSet); 
  
 if(countSupport(tempSeq)>=minSupportCount){ 
 resultSeq.add(tempSeq); 
 } 
 } 
 } 
 //同样将2频繁项集加入到总频繁项集中 
 totalFrequencySeqs.addAll(resultSeq); 
  
 returnresultSeq; 
 } 
  
 /** 
 *根据上次的频繁集连接产生新的侯选集 
 * 
 *@paramseqList 
 *上次产生的候选集 
 *@return 
 */ 
 privateArrayList<Sequence>generateCandidateItem( 
 ArrayList<Sequence>seqList){ 
 SequencetempSeq; 
 ArrayList<Integer>tempNumArray; 
 ArrayList<Sequence>resultSeq=newArrayList<>(); 
 //序列数字项列表 
 ArrayList<ArrayList<Integer>>seqNums=newArrayList<>(); 
  
 for(inti=0;i<seqList.size();i++){ 
 tempNumArray=newArrayList<>(); 
 tempSeq=seqList.get(i); 
 for(ItemSetitemSet:tempSeq.getItemSetList()){ 
 tempNumArray.addAll(itemSet.copyItems()); 
 } 
 seqNums.add(tempNumArray); 
 } 
  
 ArrayList<Integer>array1; 
 ArrayList<Integer>array2; 
 //序列i,j的拷贝 
 Sequenceseqi=null; 
 Sequenceseqj=null; 
 //判断是否能够连接，默认能连接 
 booleancanConnect=true; 
 //进行连接运算，包括自己与自己连接 
 for(inti=0;i<seqNums.size();i++){ 
 for(intj=0;j<seqNums.size();j++){ 
 array1=(ArrayList<Integer>)seqNums.get(i).clone(); 
 array2=(ArrayList<Integer>)seqNums.get(j).clone(); 
  
 //将第一个数字组去掉第一个，第二个数字组去掉最后一个，如果剩下的部分相等，则可以连接 
 array1.remove(0); 
 array2.remove(array2.size()-1); 
  
 canConnect=true; 
 for(intk=0;k<array1.size();k++){ 
 if(array1.get(k)!=array2.get(k)){ 
 canConnect=false; 
 break; 
 } 
 } 
  
 if(canConnect){ 
 seqi=seqList.get(i).copySeqence(); 
 seqj=seqList.get(j).copySeqence(); 
  
 intlastItemNum=seqj.getLastItemSetNum(); 
 if(seqj.isLastItemSetSingleNum()){ 
 //如果j序列的最后项集为单一值，则最后一个数字以独立项集加入i序列 
 ItemSetitemSet=newItemSet(newint[]{lastItemNum}); 
 seqi.getItemSetList().add(itemSet); 
 }else{ 
 //如果j序列的最后项集为非单一值，则最后一个数字加入i序列最后一个项集中 
 ItemSetitemSet=seqi.getLastItemSet(); 
 itemSet.getItems().add(lastItemNum); 
 } 
  
 //判断是否超过最小支持度阈值 
 if(isChildSeqContained(seqi) 
 &&countSupport(seqi)>=minSupportCount){ 
 resultSeq.add(seqi); 
 } 
 } 
 } 
 } 
  
 totalFrequencySeqs.addAll(resultSeq); 
 returnresultSeq; 
 } 
  
 /** 
 *判断此序列的所有子序列是否也是频繁序列 
 * 
 *@paramseq 
 *待比较序列 
 *@return 
 */ 
 privatebooleanisChildSeqContained(Sequenceseq){ 
 booleanisContained=false; 
 ArrayList<Sequence>childSeqs; 
  
 childSeqs=seq.createChildSeqs(); 
 for(SequencetempSeq:childSeqs){ 
 isContained=false; 
  
 for(SequencefrequencySeq:totalFrequencySeqs){ 
 if(tempSeq.compareIsSame(frequencySeq)){ 
 isContained=true; 
 break; 
 } 
 } 
  
 if(!isContained){ 
 break; 
 } 
 } 
  
 returnisContained; 
 } 
  
 /** 
 *候选集判断支持度的值 
 * 
 *@paramseq 
 *待判断序列 
 *@return 
 */ 
 privateintcountSupport(Sequenceseq){ 
 intcount=0; 
 intmatchNum=0; 
 SequencetempSeq; 
 ItemSettempItemSet; 
 HashMap<Integer,Integer>timeMap; 
 ArrayList<ItemSet>itemSetList; 
 ArrayList<ArrayList<Integer>>numArray=newArrayList<>(); 
 //每项集对应的时间链表 
 ArrayList<ArrayList<Integer>>timeArray=newArrayList<>(); 
  
 for(ItemSetitemSet:seq.getItemSetList()){ 
 numArray.add(itemSet.getItems()); 
 } 
  
 for(inti=0;i<totalSequences.size();i++){ 
 timeArray=newArrayList<>(); 
  
 for(ints=0;s<numArray.size();s++){ 
 ArrayList<Integer>childNum=numArray.get(s); 
 ArrayList<Integer>localTime=newArrayList<>(); 
 tempSeq=totalSequences.get(i); 
 itemSetList=tempSeq.getItemSetList(); 
  
 for(intj=0;j<itemSetList.size();j++){ 
 tempItemSet=itemSetList.get(j); 
 matchNum=0; 
 intt=0; 
  
 if(tempItemSet.getItems().size()==childNum.size()){ 
 timeMap=itemNum2Time.get(i).get(j); 
 //只有当项集长度匹配时才匹配 
 for(intk=0;k<childNum.size();k++){ 
 if(timeMap.containsKey(childNum.get(k))){ 
 matchNum++; 
 t=timeMap.get(childNum.get(k)); 
 } 
 } 
  
 //如果完全匹配，则记录时间 
 if(matchNum==childNum.size()){ 
 localTime.add(t); 
 } 
 } 
  
 } 
  
 if(localTime.size()>0){ 
 timeArray.add(localTime); 
 } 
 } 
  
 //判断时间是否满足时间最大最小约束，如果满足，则此条事务包含候选事务 
 if(timeArray.size()==numArray.size() 
 &&judgeTimeInGap(timeArray)){ 
 count++; 
 } 
 } 
  
 returncount; 
 } 
  
 /** 
 *判断事务是否满足时间约束 
 * 
 *@paramtimeArray 
 *时间数组，每行代表各项集的在事务中的发生时间链表 
 *@return 
 */ 
 privatebooleanjudgeTimeInGap(ArrayList<ArrayList<Integer>>timeArray){ 
 booleanresult=false; 
 intpreTime=0; 
 ArrayList<Integer>firstTimes=timeArray.get(0); 
 timeArray.remove(0); 
  
 if(timeArray.size()==0){ 
 returnfalse; 
 } 
  
 for(inti=0;i<firstTimes.size();i++){ 
 preTime=firstTimes.get(i); 
  
 if(dfsJudgeTime(preTime,timeArray)){ 
 result=true; 
 break; 
 } 
 } 
  
 returnresult; 
 } 
  
 /** 
 *深度优先遍历时间，判断是否有符合条件的时间间隔 
 * 
 *@parampreTime 
 *@paramtimeArray 
 *@return 
 */ 
 privatebooleandfsJudgeTime(intpreTime, 
 ArrayList<ArrayList<Integer>>timeArray){ 
 booleanresult=false; 
 ArrayList<ArrayList<Integer>>timeArrayClone=(ArrayList<ArrayList<Integer>>)timeArray 
 .clone(); 
 ArrayList<Integer>firstItemItem=timeArrayClone.get(0); 
  
 for(inti=0;i<firstItemItem.size();i++){ 
 if(firstItemItem.get(i)-preTime>=min_gap 
 &&firstItemItem.get(i)-preTime<=max_gap){ 
 //如果此2项间隔时间满足时间约束，则继续往下递归 
 preTime=firstItemItem.get(i); 
 timeArrayClone.remove(0); 
  
 if(timeArrayClone.size()==0){ 
 returntrue; 
 }else{ 
 result=dfsJudgeTime(preTime,timeArrayClone); 
 if(result){ 
 returntrue; 
 } 
 } 
 } 
 } 
  
 returnresult; 
 } 
  
 /** 
 *初始化序列项到时间的序列图，为了后面的时间约束计算 
 */ 
 privatevoidinitItemNumToTimeMap(){ 
 Sequenceseq; 
 itemNum2Time=newArrayList<>(); 
 HashMap<Integer,Integer>tempMap; 
 ArrayList<HashMap<Integer,Integer>>tempMapList; 
  
 for(inti=0;i<totalSequences.size();i++){ 
 seq=totalSequences.get(i); 
 tempMapList=newArrayList<>(); 
  
 for(intj=0;j<seq.getItemSetList().size();j++){ 
 ItemSetitemSet=seq.getItemSetList().get(j); 
 tempMap=newHashMap<>(); 
 for(intitemNum:itemSet.getItems()){ 
 tempMap.put(itemNum,j+1); 
 } 
  
 tempMapList.add(tempMap); 
 } 
  
 itemNum2Time.add(tempMapList); 
 } 
 } 
  
 /** 
 *进行GSP算法计算 
 */ 
 publicvoidgspCalculate(){ 
 ArrayList<Sequence>oneSeq; 
 ArrayList<Sequence>twoSeq; 
 ArrayList<Sequence>candidateSeq; 
  
 initItemNumToTimeMap(); 
 oneSeq=generateOneFrequencyItem(); 
 twoSeq=generateTwoFrequencyItem(oneSeq); 
 candidateSeq=twoSeq; 
  
 //不断连接生产候选集，直到没有产生出侯选集 
 for(;;){ 
 candidateSeq=generateCandidateItem(candidateSeq); 
  
 if(candidateSeq.size()==0){ 
 break; 
 } 
 } 
  
 outputSeqence(totalFrequencySeqs); 
  
 } 
  
 /** 
 *输出序列列表信息 
 * 
 *@paramoutputSeqList 
 *待输出序列列表 
 */ 
 privatevoidoutputSeqence(ArrayList<Sequence>outputSeqList){ 
 for(Sequenceseq:outputSeqList){ 
 System.out.print("<"); 
 for(ItemSetitemSet:seq.getItemSetList()){ 
 System.out.print("("); 
 for(intnum:itemSet.getItems()){ 
 System.out.print(num+","); 
 } 
 System.out.print("),"); 
 } 
 System.out.println(">"); 
 } 
 } 
  
 } 

调用类Client.java:

[java]view plaincopyprint? 
   
 packageDataMining_GSP; 
  
 /** 
 *GSP序列模式分析算法 
 *@authorlyq 
 * 
 */ 
 publicclassClient{ 
 publicstaticvoidmain(String[]args){ 
 StringfilePath="C:\\Users\\lyq\\Desktop\\icon\\testInput.txt"; 
 //最小支持度阈值 
 intminSupportCount=2; 
 //时间最小间隔 
 intmin_gap=1; 
 //施加最大间隔 
 intmax_gap=5; 
  
 GSPTooltool=newGSPTool(filePath,minSupportCount,min_gap,max_gap); 
 tool.gspCalculate(); 
 } 
 } 

算法的输出(挖掘出的所有频繁模式)：

[java]view plaincopyprint? 
   
 <(1,),> 
 <(2,),> 
 <(3,),> 
 <(4,),> 
 <(5,),> 
 <(1,),(3,),> 
 <(1,),(4,),> 
 <(1,),(5,),> 
 <(2,),(3,),> 
 <(2,),(4,),> 
 <(3,),(4,),> 
 <(3,),(5,),> 
 <(4,),(5,),> 
 <(1,),(3,),(4,),> 
 <(1,),(3,),(5,),> 
 <(2,),(3,),(4,),> 

算法实现的难点

1、算法花费了几天的时间，难点首先在于对算法原理本身的理解，网上对于此算法的资料特别少，而且不同的人所表达的意思都有少许的不同，讲的也不是很详细，于是就通过阅读别人的代码理解GSP算法的原理，我的代码实现也是参考了参考资料的C语言的实现。

2、在实现时间约束的支持度计数统计的时候，调试了一段时间，做时间统计容易出错，因为层级实在太多容易搞晕。

3、还有1个是Sequence和ItemSet的拷贝时的引用问题，在产生新的序列时一定要深拷贝1个否则导致同一引用会把原数据给改掉的。

GSP算法和Apriori算法的比较

我是都实现过了GSP算法和Apriori算法的，后者是被称为关联规则挖掘算法，偏向于挖掘关联规则的，2个算法在连接的操作上有不一样的地方，还有在数据的构成方式上，Apriori的数据会简单一点，都是单项单项构成的，而且在做支持度统计的时候只需判断存在与否即可。不需要考虑时间约束。Apriori算法给定K项集，连接到K-1项集算法就停止了，而GSP算法是直到不能够产生候选集为止。