接下来就要看findLargeItemSets这个函数中的do/while循环了,m_Ls是保存所有项的一个集合,那kMinusOneSets就是上一个项集,接下来是AprioriItemSet.mergeAllItemSets,但是这个函数在执行第一次的时候,它并看不出来什么作用:
public static FastVector mergeAllItemSets(FastVector itemSets, int size,
int totalTrans) {
FastVector newVector = new FastVector();
ItemSet result;
int numFound, k;
for (int i = 0; i < itemSets.size(); i++) {
ItemSet first = (ItemSet) itemSets.elementAt(i);
out: for (int j = i + 1; j < itemSets.size(); j++) {
ItemSet second = (ItemSet) itemSets.elementAt(j);
result = new AprioriItemSet(totalTrans);
result.m_items = new int[first.m_items.length];
// Find and copy common prefix of size 'size'
numFound = 0;
k = 0;
while (numFound < size) {
if (first.m_items[k] == second.m_items[k]) {
if (first.m_items[k] != -1)
numFound++;
result.m_items[k] = first.m_items[k];
} else
break out;
k++;
}
// Check difference
while (k < first.m_items.length) {
if ((first.m_items[k] != -1) && (second.m_items[k] != -1))
break;
else {
if (first.m_items[k] != -1)
result.m_items[k] = first.m_items[k];
else
result.m_items[k] = second.m_items[k];
}
k++;
}
if (k == first.m_items.length) {
result.m_counter = 0;
newVector.addElement(result);
}
}
}
return newVector;
}
这里有一个out标签,也就是break out执行后会跳到out那里接着执行,很象goto。这里的size是我们所需要的多少项的,这里first.m_items[k]要等于second.m_items[k]的原因是在相应的特征上值一定要相同才可以合并成一个新的规则,并且要注意,在注释的那一行中也提到了,就是前面部分一定要一样,大小为size。而在下面就是给result.items赋值余下的部分赋值,这里first.m_items[k]与second.m_items[i]不能同时为-1,因为同时为-1又多出来一项,所以就要去除这种情况。
下面是ItermSet.getHashtable:
public static Hashtable getHashtable(FastVector itemSets, int initialSize) {
Hashtable hashtable = new Hashtable(initialSize);
for (int i = 0; i < itemSets.size(); i++) {
ItemSet current = (ItemSet) itemSets.elementAt(i);
hashtable.put(current, new Integer(current.m_counter));
}
return hashtable;
}
Hashtable的key是一个ItemSet,而值是它的计数。下面是ItemSet.pruneItemSet:
public static FastVector pruneItemSets(FastVector toPrune,
Hashtable kMinusOne) {
FastVector newVector = new FastVector(toPrune.size());
int help, j;
for (int i = 0; i < toPrune.size(); i++) {
ItemSet current = (ItemSet) toPrune.elementAt(i);
for (j = 0; j < current.m_items.length; j++)
if (current.m_items[j] != -1) {
help = current.m_items[j];
current.m_items[j] = -1;
if (kMinusOne.get(current) == null) {
current.m_items[j] = help;
break;
} else {
current.m_items[j] = help;
}
}
if (j == current.m_items.length)
newVector.addElement(current);
}
return newVector;
}
pruneItemSets里面的前面这一段代码,主要是为了让m_item[j]这项为-1来,再用kMinusOne.get,如果为空,那么就break,如果这项中每一个(n-1)个特征值都是存在的,那么才将它放到newVector中去。似乎完全不明白这在做什么,现在看findRulesQuickly:
private void findRulesQuickly() throws Exception {
FastVector[] rules;
// Build rules
for (int j = 1; j < m_Ls.size(); j++) {
FastVector currentItemSets = (FastVector) m_Ls.elementAt(j);
Enumeration enumItemSets = currentItemSets.elements();
while (enumItemSets.hasMoreElements()) {
AprioriItemSet currentItemSet = (AprioriItemSet) enumItemSets
.nextElement();
// AprioriItemSet currentItemSet = new
// AprioriItemSet((ItemSet)enumItemSets.nextElement());
rules = currentItemSet.generateRules(m_minMetric,
m_hashtables,j + 1);
for (int k = 0; k < rules[0].size(); k++) {
m_allTheRules[0].addElement(rules[0].elementAt(k));
m_allTheRules[1].addElement(rules[1].elementAt(k));
m_allTheRules[2].addElement(rules[2].elementAt(k));
}
}
}
}
现在的问题是,如果这是第一次执行,那么我们的m_Ls大小为1,也就是什么都不执行,直接跳过去了。其中的generateRules如下:
public FastVector[] generateRules(double minConfidence,
FastVector hashtables, int numItemsInSet) {
FastVector premises = new FastVector(), consequences = new
FastVector(), conf = new FastVector();
FastVector[] rules = new FastVector[3], moreResults;
AprioriItemSet premise, consequence;
Hashtable hashtable = (Hashtable) hashtables
.elementAt(numItemsInSet - 2);
// Generate all rules with one item in the consequence.
for (int i = 0; i < m_items.length; i++)
if (m_items[i] != -1) {
premise = new AprioriItemSet(m_totalTransactions);
consequence = new AprioriItemSet(m_totalTransactions);
premise.m_items = new int[m_items.length];
consequence.m_items = new int[m_items.length];
consequence.m_counter = m_counter;
for (int j = 0; j < m_items.length; j++)
consequence.m_items[j] = -1;
System.arraycopy(m_items, 0, premise.m_items, 0,
m_items.length);
premise.m_items[i] = -1;
consequence.m_items[i] = m_items[i];
premise.m_counter = ((Integer) hashtable.get(premise))
.intValue();
premises.addElement(premise);
consequences.addElement(consequence);
conf.addElement(new Double(confidenceForRule(premise,
consequence)));
}
rules[0] = premises;
rules[1] = consequences;
rules[2] = conf;
pruneRules(rules, minConfidence);
// Generate all the other rules
moreResults = moreComplexRules(rules, numItemsInSet, 1,
minConfidence,
hashtables);
if (moreResults != null)
for (int i = 0; i < moreResults[0].size(); i++) {
rules[0].addElement(moreResults[0].elementAt(i));
rules[1].addElement(moreResults[1].elementAt(i));
rules[2].addElement(moreResults[2].elementAt(i));
}
return rules;
}
其中比较重要的有两句就是premise.m_items[i]=-1表示我们把premise去掉一项,还将这一项做为consequence,其中的confidenceForRule没什么好讲的,就是公式:
public static double confidenceForRule(AprioriItemSet premise,
AprioriItemSet consequence) {
return (double) consequence.m_counter / (double) premise.m_counter;
}
而rules这里可以看出来,三个元素就是关联规则的左部,右部,和置信度。
public static void pruneRules(FastVector[] rules, double minConfidence) {
FastVector newPremises = new FastVector(rules[0].size()),
newConsequences = new FastVector(rules[1].size()),
newConf = new FastVector(rules[2].size());
for (int i = 0; i < rules[0].size(); i++)
if (!(((Double) rules[2].elementAt(i)).doubleValue() <
minConfidence)) {
newPremises.addElement(rules[0].elementAt(i));
newConsequences.addElement(rules[1].elementAt(i));
newConf.addElement(rules[2].elementAt(i));
}
rules[0] = newPremises;
rules[1] = newConsequences;
rules[2] = newConf;
}
这里对规则进行裁减,如果小于minConfidence就将这个规则删除。
private final FastVector[] moreComplexRules(FastVector[] rules,
int numItemsInSet, int numItemsInConsequence,
double minConfidence,FastVector hashtables) {
AprioriItemSet newPremise;
FastVector[] result, moreResults;
FastVector newConsequences, newPremises = new FastVector(),
newConf = new FastVector();
Hashtable hashtable;
if (numItemsInSet > numItemsInConsequence + 1) {
hashtable = (Hashtable) hashtables.elementAt(numItemsInSet
- numItemsInConsequence - 2);
newConsequences = mergeAllItemSets(rules[1],
numItemsInConsequence - 1, m_totalTransactions);
Enumeration enu = newConsequences.elements();
while (enu.hasMoreElements()) {
AprioriItemSet current = (AprioriItemSet) enu.nextElement();
current.m_counter = m_counter;
newPremise = subtract(current);
newPremise.m_counter = ((Integer) hashtable.get(newPremise))
.intValue();
newPremises.addElement(newPremise);
newConf.addElement(new Double(confidenceForRule(newPremise,
current)));
}
result = new FastVector[3];
result[0] = newPremises;
result[1] = newConsequences;
result[2] = newConf;
pruneRules(result, minConfidence);
moreResults = moreComplexRules(result, numItemsInSet,
numItemsInConsequence + 1, minConfidence, hashtables);
if (moreResults != null)
for (int i = 0; i < moreResults[0].size(); i++) {
result[0].addElement(moreResults[0].elementAt(i));
result[1].addElement(moreResults[1].elementAt(i));
result[2].addElement(moreResults[2].elementAt(i));
}
return result;
} else
return null;
}
这里的mergeAllItemSets把所有numItemInConsequnce个的项得到了,然后对得到的newConsequences进行循环,下面有一个subtract:
public final AprioriItemSet subtract(AprioriItemSet toSubtract) {
AprioriItemSet result = new AprioriItemSet(m_totalTransactions);
result.m_items = new int[m_items.length];
for (int i = 0; i < m_items.length; i++)
if (toSubtract.m_items[i] == -1)
result.m_items[i] = m_items[i];
else
result.m_items[i] = -1;
result.m_counter = 0;
return result;
}
也就是如果这里是当toSubtract中的不为-1的值,将值换为-1,如果toSubtract中相应的值为-1,那么就保留原值。下面有一个递归的过程,就是将numItemsInConsequence+1,也就是Consequence中的项数可以是numItemsInConsequence+1个了。而它的值不能大于numTermsInSet。最后在generateRules把两个结果合并了。
接下来则应该是对规则结果进行排序,先是对support进行排序,再对confidence进行排序,这里之所以要supports[j-i]这样反着写是因为stableSort结果是升序的,stable的意思是稳定,懒地解释了,不懂看一下数据结构吧:
// Sort rules according to their support
int j = m_allTheRules[2].size() - 1;
supports = new double[m_allTheRules[2].size()];
for (int i = 0; i < (j + 1); i++)
supports[j - i] = ((double) ((ItemSet) m_allTheRules[1]
.elementAt(j - i)).support()) * (-1);
indices = Utils.stableSort(supports);
for (int i = 0; i < (j + 1); i++) {
sortedRuleSet[0].addElement(m_allTheRules[0]
.elementAt(indices[j - i]));
sortedRuleSet[1].addElement(m_allTheRules[1]
.elementAt(indices[j - i]));
sortedRuleSet[2].addElement(m_allTheRules[2]
.elementAt(indices[j - i]));
if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
sortedRuleSet[3].addElement(m_allTheRules[3]
.elementAt(indices[j - i]));
sortedRuleSet[4].addElement(m_allTheRules[4]
.elementAt(indices[j - i]));
sortedRuleSet[5].addElement(m_allTheRules[5]
.elementAt(indices[j - i]));
}
}
// Sort rules according to their confidence
m_allTheRules[0].removeAllElements();
m_allTheRules[1].removeAllElements();
m_allTheRules[2].removeAllElements();
if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
m_allTheRules[3].removeAllElements();
m_allTheRules[4].removeAllElements();
m_allTheRules[5].removeAllElements();
}
confidences = new double[sortedRuleSet[2].size()];
int sortType = 2 + m_metricType;
for (int i = 0; i < sortedRuleSet[2].size(); i++)
confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i))
.doubleValue();
indices = Utils.stableSort(confidences);
for (int i = sortedRuleSet[0].size() - 1; (i >= (sortedRuleSet[0]
.size() - m_numRules))
&& (i >= 0); i--) {
m_allTheRules[0].addElement(sortedRuleSet[0]
.elementAt(indices[i]));
m_allTheRules[1].addElement(sortedRuleSet[1]
.elementAt(indices[i]));
m_allTheRules[2].addElement(sortedRuleSet[2]
.elementAt(indices[i]));
if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
m_allTheRules[3].addElement(sortedRuleSet[3]
.elementAt(indices[i]));
m_allTheRules[4].addElement(sortedRuleSet[4]
.elementAt(indices[i]));
m_allTheRules[5].addElement(sortedRuleSet[5]
.elementAt(indices[i]));
}
}
关于对类别规则的分析,我现在用不着,也就没分析。