2021SC@SDUSC
The previous post covered most of Pig's basic data structures. This post looks at variants built on that same architecture.
InternalCachedBag
First up is InternalCachedBag, the internal cached bag.
Its inheritance relationship is as follows:
public class InternalCachedBag extends SelfSpillBag
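UML of the parent class:
The class comment does not say what the class is for, but by following its references I found a test class, TestDataBag, which contains tests for the various bag classes. I have not found an entry point yet and do not know how the whole project is run, so I cannot run the tests directly for now, haha. Its UML is as follows:
None of that matters much, though. What matters is the source of the test function, which exercises the various properties of InternalCachedBag. assertEquals asserts that its two arguments are equal and fails otherwise. JUnit's signature is assertEquals(expected, actual), but this test passes the actual value first, so read the second argument as the value the first one should have.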
@Test
public void testInternalCachedBag() throws Exception {
    // check adding empty tuple
    DataBag bg0 = new InternalCachedBag();
    bg0.add(TupleFactory.getInstance().newTuple());
    bg0.add(TupleFactory.getInstance().newTuple());
    assertEquals(bg0.size(), 2);

    // check equal of bags
    DataBag bg1 = new InternalCachedBag(1, 0.5f);
    assertEquals(bg1.size(), 0);
    String[][] tupleContents = new String[][] {{"a", "b"}, {"c", "d"}, {"e", "f"}};
    for (int i = 0; i < tupleContents.length; i++) {
        bg1.add(Util.createTuple(tupleContents[i]));
    }

    // check size, and isSorted(), isDistinct()
    assertEquals(bg1.size(), 3);
    assertFalse(bg1.isSorted());
    assertFalse(bg1.isDistinct());
    tupleContents = new String[][] {{"c", "d"}, {"a", "b"}, {"e", "f"}};
    DataBag bg2 = new InternalCachedBag(1, 0.5f);
    for (int i = 0; i < tupleContents.length; i++) {
        bg2.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg1, bg2);

    // check bag with data written to disk
    DataBag bg3 = new InternalCachedBag(1, 0.0f);
    tupleContents = new String[][] {{"e", "f"}, {"c", "d"}, {"a", "b"}};
    for (int i = 0; i < tupleContents.length; i++) {
        bg3.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg1, bg3);

    // check iterator
    Iterator<Tuple> iter = bg3.iterator();
    DataBag bg4 = new InternalCachedBag(1, 0.0f);
    while (iter.hasNext()) {
        bg4.add(iter.next());
    }
    assertEquals(bg3, bg4);

    // call iterator methods with irregular order
    iter = bg3.iterator();
    assertTrue(iter.hasNext());
    assertTrue(iter.hasNext());
    DataBag bg5 = new InternalCachedBag(1, 0.0f);
    bg5.add(iter.next());
    bg5.add(iter.next());
    assertTrue(iter.hasNext());
    bg5.add(iter.next());
    assertFalse(iter.hasNext());
    assertFalse(iter.hasNext());
    assertEquals(bg3, bg5);

    bg4.clear();
    assertEquals(bg4.size(), 0);
}
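The assertEquals(bg1, bg2) check above passes even though the two bags received their tuples in a different order. Here is a minimal sketch of that kind of order-insensitive comparison, using plain Java lists of strings as stand-ins for DataBag and Tuple. BagEqualitySketch, bagEquals and compareTuples are hypothetical names chosen for illustration, and this is not Pig's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: a "bag" is a List of tuples, a "tuple" is a List<String>.
class BagEqualitySketch {

    // Lexicographic comparison of two tuples, field by field.
    static int compareTuples(List<String> x, List<String> y) {
        int n = Math.min(x.size(), y.size());
        for (int i = 0; i < n; i++) {
            int c = x.get(i).compareTo(y.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(x.size(), y.size());
    }

    // Two bags are equal if they hold the same tuples with the same
    // multiplicities, regardless of insertion order.
    static boolean bagEquals(List<List<String>> a, List<List<String>> b) {
        if (a.size() != b.size()) {
            return false;
        }
        // Sort copies so the callers' lists are left untouched.
        List<List<String>> sa = new ArrayList<>(a);
        List<List<String>> sb = new ArrayList<>(b);
        sa.sort(BagEqualitySketch::compareTuples);
        sb.sort(BagEqualitySketch::compareTuples);
        return sa.equals(sb);
    }

    public static void main(String[] args) {
        List<List<String>> bg1 = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c", "d"), Arrays.asList("e", "f"));
        List<List<String>> bg2 = Arrays.asList(
                Arrays.asList("c", "d"), Arrays.asList("a", "b"), Arrays.asList("e", "f"));
        System.out.println(bagEquals(bg1, bg2)); // true
    }
}
```

Because duplicates are kept, a bag holding {"a", "b"} twice is not equal to a bag holding it once, which matches isDistinct() returning false for this bag type.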
I forgot to show the constructors earlier, so here they are:
public InternalCachedBag() {
    this(1, -1f);
}

public InternalCachedBag(int bagCount) {
    this(bagCount, -1f);
}

public InternalCachedBag(int bagCount, float percent) {
    super(bagCount, percent);
    init();
}

private void init() {
    factory = TupleFactory.getInstance();
    mContents = new ArrayList<Tuple>();
    addDone = false;
}
These call super, so we also need to look at the SelfSpillBag constructors:
public SelfSpillBag(int bagCount) {
    memLimit = new MemoryLimits(bagCount, -1);
}

public SelfSpillBag(int bagCount, float percent) {
    memLimit = new MemoryLimits(bagCount, percent);
}
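Surprisingly simple. The second parameter is the memory limit, passed as a percent value (the test uses 0.5f and 0.0f). My first guess was that -1 means unlimited, but the InternalDistinctBag constructor shown later in this post replaces a negative percent with 0.2 or the configured PIG_CACHEDBAG_MEMUSAGE value, so -1 more likely means "use the default limit". The remaining properties can be read from the source, so I won't list them one by one. In short, this bag is a spillable bag whose in-memory size can be capped.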
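To make the role of bagCount and percent concrete, here is a minimal sketch of how such a pair could be turned into a per-bag byte budget. The internals of MemoryLimits are not shown in this post, so the formula (a fraction of the heap divided evenly among the bags) is an assumption, and MemoryBudgetSketch and budgetBytes are hypothetical names. The 0.2 fallback for a negative percent is borrowed from the InternalDistinctBag constructor quoted in this post:

```java
// Hypothetical sketch, not Pig's MemoryLimits class.
class MemoryBudgetSketch {

    // Fallback fraction used when percent is negative (an assumption here,
    // mirroring the 0.2F seen in the InternalDistinctBag constructor).
    static final float DEFAULT_PERCENT = 0.2f;

    // Bytes one bag may keep in memory before it should spill.
    static long budgetBytes(long maxHeapBytes, int bagCount, float percent) {
        if (bagCount < 1) {
            throw new IllegalArgumentException("bagCount must be at least 1");
        }
        float effective = percent < 0 ? DEFAULT_PERCENT : percent;
        // Assumed formula: share a fraction of the heap evenly among the bags.
        return (long) ((double) effective * maxHeapBytes / bagCount);
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory();
        System.out.println("percent 0.5, 1 bag : " + budgetBytes(heap, 1, 0.5f));
        System.out.println("percent 0.0, 1 bag : " + budgetBytes(heap, 1, 0.0f));
        System.out.println("percent -1,  1 bag : " + budgetBytes(heap, 1, -1f));
    }
}
```

Under this reading, the 0.0f used for bg3 in the test gives a budget of zero bytes, which would explain why that section is labelled "check bag with data written to disk".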
InternalDistinctBag
Next, let's analyze InternalDistinctBag, the internal distinct bag.
Inheritance relationship:
public class InternalDistinctBag extends SortedSpillBag
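UML of the parent class:
Here is the class comment:
/**
 * An unordered collection of Tuples with no multiples. Data is stored without duplicates as it comes in.
 * When it is time to spill, that data is sorted and written to disk.
 * The data is stored in a HashSet. When it is time to sort, it is placed in an ArrayList and then sorted.
 * Despite all these machinations, this was found to be faster than storing it in a TreeSet.
 * This bag spills proactively when the number of tuples in memory reaches a limit.
 */
The comment already describes the basic properties fairly clearly, so next let's find the corresponding test function in TestDataBag: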
@Test
public void testInternalDistinctBag() throws Exception {
    // check adding empty tuple
    DataBag bg0 = new InternalDistinctBag();
    bg0.add(TupleFactory.getInstance().newTuple());
    bg0.add(TupleFactory.getInstance().newTuple());
    assertEquals(bg0.size(), 1); // both tuples are empty and therefore equal, so the bag keeps only one

    // check equal of bags
    DataBag bg1 = new InternalDistinctBag();
    assertEquals(bg1.size(), 0);
    String[][] tupleContents = new String[][] {{"e", "f"}, {"a", "b"}, {"e", "d"}, {"a", "b"}, {"e", "f"}};
    for (int i = 0; i < tupleContents.length; i++) {
        bg1.add(Util.createTuple(tupleContents[i]));
    }

    // check size, and isSorted(), isDistinct()
    assertEquals(bg1.size(), 3);
    assertFalse(bg1.isSorted());
    assertTrue(bg1.isDistinct());
    tupleContents = new String[][] {{"a", "b"}, {"e", "d"}, {"e", "d"}, {"e", "f"}};
    DataBag bg2 = new InternalDistinctBag();
    for (int i = 0; i < tupleContents.length; i++) {
        bg2.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg1, bg2); // like a set, insertion order does not affect equality

    // note: the results of these equals() calls are discarded, so nothing is asserted here
    Iterator<Tuple> iter = bg1.iterator();
    iter.next().equals(Util.createTuple(new String[] {"a", "b"}));
    iter.next().equals(Util.createTuple(new String[] {"c", "d"}));
    iter.next().equals(Util.createTuple(new String[] {"e", "f"}));

    // check bag with data written to disk
    DataBag bg3 = new InternalDistinctBag(1, 0.0f);
    tupleContents = new String[][] {{"e", "f"}, {"a", "b"}, {"e", "d"}, {"a", "b"}, {"e", "f"}};
    for (int i = 0; i < tupleContents.length; i++) {
        bg3.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg2, bg3);
    assertEquals(bg3.size(), 3);

    // call iterator methods with irregular order
    iter = bg3.iterator();
    assertTrue(iter.hasNext());
    assertTrue(iter.hasNext());
    DataBag bg4 = new InternalDistinctBag(1, 0.0f); // the familiar memory-limit arguments again
    bg4.add(iter.next());
    bg4.add(iter.next());
    assertTrue(iter.hasNext());
    bg4.add(iter.next());
    assertFalse(iter.hasNext());
    assertFalse(iter.hasNext());
    assertEquals(bg3, bg4);

    // check clear
    bg3.clear();
    assertEquals(bg3.size(), 0);

    // test with all data spilled
    DataBag bg5 = new InternalDistinctBag();
    for (int j = 0; j < 3; j++) {
        for (int i = 0; i < tupleContents.length; i++) {
            bg5.add(Util.createTuple(tupleContents[i]));
        }
        bg5.spill();
    }
    assertEquals(bg5.size(), 3);

    // test with most data spilled, some data still in memory, and a merge of spill files
    DataBag bg6 = new InternalDistinctBag();
    for (int j = 0; j < 104; j++) {
        for (int i = 0; i < tupleContents.length; i++) {
            bg6.add(Util.createTuple(tupleContents[i]));
        }
        if (j != 103) {
            bg6.spill();
        }
    }
    assertEquals(bg6.size(), 3);

    // check that two implementations of sorted bag can compare correctly
    DataBag bg7 = new DistinctDataBag();
    for (int j = 0; j < 104; j++) {
        for (int i = 0; i < tupleContents.length; i++) {
            bg7.add(Util.createTuple(tupleContents[i]));
        }
        if (j != 103) {
            bg7.spill();
        }
    }
    assertEquals(bg6, bg7);
}
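The main properties can be read straight from the source. To summarize: InternalDistinctBag is a collection of distinct tuples stored in a HashSet. It is not a sorted bag (the test asserts that isSorted() is false); the data is sorted only when it is spilled. A memory limit can be specified, and the bag spills to disk once that limit is exceeded.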
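The storage scheme described in the class comment (a HashSet in memory, sorted into an ArrayList at spill time) can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Pig's code: tuples are plain strings, spill "files" are in-memory lists, the limit is a tuple count, and the names DistinctSpillSketch, spilledRuns and contents are hypothetical. A real implementation would merge the sorted runs from disk rather than rebuild a TreeSet:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a distinct bag that spills sorted runs.
class DistinctSpillSketch {
    private final Set<String> inMemory = new HashSet<>();
    // Stand-in for spill files on disk: each entry is one sorted run.
    private final List<List<String>> spilledRuns = new ArrayList<>();
    private final int maxInMemory;

    DistinctSpillSketch(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    void add(String tuple) {
        inMemory.add(tuple); // the HashSet drops duplicates on arrival
        if (inMemory.size() >= maxInMemory) {
            spill(); // proactive spill once the in-memory limit is reached
        }
    }

    void spill() {
        if (inMemory.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(inMemory); // copy to a list...
        Collections.sort(run);                         // ...and sort it
        spilledRuns.add(run);
        inMemory.clear();
    }

    // Distinct tuples across memory and all spilled runs, in sorted order.
    List<String> contents() {
        TreeSet<String> merged = new TreeSet<>(inMemory);
        for (List<String> run : spilledRuns) {
            merged.addAll(run);
        }
        return new ArrayList<>(merged);
    }

    int size() {
        return contents().size();
    }

    public static void main(String[] args) {
        String[] tuples = {"e,f", "a,b", "e,d", "a,b", "e,f"};
        DistinctSpillSketch bag = new DistinctSpillSketch(Integer.MAX_VALUE);
        for (int j = 0; j < 104; j++) {
            for (String t : tuples) {
                bag.add(t);
            }
            if (j != 103) {
                bag.spill(); // leave the last round in memory, like bg6 in the test
            }
        }
        System.out.println(bag.size()); // 3
    }
}
```

The main method mirrors the bg6 case from the test: the same five tuples are added 104 times with a spill after every round but the last, and the size stays at three because duplicates are removed both within a run and across runs.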
As usual, here are the constructors:
public InternalDistinctBag() {
    this(1, -1.0f);
}

public InternalDistinctBag(int bagCount) {
    this(bagCount, -1.0f);
}

public InternalDistinctBag(int bagCount, float percent) {
    super(bagCount, percent);
    if (percent < 0) {
        percent = 0.2F;
        if (PigMapReduce.sJobConfInternal.get() != null) {
            String usage = PigMapReduce.sJobConfInternal.get().get(PigConfiguration.PIG_CACHEDBAG_MEMUSAGE);
            if (usage != null) {
                percent = Float.parseFloat(usage);
            }
        }
    }
    init(bagCount, percent);
}

private void init(int bagCount, double percent) {
    mContents = new HashSet<Tuple>();
}
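The parent class is used again, so here is the parent's constructor:
SortedSpillBag(int bagCount, float percent) {
    super(bagCount, percent);
}
Well, the parent calls its own parent in turn, so let's look at the inheritance relationship: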
public abstract class SortedSpillBag extends SelfSpillBag
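Now it is clear: SelfSpillBag is the same class that InternalCachedBag extends, and its source was shown earlier, so it is easy to see why this class can also limit its memory usage.
That's all for this post. The next one will continue with other bag variants.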