[Software Engineering Practice] Pig Project 11: Data Directory Source Code Analysis, Other Tuples

Article link: https://blog.csdn.net/Aulic/article/details/121742574

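This article covers two bags in the Data directory of the Pig project, ReadOnceBag and SingleTupleBag. ReadOnceBag is a DataBag that stores no tuples itself but reads them through an iterator typically supplied by Hadoop, which suits cases where copying tuples should be avoided. SingleTupleBag is an efficient implementation that holds exactly one tuple and is used in places such as POPreCombinerLocalRearrange. The article also introduces AmendableTuple, a tuple class that carries an identifier (amendKey) which can be changed after construction.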


2021SC@SDUSC

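A quick recap of the previous two posts

In "Other bags 1" we covered InternalCachedBag and InternalDistinctBag, along with their parent classes SelfSpillBag and SortedSpillBag.

In "Other bags 2" we covered InternalSortedBag and LimitedSortedBag, whose parent types are SortedSpillBag and DataBag.

Two other bags remain. Let's finish them before starting the tuple analysis.

The remaining bags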

ReadOnceBag

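Class comment

/**
* This bag does not store the tuples in memory, but has access to an iterator typically provided by Hadoop.
* Use this when you already have an iterator over tuples and do not want to copy them over again to a new bag.
*/

Inheritance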

public class ReadOnceBag implements DataBag

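Constructors

public ReadOnceBag() {
    }

    /**
     * This constructor creates a bag out of an existing iterator of tuples by taking
     * ownership of the iterator and NOT copying the elements of the iterator.
     * @param pkgr Packager
     * @param tupIter Iterator<NullableTuple>
     * @param keyWritable PigNullableWritable
     */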
    public ReadOnceBag(Packager pkgr, Iterator<NullableTuple> tupIter,
            PigNullableWritable keyWritable) {
        this.pkgr = pkgr;
        this.tupIter = tupIter;
        this.keyWritable = keyWritable;
    }

Key fields

// The Packager that created this bag
    protected Packager pkgr;

    // The iterator of tuples. Marked transient because we will never serialize it
    protected transient Iterator<NullableTuple> tupIter;

    // The key being worked on
    protected PigNullableWritable keyWritable;

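UML diagram

Next, let's look at the test functions.

(Unfortunately, TestDataBag does not test this class.)

In short, this is a special-purpose DataBag: "use it when you already have an iterator over tuples and do not want to copy them".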
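To make the "wrap the iterator, do not copy" idea concrete, here is a minimal sketch in plain Java. It does not use any Pig classes: `ReadOnceSketch` and `drain` are hypothetical names invented for illustration, and the real ReadOnceBag may differ in detail (for example, it goes through a Packager). The sketch only shows the consequence of owning an iterator, namely that the contents can be walked once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for ReadOnceBag (not a Pig class):
// it keeps a reference to an existing iterator and never copies the elements.
class ReadOnceSketch<T> implements Iterable<T> {
    // Supplied from outside, e.g. by the framework; this object takes ownership of it.
    private final Iterator<T> source;

    ReadOnceSketch(Iterator<T> source) {
        this.source = source;
    }

    // Always hands back the same underlying iterator, so a second pass sees nothing.
    @Override
    public Iterator<T> iterator() {
        return source;
    }

    // Helper for the demo only: walks the bag and collects what it yields.
    static <T> List<T> drain(ReadOnceSketch<T> bag) {
        List<T> out = new ArrayList<>();
        for (T t : bag) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> it = Arrays.asList("a", "b", "c").iterator();
        ReadOnceSketch<String> bag = new ReadOnceSketch<>(it);
        System.out.println(drain(bag)); // first pass: [a, b, c]
        System.out.println(drain(bag)); // second pass: []
    }
}
```

The trade-off is the one the class comment hints at: no memory is spent on a copy, but the data cannot be revisited.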

SingleTupleBag

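As the name suggests, this is a bag that holds exactly one tuple.

Class comment

/**
* A simple performant implementation of the DataBag interface which only holds a single tuple.
* This will be used from POPreCombinerLocalRearrange and wherever else a single Tuple non-serializable DataBag is required.
*/

Inheritance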

public class SingleTupleBag implements DataBag

Constructor and key fields

private static final long serialVersionUID = 1L;
    Tuple item;

    public SingleTupleBag(Tuple t) {
        item = t;
    }
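
Because the bag holds a single item, iterating over it needs no backing collection at all. The following is a minimal sketch of that idea in plain Java. `SingleItemBagSketch` is a hypothetical name, and the code is not taken from Pig's SingleTupleBag.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a bag that holds exactly one item (not a Pig class).
class SingleItemBagSketch<T> implements Iterable<T> {
    private final T item;

    SingleItemBagSketch(T item) {
        this.item = item;
    }

    // The size is fixed at one by construction.
    long size() {
        return 1;
    }

    // A fresh one-shot iterator per call: it yields the item once, then ends.
    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private boolean returned = false;

            @Override
            public boolean hasNext() {
                return !returned;
            }

            @Override
            public T next() {
                if (returned) {
                    throw new NoSuchElementException();
                }
                returned = true;
                return item;
            }
        };
    }

    public static void main(String[] args) {
        SingleItemBagSketch<String> bag = new SingleItemBagSketch<>("foo");
        for (String s : bag) {
            System.out.println(s); // prints "foo" once
        }
    }
}
```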

UML diagram

Test function

// See PIG-1285
@Test
public void testSerializeSingleTupleBag() throws Exception {
    Tuple t = Util.createTuple(new String[] {"foo", "bar", "baz"});
    DataBag stBag = new SingleTupleBag(t);
    PipedOutputStream pos = new PipedOutputStream();
    DataOutputStream dos = new DataOutputStream(pos);
    PipedInputStream pis = new PipedInputStream(pos);
    DataInputStream dis = new DataInputStream(pis);
    stBag.write(dos);
    DataBag dfBag = new DefaultDataBag();
    dfBag.readFields(dis);
    assertTrue(dfBag.equals(stBag));
}

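As the test shows, a SingleTupleBag can be serialized with write() and read back into a DefaultDataBag with readFields(); the resulting bag compares equal to the original, so the two are equivalent.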
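
The same write-then-read pattern can be sketched without Pig. In the example below everything is an assumption made for illustration: the class `BagRoundTripSketch`, its methods, and the wire format (a tuple count followed by each tuple's field count and UTF strings) are invented here and are not Pig's actual serialization. It also uses in-memory byte arrays instead of the piped streams from the test.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a bag is modelled as a list of tuples, a tuple as a list of strings.
class BagRoundTripSketch {
    // Invented format: tuple count, then for each tuple its field count and the fields.
    static void writeBag(List<List<String>> bag, DataOutputStream out) throws IOException {
        out.writeLong(bag.size());
        for (List<String> tuple : bag) {
            out.writeInt(tuple.size());
            for (String field : tuple) {
                out.writeUTF(field);
            }
        }
    }

    // Reads the format above into a general-purpose list, whatever wrote it.
    static List<List<String>> readBag(DataInputStream in) throws IOException {
        long tuples = in.readLong();
        List<List<String>> bag = new ArrayList<>();
        for (long i = 0; i < tuples; i++) {
            int fields = in.readInt();
            List<String> tuple = new ArrayList<>(fields);
            for (int j = 0; j < fields; j++) {
                tuple.add(in.readUTF());
            }
            bag.add(tuple);
        }
        return bag;
    }

    // Serializes to a byte array and deserializes again.
    static List<List<String>> roundTrip(List<List<String>> bag) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeBag(bag, new DataOutputStream(bytes));
            return readBag(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A single-tuple "bag" goes in, a general list-backed "bag" comes out.
        List<List<String>> single = Collections.singletonList(Arrays.asList("foo", "bar", "baz"));
        System.out.println(roundTrip(single).equals(single)); // true
    }
}
```

The point mirrors the test: as long as the specialized bag writes the same format the general bag reads, the reader does not need to know which implementation produced the bytes.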

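With that, the remaining bags are finally done!

Other tuples

First, let's see which tuples exist.

According to the class diagram, apart from the tuples we have already seen, the remaining ones are: AmendableTuple, AppendableSchemaTuple, BinSedesTuple, NonWritableTuple, SchemaTuple, TargetedTuple, and TimestampedTuple.

AmendableTuple

This class is very short, so here is the complete source.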

public class AmendableTuple extends DefaultTuple {
    /**
     *
     */
    private static final long serialVersionUID = 2L;
    Object amendKey;       // Identifier of the group this tuple belongs to.

    public AmendableTuple(int numFields, Object amendKey) {
        super(numFields);
        this.amendKey = amendKey;
    }

    public Object getAmendKey() {
        return amendKey;
    }
    public void setAmendKey(Object amendKey) {
        this.amendKey = amendKey;
    }

}
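
Since amendKey identifies the group a tuple belongs to, one conceivable use is bucketing tuples by that key. The sketch below illustrates this in plain Java. `AmendKeySketch`, `KeyedTuple`, and `groupByKey` are hypothetical names, and this is not how Pig itself is known to use AmendableTuple; it is only an illustration of a tuple that carries a changeable group identifier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AmendKeySketch {
    // Hypothetical stand-in for AmendableTuple: null-filled fields plus a mutable group key.
    static class KeyedTuple {
        final List<Object> fields;
        Object amendKey;

        KeyedTuple(int numFields, Object amendKey) {
            fields = new ArrayList<>(numFields);
            for (int i = 0; i < numFields; i++) {
                fields.add(null);
            }
            this.amendKey = amendKey;
        }
    }

    // Buckets tuples by their current amendKey, keeping first-seen key order.
    static Map<Object, List<KeyedTuple>> groupByKey(List<KeyedTuple> tuples) {
        Map<Object, List<KeyedTuple>> groups = new LinkedHashMap<>();
        for (KeyedTuple t : tuples) {
            groups.computeIfAbsent(t.amendKey, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<KeyedTuple> tuples = new ArrayList<>();
        tuples.add(new KeyedTuple(2, "g1"));
        tuples.add(new KeyedTuple(2, "g2"));
        tuples.add(new KeyedTuple(2, "g1"));
        System.out.println(groupByKey(tuples).keySet()); // [g1, g2]

        // The key is amendable: moving a tuple to another group is a plain field update.
        tuples.get(1).amendKey = "g1";
        System.out.println(groupByKey(tuples).keySet()); // [g1]
    }
}
```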

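UML diagram

DefaultTuple was covered a long time ago; for convenience, here is a quick recap.

Class comment

/**
* A default implementation of Tuple. This class will be created by the DefaultTupleFactory.
*/

Inheritance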

public class DefaultTuple extends AbstractTuple

Key fields

private static final long serialVersionUID = 2L;
protected List<Object> mFields;

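Constructors

/**
     * Default constructor. This constructor is public so that hadoop can call it directly. However, inside pig you
     * should never call this function. Use TupleFactory instead. <br>Time complexity: O(1), after allocation
     */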
    public DefaultTuple() {
        mFields = new ArrayList<Object>();
    }

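    /**
     * Construct a tuple with a known number of fields. Package level so that callers cannot invoke it directly.
     * <br>The resulting tuple is pre-filled with null elements. Time complexity: O(N), after allocation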
     * @param size
     *            Number of fields to allocate in the tuple.
     */
    DefaultTuple(int size) {
        mFields = new ArrayList<Object>(size);
        for (int i = 0; i < size; i++)
            mFields.add(null);
    }

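    /**
     * Construct a tuple from an existing list of objects. Package level so that callers cannot invoke it directly.
     * <br>Time complexity: O(N) plus the running time of iterating over the input objects, after allocation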
     * @param c
     *            List of objects to turn into a tuple.
     */
    DefaultTuple(List<Object> c) {
        mFields = new ArrayList<Object>(c);
    }

    /**
     * Construct a tuple from an existing list of objects. Package level so that callers cannot invoke it directly. <br>Time complexity: O(1)
     *
     * @param c
     *            List of objects to turn into a tuple. This list will be kept as part of the tuple.
     * @param junk
     *            Just used to differentiate from the constructor above that copies the list.
     */
    DefaultTuple(List<Object> c, int junk) {
        mFields = c;
    }
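
The last two constructors differ only in whether the incoming list is copied. The sketch below reproduces that difference outside Pig. `TupleCtorSketch` is a hypothetical class whose two constructors mirror the bodies shown above (`new ArrayList<Object>(c)` versus `mFields = c`); it is meant to show why the copying version is O(N) and isolated, while the non-copying version is O(1) but shares state with the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the two list-based DefaultTuple constructors.
class TupleCtorSketch {
    final List<Object> fields;

    // Copying variant: O(N), later changes to the caller's list are not visible.
    TupleCtorSketch(List<Object> c) {
        fields = new ArrayList<Object>(c);
    }

    // Non-copying variant: O(1), the caller's list becomes the tuple's storage.
    // The junk parameter only distinguishes this overload from the one above.
    TupleCtorSketch(List<Object> c, int junk) {
        fields = c;
    }

    public static void main(String[] args) {
        List<Object> src = new ArrayList<>();
        src.add("a");
        src.add("b");

        TupleCtorSketch copied = new TupleCtorSketch(src);
        TupleCtorSketch shared = new TupleCtorSketch(src, 0);

        src.set(0, "changed");
        System.out.println(copied.fields.get(0)); // a
        System.out.println(shared.fields.get(0)); // changed
    }
}
```

The non-copying form saves an allocation and a pass over the data, at the price that the caller must stop using the list after handing it over.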