[Software Engineering Practice] Pig Project 2: Source Analysis of the Data Directory - Tuple

苏伊士运河的小挖土机

Category: Pig. Tags: java

Article link: https://blog.csdn.net/Aulic/article/details/120690730

2021SC@SDUSC

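The files in the Data directory are listed below

There are many files, so we first look at Pig's data structures before analyzing them. One key concept here is the data model.

Related reading: [Pig Source Analysis] A Look at Pig's Data Model (Database section, 火龙果软件工程)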

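Data model: Schema

A schema is the type format that the data conforms to. It has two parts: the name and the type of each field.

A field is a piece of data; think of it as a single data field of a record.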

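The relationship between Schema and Pig Latin

Pig Latin expressions operate on relations. The relation that relational operators such as FILTER, FOREACH, GROUP, and SPLIT work on is a bag; a bag is a collection of tuples, and a tuple is an ordered list of fields.

The schema therefore describes the unit that Pig Latin expressions operate on.

Users typically define a schema themselves with an as clause, or let the load function supply it, for example: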

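If a field's type is not specified, it defaults to bytearray. When operating on an unknown schema, the following rules apply:

If a multi-relation operation (join/cogroup/cross) encounters an unknown schema, it treats it as a null schema, so the schema of the result is also null;

If you flatten a bag with an empty inner schema (i.e. bag{}), the schema of the result is null;

If the two relations in a union have different schemas, the schema of the result is null;

If a field's schema is null, the field is treated as bytearray.

To keep Pig scripts working reliably, a UDF should specify the schema of its result in its outputSchema method.

Note: UDF stands for user-defined function.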

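Data types

Pig's simple data types and their corresponding Java classes:

Complex data types and their corresponding Java classes:

Note: this helps in understanding Schema further. Values such as 'hello' and 18 are fields; an ordered collection of fields is a tuple, such as (18,1); a bag is a collection of tuples, such as {('hello'),(18,1)}; and the relation that certain operators work on is a bag.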
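
To make the field / tuple / bag nesting concrete, here is a minimal sketch in plain Java. It uses ordinary lists as stand-ins and does not use Pig's own Tuple or DataBag classes; the class and method names (DataModelSketch, tuple, bag) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins for the data model; these are NOT Pig's Tuple/DataBag classes.
class DataModelSketch {

    // A tuple is an ordered list of fields; a field is any single value.
    static List<Object> tuple(Object... fields) {
        return Arrays.asList(fields);
    }

    // A bag is a collection of tuples; tuples in the same bag may differ in length.
    @SafeVarargs
    static List<List<Object>> bag(List<Object>... tuples) {
        return Arrays.asList(tuples);
    }

    public static void main(String[] args) {
        // Mirrors the example {('hello'),(18,1)} from the note above.
        List<List<Object>> relation = bag(tuple("hello"), tuple(18, 1));
        System.out.println(relation); // prints [[hello], [18, 1]]
    }
}
```

The point of the sketch is only the nesting: values inside lists inside a list. Pig's real classes add typing, serialization, and spilling on top of this shape.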

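Tuple source analysis

Searching the Data directory for file names containing Tuple:

Let's follow the article mentioned above:

In the KEYSET source, Tuple objects are created using the factory + singleton design patterns: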

private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();
Tuple t = TUPLE_FACTORY.newTuple(s);

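Note: the location of keyset.java and its code are as follows

Here Map is a standard Java interface (java.util.Map); see the blog post below for its usage. For our purposes it is enough to know that the code iterates over it.

Java Notes: Usage of Map (Linias's blog, CSDN)

As you can see, a new bag is created with new NonSpillableDataBag(m.size()), and a new tuple is created with the two lines of code shown earlier.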
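
The flow just described (a bag sized from the map, one single-field tuple per key) can be sketched without any Pig dependency. This is an assumption-laden model, not the KEYSET source: lists stand in for Pig's bag and tuple types, and the names KeysetSketch and keysToBag are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Models the described flow with plain lists; this is not the KEYSET source.
class KeysetSketch {

    // Wraps every key of the map in a single-field tuple and collects the tuples in a bag.
    static List<List<Object>> keysToBag(Map<String, Object> m) {
        // Presized from the map, in the spirit of new NonSpillableDataBag(m.size()).
        List<List<Object>> bag = new ArrayList<>(m.size());
        for (String key : m.keySet()) {
            List<Object> tuple = new ArrayList<>(1); // stand-in for TUPLE_FACTORY.newTuple(s)
            tuple.add(key);
            bag.add(tuple);
        }
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Object> m = new LinkedHashMap<>(); // keeps insertion order
        m.put("name", "pig");
        m.put("age", 18);
        System.out.println(keysToBag(m)); // prints [[name], [age]]
    }
}
```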

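Continuing with the earlier article:

In fact, TupleFactory is an abstract class that implements the interface TupleMaker<Tuple>. By default, TupleFactory.getInstance() returns a BinSedesTupleFactory object; it also supports loading a user-supplied TupleFactory class (pig.data.tuple.factory.name specifies the class name, and pig.data.tuple.factory.jar specifies the jar containing the class). BinSedesTupleFactory extends TupleFactory:

The newTuple methods of BinSedesTupleFactory return BinSedesTuple objects. BinSedesTuple extends DefaultTuple, and DefaultTuple has a List<Object> mFields field, which is where the tuple's data is stored; mFields holds an ArrayList<Object>. Class diagram:

Note: that article already lays out the overall structure of Tuple very clearly, so we go on to look at some details.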

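TupleFactory.java begins with the following comment

In short: this is a factory for constructing tuples. The class is abstract so that users can override the tuple factory to return their own tuple implementation. If the property pig.data.tuple.factory.name is set to a class name and pig.data.tuple.factory.jar is set to a URL pointing to a jar that contains that class, getInstance() creates an instance of the named class using that jar. Otherwise it creates an instance of DefaultTupleFactory. Note that the article quoted earlier names BinSedesTupleFactory as the default, while this comment names DefaultTupleFactory.

Original text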

/**
 * A factory to construct tuples.  This class is abstract so that users can
 * override the tuple factory if they desire to provide their own that
 * returns their implementation of a tuple.  If the property
 * pig.data.tuple.factory.name is set to a class name and
 * pig.data.tuple.factory.jar is set to a URL pointing to a jar that
 * contains the above named class, then {@link #getInstance()} will create a
 * an instance of the named class using the indicated jar.  Otherwise, it
 * will create an instance of {@link DefaultTupleFactory}.
 */
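
The behaviour this comment describes (an abstract factory, a lazily created singleton, and an optional override chosen by a property) can be sketched as follows. This is a simplified model, not Pig's implementation: every name is hypothetical, the property sketch.tuple.factory.name merely stands in for pig.data.tuple.factory.name, the class is looked up on the current classpath rather than in a separate jar, and tuples are modelled as plain lists.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of an abstract factory with a singleton accessor.
// All names are hypothetical; this is not Pig's TupleFactory code.
class FactorySketch {

    abstract static class SketchTupleFactory {
        private static SketchTupleFactory self; // the single shared instance

        static synchronized SketchTupleFactory getInstance() {
            if (self == null) {
                // Hypothetical property, standing in for pig.data.tuple.factory.name.
                String name = System.getProperty("sketch.tuple.factory.name");
                self = (name == null) ? new DefaultSketchFactory() : load(name);
            }
            return self;
        }

        // Instantiates a user-supplied factory by class name (assumed to be on the classpath).
        private static SketchTupleFactory load(String className) {
            try {
                return (SketchTupleFactory) Class.forName(className)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load factory " + className, e);
            }
        }

        // Subclasses decide how a tuple is actually built.
        abstract List<Object> newTuple(Object datum);
    }

    static class DefaultSketchFactory extends SketchTupleFactory {
        @Override
        List<Object> newTuple(Object datum) {
            List<Object> fields = new ArrayList<>(1);
            fields.add(datum);
            return fields;
        }
    }

    public static void main(String[] args) {
        SketchTupleFactory f = SketchTupleFactory.getInstance();
        // Repeated calls hand back the same object.
        System.out.println(f == SketchTupleFactory.getInstance());
        System.out.println(f.newTuple("key"));
    }
}
```

Callers only ever see the abstract type, so swapping the concrete factory through the property changes what kind of tuple is produced without touching call sites.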

Functions defined in TupleFactory:

getInstance() returns a reference to the singleton factory used to construct tuples.

/**
 * Get a reference to the singleton factory.
 * @return The TupleFactory to use to construct tuples.
 */
public static TupleFactory getInstance()

newTuple() creates an empty tuple. It should be used as rarely as possible; use newTuple(int) instead.

/**
 * Create an empty tuple.  This should be used as infrequently as
 * possible, use newTuple(int) instead.
 * @return Empty new tuple.
 */
public abstract Tuple newTuple();

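newTuple(int size) creates a tuple with size fields. It is preferred over the no-argument version whenever possible, because the constructor can preallocate the container that holds the fields. After this call it is legal to call Tuple.set(x, object) for any x < size.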

/**
 * Create a tuple with size fields.  Whenever possible this is preferred
 * over the null constructor, as the constructor can preallocate the
 * size of the container holding the fields.  Once this is called, it
 * is legal to call Tuple.set(x, object), where x &lt; size.
 * @param size Number of fields in the tuple.
 * @return Tuple with size fields
 */
public abstract Tuple newTuple(int size);
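
Why the sized version makes set(x, object) legal can be shown with a small model. The sketch below assumes the fields live in a List<Object>, as the article describes for DefaultTuple, and that a sized tuple is pre-filled with nulls; SketchTuple is a hypothetical class, not Pig's.

```java
import java.util.ArrayList;
import java.util.List;

class SizedTupleSketch {

    // Hypothetical minimal tuple whose fields are stored in a List<Object>.
    static class SketchTuple {
        private final List<Object> fields;

        // Empty tuple: no slots exist yet, so set(0, ...) would fail.
        SketchTuple() {
            fields = new ArrayList<>();
        }

        // Sized tuple: capacity is reserved and every slot is pre-filled with null.
        SketchTuple(int size) {
            fields = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                fields.add(null);
            }
        }

        // Valid only for 0 <= index < size(); otherwise IndexOutOfBoundsException.
        void set(int index, Object value) {
            fields.set(index, value);
        }

        Object get(int index) {
            return fields.get(index);
        }

        int size() {
            return fields.size();
        }
    }

    public static void main(String[] args) {
        SketchTuple sized = new SketchTuple(2);
        sized.set(1, "pig");                 // fine: 1 < 2
        System.out.println(sized.get(1));    // prints pig

        try {
            new SketchTuple().set(0, "pig"); // no slot 0 in an empty tuple
        } catch (IndexOutOfBoundsException e) {
            System.out.println("empty tuple has no slot to set");
        }
    }
}
```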

newTuple(List c) creates a tuple from the given list of objects. The underlying list is copied.
/**
 * Create a tuple from the provided list of objects.  The underlying list
 * will be copied.
 * @param c List of objects to use as the fields of the tuple.
 * @return A tuple with the list objects as its fields
 */
public abstract Tuple newTuple(List c);

newTupleNoCopy(List list) creates a tuple from the given list of objects and keeps that list: the new tuple takes over ownership of it.
/**
 * Create a tuple from a provided list of objects, keeping the provided
 * list.  The new tuple will take over ownership of the provided list.
 * @param list List of objects that will become the fields of the tuple.
 * @return A tuple with the list objects as its fields
 */
public abstract Tuple newTupleNoCopy(List list);
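
The practical difference between copying the list and taking ownership of it shows up when the caller later modifies the original list. A minimal sketch, with tuples modelled as plain lists and hypothetical static helpers that only borrow the method names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models copy vs. no-copy construction with plain lists; not Pig code.
class CopySemanticsSketch {

    // Copying variant: the tuple gets its own list.
    static List<Object> newTuple(List<Object> c) {
        return new ArrayList<>(c);
    }

    // No-copy variant: the tuple IS the caller's list from now on.
    static List<Object> newTupleNoCopy(List<Object> list) {
        return list;
    }

    public static void main(String[] args) {
        List<Object> source = new ArrayList<>(Arrays.<Object>asList("a", 1));

        List<Object> copied = newTuple(source);
        List<Object> shared = newTupleNoCopy(source);

        source.set(0, "changed");

        System.out.println(copied.get(0)); // prints a
        System.out.println(shared.get(0)); // prints changed
    }
}
```

Skipping the copy saves an allocation, but the caller must stop using the list afterwards, which is what "takes over ownership" implies.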

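newTuple(Object datum) creates a tuple with a single element. This is useful because bags (currently) only accept tuples, so a single element often has to be wrapped in a tuple before it can be put in a bag.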
/**
 * Create a tuple with a single element.  This is useful because of
 * the fact that bags (currently) only take tuples, we often end up
 * sticking a single element in a tuple in order to put it in a bag.
 * @param datum Datum to put in the tuple.
 * @return A tuple with one field
 */
public abstract Tuple newTuple(Object datum);

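tupleClass() returns the actual class of the tuples that the implementing factory produces. Hadoop needs this because it must know the exact class used for input and output.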
/**
 * Return the actual class representing a tuple that the implementing
 * factory will be returning.  This is needed because Hadoop needs
 * to know the exact class we will be using for input and output.
 * @return Class that implements tuple.
 */
public abstract Class<? extends Tuple> tupleClass();

protected TupleFactory() {
}

resetSelf() exists for testing only and should never be called by anything other than the unit tests.
/**
 * Provided for testing purposes only.  This function should never be
 * called by anybody but the unit tests.
 */
public static void resetSelf() {
    gSelf = null;
}


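tupleRawComparatorClass() returns the actual class implementing the raw comparator for the tuples that the factory produces. Override it to let Hadoop sort tuples faster; the returned class should know the tuple's serialization details. The default implementation (PigTupleDefaultRawComparator) will serialize the data before comparison.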
/**
 * Return the actual class implementing the raw comparator for tuples
 * that the factory will be returning. Ovverride this to allow Hadoop to
 * speed up tuple sorting. The actual returned class should know the
 * serialization details for the tuple. The default implementation 
 * (PigTupleDefaultRawComparator) will serialize the data before comparison
 * @return Class that implements tuple raw comparator.
 */
public Class<? extends TupleRawComparator> tupleRawComparatorClass() {
    return PigTupleDefaultRawComparator.class;
}

isFixedSize() reports whether the tuples created by this factory have a fixed size once created; in practical terms, whether they support append.
/**
 * This method is used to inspect whether the Tuples created by this factory
 * will be of a fixed size when they are created. In practical terms, this means
 * whether they support append or not.
 * @return where the Tuple is fixed or not
 */
public abstract boolean isFixedSize();
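
One way to read "fixed size means no append" is sketched below. Everything here is hypothetical (the SketchTuple interface, both implementations, and the appendIfAllowed helper); it only illustrates why a caller would want to ask the factory before appending.

```java
import java.util.ArrayList;
import java.util.List;

class FixedSizeSketch {

    // Hypothetical minimal tuple interface; not Pig's Tuple.
    interface SketchTuple {
        void append(Object value);
        int size();
    }

    // Backed by a growable list, so append works.
    static class GrowableTuple implements SketchTuple {
        private final List<Object> fields = new ArrayList<>();
        public void append(Object value) { fields.add(value); }
        public int size() { return fields.size(); }
    }

    // Backed by an array whose length is decided at construction time.
    static class FixedTuple implements SketchTuple {
        private final Object[] fields;
        FixedTuple(int size) { fields = new Object[size]; }
        public void append(Object value) {
            throw new UnsupportedOperationException("fixed-size tuple cannot grow");
        }
        public int size() { return fields.length; }
    }

    // Appends only when the producing factory reports that its tuples may grow.
    static boolean appendIfAllowed(SketchTuple t, boolean factoryIsFixedSize, Object value) {
        if (factoryIsFixedSize) {
            return false;
        }
        t.append(value);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(appendIfAllowed(new GrowableTuple(), false, "x")); // prints true
        System.out.println(appendIfAllowed(new FixedTuple(2), true, "x"));    // prints false
    }
}
```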

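Summary: TupleFactory is an abstract class. Apart from getInstance, most of its methods have no implementation and are left to the subclasses that extend it. TupleMaker is even more minimal; its code is shown below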