Learning Hive
李建奇
1 Learning
After reading part of the code, my impression is that Hive is fairly complex and its applicable scenarios are limited; in most cases Hadoop's native MapReduce is enough.
1.1 Version
0.6
1.2 Purpose
Learn from the experience of Facebook and others in using Hive, so that it can be applied at our own company.
The goal of studying the code is to use Hive better: debugging, tuning, applying new patches, and so on.
2 Pig + Hive : ETL + data warehouse
The data preparation phase is often known as ETL (Extract Transform Load) or the data factory. "Factory" is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers. The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves. In this phase, users may be engineers using the data for their systems, analysts, or decisionmakers. Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
2.1 data warehouse
Data warehouse use cases: In the data warehouse phase of processing, we see two dominant use cases: business-intelligence analysis and ad-hoc queries. In the first case, users connect the data to business intelligence (BI) tools — such as MicroStrategy — to generate reports or do further analysis. In the second case, users run ad-hoc queries issued by data analysts or decisionmakers. In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.
2.2 Facebook's application architecture
3 Hive
3.1 Architecture
3.2 Query Translation
SELECT url, count(*) FROM page_views GROUP BY url
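Hive turns this query into a single MapReduce job: the map side emits the group-by key (url), the shuffle groups rows by that key, and the reduce side counts. For comparison, here is a rough sketch (mine, not from the Hive docs) of the hand-written job it roughly corresponds to; the page_views layout (tab-separated text with the url in the first column) and the class name PageViewCount are my assumptions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PageViewCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable key, Text line,
        OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
      // assume tab-separated rows with the url in the first column
      String url = line.toString().split("\t", -1)[0];
      out.collect(new Text(url), ONE);            // emit (url, 1)
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text url, Iterator<LongWritable> vals,
        OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
      long sum = 0;
      while (vals.hasNext()) {
        sum += vals.next().get();                 // count rows per url
      }
      out.collect(url, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(PageViewCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}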
3.3 SerDe
3.4 Table storage structure
4 QL
I start my analysis from the plan, because I think the plan is the core of this system; that is, SQL -> plan -> execute.
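That same pipeline can also be driven programmatically through the Driver class. A minimal sketch, assuming an embedded use of ql's Driver; the SessionState setup, class name and query text are mine, only for illustration:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class RunOneQuery {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf(SessionState.class);
    SessionState.start(new SessionState(conf));   // Driver expects a session

    Driver driver = new Driver(conf);
    // run() = compile() + execute():
    //   compile: SQL text -> AST -> semantic analysis -> QueryPlan
    //   execute: QueryPlan -> root Tasks -> MapReduce jobs
    driver.run("SELECT url, count(1) FROM page_views GROUP BY url");
  }
}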
4.1 plan
4.1.1 From the unit-test (UT) point of view
public class TestPlan extends TestCase {
  final String F1 = "#affiliations";
  final String F2 = "friends[0].friendid";

  try {
    // initialize a complete map reduce configuration
    ExprNodeDesc expr1 = new ExprNodeColumnDesc(
        TypeInfoFactory.stringTypeInfo, F1, "", false);
    ExprNodeDesc expr2 = new ExprNodeColumnDesc(
        TypeInfoFactory.stringTypeInfo, F2, "", false);
    ExprNodeDesc filterExpr = TypeCheckProcFactory.DefaultExprProcessor
        .getFuncExprNodeDesc("==", expr1, expr2);

    FilterDesc filterCtx = new FilterDesc(filterExpr, false);

    // an Operator of type Filter
    Operator<FilterDesc> op = OperatorFactory.get(FilterDesc.class);
    op.setConf(filterCtx);

    // define a pathToAliases mapping (input path -> table aliases)
    ArrayList<String> aliasList = new ArrayList<String>();
    aliasList.add("a");
    LinkedHashMap<String, ArrayList<String>> pa =
        new LinkedHashMap<String, ArrayList<String>>();
    pa.put("/tmp/testfolder", aliasList);

    // define a pathToPartitionInfo mapping (input path -> PartitionDesc)
    TableDesc tblDesc = Utilities.defaultTd;
    PartitionDesc partDesc = new PartitionDesc(tblDesc, null);
    LinkedHashMap<String, PartitionDesc> pt =
        new LinkedHashMap<String, PartitionDesc>();
    pt.put("/tmp/testfolder", partDesc);

    // define an aliasToWork mapping (alias -> root Operator)
    LinkedHashMap<String, Operator<? extends Serializable>> ao =
        new LinkedHashMap<String, Operator<? extends Serializable>>();
    ao.put("a", op);

    MapredWork mrwork = new MapredWork();
    mrwork.setPathToAliases(pa);
    mrwork.setPathToPartitionInfo(pt);
    mrwork.setAliasToWork(ao);
  }
My guess is that a job consists of input, output, and a MapredWork. Next, I want to see how the plan gets executed.
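To check that guess, here is a small sketch (mine, not from the test) of how a hand-built MapredWork gets attached to an actual Hadoop job. Utilities.setMapRedWork is the same call that shows up later in ExecDriver.execute(); the scratch-directory path is a made-up placeholder, and I assume that parameter is a plain path string as in the ExecDriver excerpt further below.

import org.apache.hadoop.hive.ql.exec.ExecDriver;
import org.apache.hadoop.hive.ql.exec.ExecMapper;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.MapredWork;
import org.apache.hadoop.mapred.JobConf;

public class PlanToJob {
  static JobConf attachPlan(MapredWork mrwork) {
    JobConf job = new JobConf(ExecDriver.class);
    job.setMapperClass(ExecMapper.class);   // Hive's generic, plan-driven mapper
    // serialize the plan and make it visible to the tasks via the job conf
    Utilities.setMapRedWork(job, mrwork, "/tmp/hive-scratch");
    // on the task side, ExecMapper calls Utilities.getMapRedWork(job)
    // to rebuild the operator trees described in aliasToWork
    return job;
  }
}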
4.1.2 MapredWork
// This class is the core part of the plan
public class MapredWork implements Serializable {
  private static final long serialVersionUID = 1L;
  private String command;

  // map side work
  // use LinkedHashMap to make sure the iteration order is
  // deterministic, to ease testing
  private LinkedHashMap<String, ArrayList<String>> pathToAliases;
  private LinkedHashMap<String, PartitionDesc> pathToPartitionInfo;
  private LinkedHashMap<String, Operator<? extends Serializable>> aliasToWork;
  private LinkedHashMap<String, PartitionDesc> aliasToPartnInfo;

  // map<->reduce interface
  // schema of the map-reduce 'key' object - this is homogeneous
  private TableDesc keyDesc;
  // schema of the map-reduce 'val' object - this is heterogeneous
  private List<TableDesc> tagToValueDesc;

  private Operator<?> reducer;

  private Integer numReduceTasks;
  private Integer numMapTasks;
  private Integer minSplitSize;

  private boolean needsTagging;
  private boolean hadoopSupportsSplittable;

  private MapredLocalWork mapLocalWork;
  private String inputformat;
4.1.3 Driver
This is the main interface of the QL module.
// analysis (compile)
// Driver (simplified)
compile(command) {
  ctx = new Context(conf);

  ParseDriver pd = new ParseDriver();
  ASTNode tree = pd.parse(command, ctx);
  tree = ParseUtils.findRootNonNullToken(tree);

  BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
  // Do semantic analysis and plan generation
  sem.analyze(tree, ctx);

  // validate the plan
  sem.validate();

  plan = new QueryPlan(command, sem);
  // initialize FetchTask right here
  if (plan.getFetchTask() != null) {
    plan.getFetchTask().initialize(conf, plan, null);
  }
}
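The first half of compile() is just the ANTLR parser. A small standalone sketch of my own to look at the AST Hive builds for a query; I assume the single-argument parse() overload and the ANTLR-inherited toStringTree() here.

import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;

public class DumpAst {
  public static void main(String[] args) throws Exception {
    ParseDriver pd = new ParseDriver();
    // the parser alone: SQL text -> ANTLR AST, no metastore access yet
    ASTNode tree = pd.parse("SELECT url, count(1) FROM page_views GROUP BY url");
    // ASTNode extends ANTLR's CommonTree, so toStringTree() prints the tree in LISP style
    System.out.println(tree.toStringTree());
  }
}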
// launchTask
// this is similar to running a task in Hadoop
TaskResult tskRes = new TaskResult();
TaskRunner tskRun = new TaskRunner(tsk, tskRes);

// start a thread and let the task execute
tskRun.start();
// physical execution
public int execute() {
  plan.setStarted();
  int jobs = countJobs(plan.getRootTasks());

  // take the root tasks and queue them as runnable
  for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
    driverCxt.addToRunnable(tsk);
  }

  // run them in turn
  // Loop while you either have tasks running, or tasks queued up
  while (running.size() != 0 || runnable.peek() != null) {
    // Launch upto maxthreads tasks
    while (runnable.peek() != null && running.size() < maxthreads) {
      Task<? extends Serializable> tsk = runnable.remove();
      launchTask(tsk, queryId, noName, running, jobname, jobs, driverCxt);
    }

    // poll the Tasks to see which one completed
    TaskResult tskRes = pollTasks(running.keySet());
    TaskRunner tskRun = running.remove(tskRes);
    Task<? extends Serializable> tsk = tskRun.getTask();

    if (tsk.getChildTasks() != null) {
      for (Task<? extends Serializable> child : tsk.getChildTasks()) {
        if (DriverContext.isLaunchable(child)) {
          driverCxt.addToRunnable(child);
        }
      }
    }
  }
}
4.1.4 QueryPlan
// build the execution graph - the most complex part of Hive
/**
 * Populate api.QueryPlan from exec structures. This includes constructing the
 * dependency graphs of stages and operators.
 *
 * @throws IOException
 */
private void populateQueryPlan() throws IOException {
  query.setStageGraph(new org.apache.hadoop.hive.ql.plan.api.Graph());
  query.getStageGraph().setNodeType(NodeType.STAGE);
}
Having read this far, I think Hive is quite complex and not easy to use well. If you do not have many queries, writing mapper and reducer tasks by hand is probably fine as well.
4.1.5 Task
4.2 exec
4.2.1 TestExecDriver
// load the test files into tables
i = 0;
db = Hive.get(conf);
String[] srctables = {"src", "src2"};
LinkedList<String> cols = new LinkedList<String>();
cols.add("key");
cols.add("value");
for (String src : srctables) {
  db.dropTable(MetaStoreUtils.DEFAULT_DATABASE_NAME, src, true, true);
  // the createTable arguments were cut off in the original notes; the
  // input/output format classes below are my reconstruction from TestExecDriver
  db.createTable(src, cols, null, TextInputFormat.class,
      IgnoreKeyTextOutputFormat.class);
  db.loadTable(hadoopDataFile[i], src, false, null);
  i++;
}
private void executePlan(File planFile) throws Exception {
  String cmdLine = conf.getVar(HiveConf.ConfVars.HADOOPBIN) + " jar "
      + conf.getJar() + " org.apache.hadoop.hive.ql.exec.ExecDriver -plan "
      + planFile.toString() + " " + ExecDriver.generateCmdLine(conf);
  Process executor = Runtime.getRuntime().exec(cmdLine);
private void populateMapPlan1(Table src) {
  mr.setNumReduceTasks(Integer.valueOf(0));

  Operator<FileSinkDesc> op2 = OperatorFactory.get(new FileSinkDesc(
      tmpdir + "mapplan1.out", Utilities.defaultTd, true));
  Operator<FilterDesc> op1 = OperatorFactory.get(getTestFilterDesc("key"), op2);

  Utilities.addMapWork(mr, src, "a", op1);
}
4.2.2 Utilities.java
public static void addMapWork(MapredWork mr, Table tbl, String alias,
    Operator<?> work) {
  mr.addMapWork(tbl.getDataLocation().getPath(), alias, work,
      new PartitionDesc(getTableDesc(tbl), null));
}
// I am not sure what the alias is for (the sketch below gives my reading); now we can look at how ExecDriver runs.
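My reading of what the alias buys, as an illustrative sketch (this is my own code, not Hive's): the alias is the indirection that lets one input path feed more than one operator tree, for example a self-join where the same table appears under two aliases. MapOperator does the real routing inside ExecMapper; this only shows the two lookups involved.

import java.io.Serializable;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.plan.MapredWork;

public class AliasRouting {
  // For each alias registered for the file currently being read, look up the
  // root operator of that alias' tree; every input row is forwarded to all of them.
  static void showRouting(MapredWork work, String currentInputPath) {
    for (String alias : work.getPathToAliases().get(currentInputPath)) {
      Operator<? extends Serializable> rootOp = work.getAliasToWork().get(alias);
      System.out.println(currentInputPath + " -> alias '" + alias + "' -> "
          + rootOp.getClass().getSimpleName());
    }
  }
}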
4.2.3 ExecDriver.java
// execute the plan
/**
 * Execute a query plan using Hadoop.
 */
@Override
public int execute(DriverContext driverContext) {

  // map-side settings
  job.setMapperClass(ExecMapper.class);
  job.setMapOutputKeyClass(HiveKey.class);
  job.setMapOutputValueClass(BytesWritable.class);
  job.setPartitionerClass((Class<? extends Partitioner>) Class.forName(
      HiveConf.getVar(job, HiveConf.ConfVars.HIVEPARTITIONER)));

  if (work.getNumMapTasks() != null) {
    job.setNumMapTasks(work.getNumMapTasks().intValue());
  }
  if (work.getMinSplitSize() != null) {
    HiveConf.setIntVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZE,
        work.getMinSplitSize().intValue());
  }

  // reduce-side settings
  job.setNumReduceTasks(work.getNumReduceTasks().intValue());
  job.setReducerClass(ExecReducer.class);

  if (work.getInputformat() != null) {
    HiveConf.setVar(job, HiveConf.ConfVars.HIVEINPUTFORMAT,
        work.getInputformat());
  }

  // an amusing comment in the source:
  // No-Op - we don't really write anything here ..
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);

  // add the input paths
  addInputPaths(job, work, emptyScratchDirStr);
  Utilities.setMapRedWork(job, work, hiveScratchDir);

  // submit the job
  orig_rj = rj = jc.submitJob(job);
// At this point I can see that Hive's execution model is very similar to the hadoopWrapper I have built before. Next, I want to look more closely at how the plan itself is constructed.
// In my experience the most commonly used operators are aggregation and join.
// I suspect Hive can do more than just save job coding, e.g. incremental processing; keep reading. Back to 4.1.3 Driver.
4.2.4 ExecMapper
public class ExecMapper extends MapReduceBase implements Mapper {
  private MapOperator mo;
5 SerDe
6 MetaStore
7 Shim
7.1 HadoopShims
// hides the differences between Hadoop versions
/**
 * In order to be compatible with multiple versions of Hadoop, all parts
 * of the Hadoop interface that are not cross-version compatible are
 * encapsulated in an implementation of this class. Users should use
 * the ShimLoader class as a factory to obtain an implementation of
 * HadoopShims corresponding to the version of Hadoop currently on the
 * classpath.
 */
public interface HadoopShims {
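How a caller actually gets the right shim, in a minimal sketch of my own (the printed implementation class, e.g. Hadoop20Shims, depends on the Hadoop version on the classpath):

import org.apache.hadoop.hive.shims.HadoopShims;
import org.apache.hadoop.hive.shims.ShimLoader;

public class ShimDemo {
  public static void main(String[] args) {
    // ShimLoader picks the HadoopShims implementation that matches the
    // major Hadoop version found on the classpath
    HadoopShims shims = ShimLoader.getHadoopShims();
    System.out.println("Using shims: " + shims.getClass().getName());
  }
}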
8 Ref
<http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/>
<http://wiki.apache.org/hadoop/Hive>
<http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010>