Learning Hive
李建奇
1 Learning
After reading part of the code, my impression is that Hive is fairly complex and its applicable scenarios are limited; in most cases Hadoop's native MapReduce is enough.
1.1 Version
0.6
1.2 Purpose
Learn from the experience of Facebook and others in using Hive, so that it can be applied at our own company.
The goal of studying the code is to use Hive better: debugging, tuning, applying new patches, and so on.
2 Pig + Hive : ETL + data warehouse
The data preparation phase is often known as ETL (Extract Transform Load) or the data factory. "Factory" is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers. The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves. In this phase, users may be engineers using the data for their systems, analysts, or decisionmakers. Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
2.1 data warehouse
Data warehouse use cases: In the data warehouse phase of processing, we see two dominant use cases: business-intelligence analysis and ad-hoc queries. In the first case, users connect the data to business intelligence (BI) tools — such as MicroStrategy — to generate reports or do further analysis. In the second case, users run ad-hoc queries issued by data analysts or decisionmakers. In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.
2.2 Facebook's application architecture
3 Hive
3.1 Architecture
3.2 Query Translation
SELECT url, count(*) FROM page_views GROUP BY url
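Hive turns this query into a single MapReduce job: the map side emits the group-by key (url), the shuffle groups rows by that key, and the reduce side counts. For comparison, here is a rough sketch (mine, not from the Hive docs) of the hand-written job it roughly corresponds to; the page_views layout (tab-separated text with the url in the first column) and the class name PageViewCount are my assumptions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PageViewCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable key, Text line,
        OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
      // assume tab-separated rows with the url in the first column
      String url = line.toString().split("\t", -1)[0];
      out.collect(new Text(url), ONE);            // emit (url, 1)
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text url, Iterator<LongWritable> vals,
        OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
      long sum = 0;
      while (vals.hasNext()) {
        sum += vals.next().get();                 // count rows per url
      }
      out.collect(url, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(PageViewCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}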
3.3 SerDe
3.4 Table storage structure
4 QL
I start my analysis from the plan, because I think the plan is the core of this system; that is, SQL -> plan -> execute.
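That same pipeline can also be driven programmatically through the Driver class. A minimal sketch, assuming an embedded use of ql's Driver; the SessionState setup, class name and query text are mine, only for illustration:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class RunOneQuery {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf(SessionState.class);
    SessionState.start(new SessionState(conf));   // Driver expects a session

    Driver driver = new Driver(conf);
    // run() = compile() + execute():
    //   compile: SQL text -> AST -> semantic analysis -> QueryPlan
    //   execute: QueryPlan -> root Tasks -> MapReduce jobs
    driver.run("SELECT url, count(1) FROM page_views GROUP BY url");
  }
}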
4.1 plan
4.1.1 From the unit-test (UT) point of view
public class TestPlan extends TestCase {
  final String F1 = "#affiliations";
  final String F2 = "friends[0].friendid";

  try {
    // initialize a complete map reduce configuration
    ExprNodeDesc expr1 = new ExprNodeColumnDesc(
        TypeInfoFactory.stringTypeInfo, F1, "", false);
    ExprNodeDesc expr2 = new ExprNodeColumnDesc(
        TypeInfoFactory.stringTypeInfo, F2, "", false);
    ExprNodeDesc filterExpr = TypeCheckProcFactory.DefaultExprProcessor
        .getFuncExprNodeDesc("==", expr1, expr2);

    FilterDesc filterCtx = new FilterDesc(filterExpr, false);

    // an Operator of type Filter
    Operator<FilterDesc> op = OperatorFactory.get(FilterDesc.class);
    op.setConf(filterCtx);

    // define a pathToAliases mapping (input path -> table aliases)
    ArrayList<String> aliasList = new ArrayList<String>();
    aliasList.add("a");
    LinkedHashMap<String, ArrayList<String>> pa =
        new LinkedHashMap<String, ArrayList<String>>();
    pa.put("/tmp/testfolder", aliasList);

    // define a pathToPartitionInfo mapping (input path -> PartitionDesc)
    TableDesc tblDesc = Utilities.defaultTd;
    PartitionDesc partDesc = new PartitionDesc(tblDesc, null);
    LinkedHashMap<String, PartitionDesc> pt =
        new LinkedHashMap<String, PartitionDesc>();
    pt.put("/tmp/testfolder", partDesc);

    // define an aliasToWork mapping (alias -> root Operator)
    LinkedHashMap<String, Operator<? extends Serializable>> ao =
        new LinkedHashMap<String, Operator<? extends Serializable>>();
    ao.put("a", op);

    MapredWork mrwork = new MapredWork();
    mrwork.setPathToAliases(pa);
    mrwork.setPathToPartitionInfo(pt);
    mrwork.setAliasToWork(ao);
  }
My guess is that a job consists of input, output, and a MapredWork. Next, I want to see how the plan gets executed.
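To check that guess, here is a small sketch (mine, not from the test) of how a hand-built MapredWork gets attached to an actual Hadoop job. Utilities.setMapRedWork is the same call that shows up later in ExecDriver.execute(); the scratch-directory path is a made-up placeholder, and I assume that parameter is a plain path string as in the ExecDriver excerpt further below.

import org.apache.hadoop.hive.ql.exec.ExecDriver;
import org.apache.hadoop.hive.ql.exec.ExecMapper;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.MapredWork;
import org.apache.hadoop.mapred.JobConf;

public class PlanToJob {
  static JobConf attachPlan(MapredWork mrwork) {
    JobConf job = new JobConf(ExecDriver.class);
    job.setMapperClass(ExecMapper.class);   // Hive's generic, plan-driven mapper
    // serialize the plan and make it visible to the tasks via the job conf
    Utilities.setMapRedWork(job, mrwork, "/tmp/hive-scratch");
    // on the task side, ExecMapper calls Utilities.getMapRedWork(job)
    // to rebuild the operator trees described in aliasToWork
    return job;
  }
}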
4.1.2 MapredWork
// This class is the core part of the plan
public class MapredWork implements Serializable {
  private static final long serialVersionUID = 1L;
  private String command;

  // map side work
  // use LinkedHashMap to make sure the iteration order is
  // deterministic, to ease testing
  private LinkedHashMap<String, ArrayList<String>> pathToAliases;
  private LinkedHashMap<String, PartitionDesc> pathToPartitionInfo;
  private LinkedHashMap<String, Operator<? extends Serializable>> aliasToWork;
  private LinkedHashMap<String, PartitionDesc> aliasToPartnInfo;

  // map<->reduce interface
  // schema of the map-reduce 'key' object - this is homogeneous
  private TableDesc keyDesc;
  // schema of the map-reduce 'val' object - this is heterogeneous
  private List<TableDesc> tagToValueDesc;

  private Operator<?> reducer;

  private Integer numReduceTasks;
  private Integer numMapTasks;
  private Integer minSplitSize;

  private boolean needsTagging;
  private boolean hadoopSupportsSplittable;

  private MapredLocalWork mapLocalWork;
  private String inputformat;
4.1.3 Driver
This is the main interface of the QL module.
// analysis (compile)
// Driver (simplified)
compile(command) {
  ctx = new Context(conf);

  ParseDriver pd = new ParseDriver();
  ASTNode tree = pd.parse(command, ctx);
  tree = ParseUtils.findRootNonNullToken(tree);

  BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
  // Do semantic analysis and plan generation
  sem.analyze(tree, ctx);

  // validate the plan
  sem.validate();

  plan = new QueryPlan(command, sem);
  // initialize FetchTask right here
  if (plan.getFetchTask() != null) {
    plan.getFetchTask().initialize(conf, plan, null);
  }
}
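The first half of compile() is just the ANTLR parser. A small standalone sketch of my own to look at the AST Hive builds for a query; I assume the single-argument parse() overload and the ANTLR-inherited toStringTree() here.

import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;

public class DumpAst {
  public static void main(String[] args) throws Exception {
    ParseDriver pd = new ParseDriver();
    // the parser alone: SQL text -> ANTLR AST, no metastore access yet
    ASTNode tree = pd.parse("SELECT url, count(1) FROM page_views GROUP BY url");
    // ASTNode extends ANTLR's CommonTree, so toStringTree() prints the tree in LISP style
    System.out.println(tree.toStringTree());
  }
}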
// launchTask
// this is similar to running a task in Hadoop
TaskResult tskRes = new TaskResult();
TaskRunner tskRun = new TaskRunner(tsk, tskRes);

// start a thread and let the task execute
tskRun.start();
// physical execution
public int execute() {
  plan.setStarted();
  int jobs = countJobs(plan.getRootTasks());

  // take the root tasks and queue them as runnable
  for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
    driverCxt.addToRunnable(tsk);
  }

  // run them in turn
  // Loop while you either have tasks running, or tasks queued up
  while (running.size() != 0 || runnable.peek() != null) {
    // Launch upto maxthreads tasks
    while (runnable.peek() != null && running.size() < maxthreads) {
      Task<? extends Serializable> tsk = runnable.remove();
      launchTask(tsk, queryId, noName, running, jobname, jobs, driverCxt);
    }

    // poll the Tasks to see which one completed
    TaskResult tskRes = pollTasks(running.keySet());
    TaskRunner tskRun = running.remove(tskRes);
    Task<? extends Serializable> tsk = tskRun.getTask();

    if (tsk.getChildTasks() != null) {
      for (Task<? extends Serializable> child : tsk.getChildTasks()) {
        if (DriverContext.isLaunchable(child)) {
          driverCxt.addToRunnable(child);
        }
      }
    }
  }
}
4.1.4 QueryPlan
// build the execution graph - the most complex part of Hive
/**
 * Populate api.QueryPlan from exec structures. This includes constructing the
 * dependency graphs of stages and operators.
 *
 * @throws IOException
 */
private void populateQueryPlan() throws IOException {
  query.setStageGraph(new org.apache.hadoop.hive.ql.plan.api.Graph());
  query.getStageGraph().setNodeType(NodeType.STAGE);
}
Having read this far, I think Hive is quite complex and not easy to use well. If you do not have many queries, writing mapper and reducer tasks by hand is probably fine as well.
4.1.5 Task
4.2 exec
4.2.1 TestExecDriver
// load the test files into tables
i = 0;
db = Hive.get(conf);
String[] srctables = {"src", "src2"};
LinkedList<String> cols = new LinkedList<String>();
cols.add("key");
cols.add("value");
for (String src : srctables) {
  db.dropTable(MetaStoreUtils.DEFAULT_DATABASE_NAME, src, true, true);
  // the createTable arguments were cut off in the original notes; the
  // input/output format classes below are my reconstruction from TestExecDriver
  db.createTable(src, cols, null, TextInputFormat.class,
      IgnoreKeyTextOutputFormat.class);
  db.loadTable(hadoopDataFile[i], src, false, null);
  i++;
}
private void executePlan(File planFile) throws Exception {
  String cmdLine = conf.getVar(HiveConf.ConfVars.HADOOPBIN) + " jar "
      + conf.getJar() + " org.apache.hadoop.hive.ql.exec.ExecDriver -plan "
      + planFile.toString() + " " + ExecDriver.generateCmdLine(conf);
  Process executor = Runtime.getRuntime().exec(cmdLine);
private void populateMapPlan1(Table src) {
  mr.setNumReduceTasks(Integer.valueOf(0));

  Operator<FileSinkDesc> op2 = OperatorFactory.get(new FileSinkDesc(
      tmpdir + "mapplan1.out", Utilities.defaultTd, true));
  Operator<FilterDesc> op1 = OperatorFactory.get(getTestFilterDesc("key"), op2);

  Utilities.addMapWork(mr, src, "a", op1);
}
4.2.2 Utilities.java
public static void addMapWork(MapredWork mr, Table tbl, String alias,
    Operator<?> work) {
  mr.addMapWork(tbl.getDataLocation().getPath(), alias, work,
      new PartitionDesc(getTableDesc(tbl), null));
}
// I am not sure what the alias is for (the sketch below gives my reading); now we can look at how ExecDriver runs.
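My reading of what the alias buys, as an illustrative sketch (this is my own code, not Hive's): the alias is the indirection that lets one input path feed more than one operator tree, for example a self-join where the same table appears under two aliases. MapOperator does the real routing inside ExecMapper; this only shows the two lookups involved.

import java.io.Serializable;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.plan.MapredWork;

public class AliasRouting {
  // For each alias registered for the file currently being read, look up the
  // root operator of that alias' tree; every input row is forwarded to all of them.
  static void showRouting(MapredWork work, String currentInputPath) {
    for (String alias : work.getPathToAliases().get(currentInputPath)) {
      Operator<? extends Serializable> rootOp = work.getAliasToWork().get(alias);
      System.out.println(currentInputPath + " -> alias '" + alias + "' -> "
          + rootOp.getClass().getSimpleName());
    }
  }
}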
4.2.3 ExecDriver.java
// execute the plan
/**
 * Execute a query plan using Hadoop.
 */
@Override
public int execute(DriverContext driverContext) {

  // map-side settings
  job.setMapperClass(ExecMapper.class);
  job.setMapOutputKeyClass(HiveKey.class);
  job.setMapOutputValueClass(BytesWritable.class);
  job.setPartitionerClass((Class<? extends Partitioner>) Class.forName(
      HiveConf.getVar(job, HiveConf.ConfVars.HIVEPARTITIONER)));

  if (work.getNumMapTasks() != null) {
    job.setNumMapTasks(work.getNumMapTasks().intValue());
  }
  if (work.getMinSplitSize() != null) {
    HiveConf.setIntVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZE,
        work.getMinSplitSize().intValue());
  }

  // reduce-side settings
  job.setNumReduceTasks(work.getNumReduceTasks().intValue());
  job.setReducerClass(ExecReducer.class);

  if (work.getInputformat() != null) {
    HiveConf.setVar(job, HiveConf.ConfVars.HIVEINPUTFORMAT,
        work.getInputformat());
  }

  // an amusing comment in the source:
  // No-Op - we don't really write anything here ..
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);

  // add the input paths
  addInputPaths(job, work, emptyScratchDirStr);
  Utilities.setMapRedWork(job, work, hiveScratchDir);

  // submit the job
  orig_rj = rj = jc.submitJob(job);
// At this point I can see that Hive's execution model is very similar to the hadoopWrapper I have built before. Next, I want to look more closely at how the plan itself is constructed.
// In my experience the most commonly used operators are aggregation and join.
// I suspect Hive can do more than just save job coding, e.g. incremental processing; keep reading. Back to 4.1.3 Driver.
4.2.4 ExecMapper
public class ExecMapper extends MapReduceBase implements Mapper {
  private MapOperator mo;
5 SerDe
6 MetaStore
7 Shim
7.1 HadoopShims
// hides the differences between Hadoop versions
/**
 * In order to be compatible with multiple versions of Hadoop, all parts
 * of the Hadoop interface that are not cross-version compatible are
 * encapsulated in an implementation of this class. Users should use
 * the ShimLoader class as a factory to obtain an implementation of
 * HadoopShims corresponding to the version of Hadoop currently on the
 * classpath.
 */
public interface HadoopShims {
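How a caller actually gets the right shim, in a minimal sketch of my own (the printed implementation class, e.g. Hadoop20Shims, depends on the Hadoop version on the classpath):

import org.apache.hadoop.hive.shims.HadoopShims;
import org.apache.hadoop.hive.shims.ShimLoader;

public class ShimDemo {
  public static void main(String[] args) {
    // ShimLoader picks the HadoopShims implementation that matches the
    // major Hadoop version found on the classpath
    HadoopShims shims = ShimLoader.getHadoopShims();
    System.out.println("Using shims: " + shims.getClass().getName());
  }
}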
8 Ref
<http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/>
<http://wiki.apache.org/hadoop/Hive>
<http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010>