Studying the Hive Source Code

Studying Hive

 

李建奇

 

1       Study Notes

After reading part of the code, my impression is that Hive is fairly complex and its applicable scenarios are limited; in most cases Hadoop's native MapReduce is good enough.

 

 

1.1        Version

0.6

 

1.2        Purpose

  Learn from the experience of Facebook and others in applying Hive, so that it can be applied at our own company.

  The purpose of studying the code is to apply Hive better, for example debugging, tuning, and applying new patches.

 

2       Pig + Hive : ETL + data warehouse

 

The data preparation phase is often known as ETL (Extract Transform Load) or the data factory. "Factory" is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers.

The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves. In this phase, users may be engineers using the data for their systems, analysts, or decisionmakers.

Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.

 

 

 

 

 

2.1        data warehouse

 

Data warehouse use cases

In the data warehouse phase of processing, we see two dominant use cases: business-intelligence analysis and ad-hoc queries.

In the first case, users connect the data to business intelligence (BI) tools — such as MicroStrategy — to generate reports or do further analysis.

In the second case, users run ad-hoc queries issued by data analysts or decisionmakers.

In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.

 

 

 

 

 

2.2        Facebook's application architecture

 

 

3       Hive

 

3.1        Architecture

 

 

3.2         Query Translation

SELECT url, count(*) FROM page_views GROUP BY url
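
Hive compiles this single statement into an operator tree (roughly TableScan -> Select -> GroupBy (partial) -> ReduceSink on the map side, and GroupBy (final) -> FileSink on the reduce side) and submits it as one MapReduce job. For comparison, here is a minimal hand-written sketch of the same aggregation in plain Hadoop; the class name and the assumption that url is the first tab-separated column of a page_views row are mine, for illustration only.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UrlCount {

  public static class UrlMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      // emit (url, 1) for every page_views row
      String url = value.toString().split("\t", 2)[0];
      out.collect(new Text(url), ONE);
    }
  }

  public static class UrlReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
        OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new LongWritable(sum));   // (url, count)
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(UrlCount.class);
    job.setJobName("url count");
    job.setMapperClass(UrlMapper.class);
    job.setCombinerClass(UrlReducer.class);      // partial aggregation, like Hive's map-side GroupBy
    job.setReducerClass(UrlReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}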

 

3.3          SerDe

 

 

3.4        Table storage layout

 

 

 

 

4       QL

  I start the analysis from the plan, because I think the plan should be the core of this system; that is, SQL -> plan -> execute.

4.1        plan

4.1.1  From the unit-test (UT) perspective

// Excerpt from TestPlan (imports and assertions omitted), cleaned up so the structure is clear.
public class TestPlan extends TestCase {

  final String F1 = "#affiliations";
  final String F2 = "friends[0].friendid";

  public void testPlan() throws Exception {
    try {
      // initialize a complete map reduce configuration
      ExprNodeDesc expr1 = new ExprNodeColumnDesc(
          TypeInfoFactory.stringTypeInfo, F1, "", false);
      ExprNodeDesc expr2 = new ExprNodeColumnDesc(
          TypeInfoFactory.stringTypeInfo, F2, "", false);
      ExprNodeDesc filterExpr = TypeCheckProcFactory.DefaultExprProcessor
          .getFuncExprNodeDesc("==", expr1, expr2);

      FilterDesc filterCtx = new FilterDesc(filterExpr, false);

      // a Filter-type operator
      Operator<FilterDesc> op = OperatorFactory.get(FilterDesc.class);
      op.setConf(filterCtx);

      // define a pathToAliases mapping
      ArrayList<String> aliasList = new ArrayList<String>();
      aliasList.add("a");
      LinkedHashMap<String, ArrayList<String>> pa = new LinkedHashMap<String, ArrayList<String>>();
      pa.put("/tmp/testfolder", aliasList);

      // define a pathToPartitionInfo mapping
      TableDesc tblDesc = Utilities.defaultTd;
      PartitionDesc partDesc = new PartitionDesc(tblDesc, null);
      LinkedHashMap<String, PartitionDesc> pt = new LinkedHashMap<String, PartitionDesc>();
      pt.put("/tmp/testfolder", partDesc);

      // define an aliasToWork mapping (alias -> operator tree)
      LinkedHashMap<String, Operator<? extends Serializable>> ao =
          new LinkedHashMap<String, Operator<? extends Serializable>>();
      ao.put("a", op);

      MapredWork mrwork = new MapredWork();
      mrwork.setPathToAliases(pa);
      mrwork.setPathToPartitionInfo(pt);
      mrwork.setAliasToWork(ao);
      // ... the test then serializes mrwork and checks the result
    } catch (Exception e) {
      e.printStackTrace();
      throw e;
    }
  }
}
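
The real test then round-trips this MapredWork through an XML plan file and diffs the result. MapredWork is a plain JavaBean, so a java.beans.XMLEncoder round trip is essentially what happens; this is also how the plan is shipped to the map and reduce tasks as plan.xml in the scratch directory. A rough, self-contained sketch (the file name and the printed field are my own choices):

import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.hive.ql.plan.MapredWork;

public class PlanRoundTrip {
  public static void main(String[] args) throws IOException {
    MapredWork work = new MapredWork();   // assume it was populated as in the test above
    File planFile = File.createTempFile("plan", ".xml");

    // write the plan out as XML, the same mechanism Hive uses for plan.xml
    XMLEncoder enc = new XMLEncoder(
        new BufferedOutputStream(new FileOutputStream(planFile)));
    enc.writeObject(work);
    enc.close();

    // read it back and check that the structure survived
    XMLDecoder dec = new XMLDecoder(
        new BufferedInputStream(new FileInputStream(planFile)));
    MapredWork restored = (MapredWork) dec.readObject();
    dec.close();

    System.out.println("restored pathToAliases: " + restored.getPathToAliases());
  }
}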

 

My guess is that a job is made up of an input, an output, and a MapredWork. Next, let's see how the plan is executed.

 

 

 

 

4.1.2  MapredWork

 

// This class is the core of the plan.

 

public class MapredWork implements Serializable {
  private static final long serialVersionUID = 1L;
  private String command;

  // map side work
  // use LinkedHashMap to make sure the iteration order is
  // deterministic, to ease testing
  private LinkedHashMap<String, ArrayList<String>> pathToAliases;

  private LinkedHashMap<String, PartitionDesc> pathToPartitionInfo;

  private LinkedHashMap<String, Operator<? extends Serializable>> aliasToWork;

  private LinkedHashMap<String, PartitionDesc> aliasToPartnInfo;

  // map<->reduce interface
  // schema of the map-reduce 'key' object - this is homogeneous
  private TableDesc keyDesc;

  // schema of the map-reduce 'val' object - this is heterogeneous
  private List<TableDesc> tagToValueDesc;

  private Operator<?> reducer;

  private Integer numReduceTasks;
  private Integer numMapTasks;
  private Integer minSplitSize;

  private boolean needsTagging;
  private boolean hadoopSupportsSplittable;

  private MapredLocalWork mapLocalWork;
  private String inputformat;

  // ... getters, setters, and helper methods omitted
}

 

 

 

4.1.3  Driver

 This is the main entry point of the QL module.

 

 

// compile: parse, do semantic analysis, and generate the plan
// Driver
public int compile(String command) {
  ctx = new Context(conf);

  ParseDriver pd = new ParseDriver();
  ASTNode tree = pd.parse(command, ctx);
  tree = ParseUtils.findRootNonNullToken(tree);

  BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
  // Do semantic analysis and plan generation
  sem.analyze(tree, ctx);

  // validate the plan
  sem.validate();

  plan = new QueryPlan(command, sem);
  // initialize FetchTask right here
  if (plan.getFetchTask() != null) {
    plan.getFetchTask().initialize(conf, plan, null);
  }
  // ... error handling and return code omitted
}

 

 

// launchTask
// this is similar to how a task is run in Hadoop
TaskResult tskRes = new TaskResult();
TaskRunner tskRun = new TaskRunner(tsk, tskRes);

// start a thread that calls the task's execute()
tskRun.start();

 

 

 

 

 

 

 

// physical execution
public int execute() {

  plan.setStarted();
  int jobs = countJobs(plan.getRootTasks());

  // queue up the root tasks
  for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
    driverCxt.addToRunnable(tsk);
  }

  // run them in turn
  // Loop while you either have tasks running, or tasks queued up
  while (running.size() != 0 || runnable.peek() != null) {
    // Launch upto maxthreads tasks
    while (runnable.peek() != null && running.size() < maxthreads) {
      Task<? extends Serializable> tsk = runnable.remove();
      launchTask(tsk, queryId, noName, running, jobname, jobs, driverCxt);
    }

    // poll the Tasks to see which one completed
    TaskResult tskRes = pollTasks(running.keySet());
    TaskRunner tskRun = running.remove(tskRes);
    Task<? extends Serializable> tsk = tskRun.getTask();

    // when a task finishes, its launchable children become runnable
    if (tsk.getChildTasks() != null) {
      for (Task<? extends Serializable> child : tsk.getChildTasks()) {
        if (DriverContext.isLaunchable(child)) {
          driverCxt.addToRunnable(child);
        }
      }
    }
  }
}
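
To see compile() and execute() run end to end, the Driver can be driven directly; a minimal sketch is below. I have not checked the exact signatures against 0.6 (later versions wrap the return value in a CommandProcessorResponse), and the query and table are just placeholders.

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class DriverSmokeTest {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf(SessionState.class);
    SessionState.start(conf);   // Driver expects an active session
    Driver driver = new Driver(conf);

    // run() goes through compile(command) and then execute(),
    // i.e. exactly the path walked through above
    int ret = driver.run("SELECT url, count(1) FROM page_views GROUP BY url");
    System.out.println("exit code: " + ret);
  }
}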

 

 

 

 

4.1.4   QueryPlan

 

 

// builds the execution graph; the most complex part of Hive
/**
 * Populate api.QueryPlan from exec structures. This includes constructing the
 * dependency graphs of stages and operators.
 *
 * @throws IOException
 */
private void populateQueryPlan() throws IOException {
  query.setStageGraph(new org.apache.hadoop.hive.ql.plan.api.Graph());
  query.getStageGraph().setNodeType(NodeType.STAGE);

  // ... (rest of the method omitted)
}

 

 

 

  Having read this far, I feel Hive is quite complex and not easy to use well. If there are not many queries, writing the mapper and reduce tasks by hand is probably fine too.

4.1.5  Task

 

 

 

 

 

 

4.2        exec

 

4.2.1  TestExecDriver

 

// load the test files into tables
i = 0;
db = Hive.get(conf);
String[] srctables = {"src", "src2"};
LinkedList<String> cols = new LinkedList<String>();
cols.add("key");
cols.add("value");
for (String src : srctables) {
  db.dropTable(MetaStoreUtils.DEFAULT_DATABASE_NAME, src, true, true);
  db.createTable(src, cols, null, TextInputFormat.class,
      IgnoreKeyTextOutputFormat.class);
  db.loadTable(hadoopDataFile[i], src, false, null);
  i++;
}

 

private void executePlan(File planFile) throws Exception {

  String cmdLine = conf.getVar(HiveConf.ConfVars.HADOOPBIN) + " jar "
      + conf.getJar() + " org.apache.hadoop.hive.ql.exec.ExecDriver -plan "
      + planFile.toString() + " " + ExecDriver.generateCmdLine(conf);

  Process executor = Runtime.getRuntime().exec(cmdLine);
  // ... (rest of the method omitted)
}

private void populateMapPlan1(Table src) {
  mr.setNumReduceTasks(Integer.valueOf(0));

  Operator<FileSinkDesc> op2 = OperatorFactory.get(new FileSinkDesc(tmpdir
      + "mapplan1.out", Utilities.defaultTd, true));
  Operator<FilterDesc> op1 = OperatorFactory.get(getTestFilterDesc("key"),
      op2);

  Utilities.addMapWork(mr, src, "a", op1);
}

 

 

 

 

 

 

 

 

 

4.2.2  Utilities.java

 

 

public static void addMapWork(MapredWork mr, Table tbl, String alias,
    Operator<?> work) {
  mr.addMapWork(tbl.getDataLocation().getPath(), alias, work,
      new PartitionDesc(getTableDesc(tbl), null));
}

I am not entirely sure what the alias is used for; now let's see how ExecDriver runs.
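
My reading, for what it is worth: the alias is the table alias from the query. One input path can be read under several aliases (for example in a self-join), so pathToAliases maps a path to a list of aliases, and aliasToWork gives each alias its own operator tree. A hypothetical fragment, with a made-up path and placeholder operators (not from the Hive source):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.plan.MapredWork;

public class AliasIllustration {
  // scanForA / scanForB stand in for the per-alias operator trees
  static void wirePlan(MapredWork work,
      Operator<? extends Serializable> scanForA,
      Operator<? extends Serializable> scanForB) {
    // one path, two aliases: e.g. "... FROM page_views a JOIN page_views b ..."
    ArrayList<String> aliases = new ArrayList<String>();
    aliases.add("a");
    aliases.add("b");
    LinkedHashMap<String, ArrayList<String>> pathToAliases =
        new LinkedHashMap<String, ArrayList<String>>();
    pathToAliases.put("/user/hive/warehouse/page_views", aliases);

    // each alias gets its own operator tree
    LinkedHashMap<String, Operator<? extends Serializable>> aliasToWork =
        new LinkedHashMap<String, Operator<? extends Serializable>>();
    aliasToWork.put("a", scanForA);
    aliasToWork.put("b", scanForB);

    work.setPathToAliases(pathToAliases);
    work.setAliasToWork(aliasToWork);
  }
}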

 

 

 

 

4.2.3  ExecDriver.java

 

 

// execute the plan
/**
 * Execute a query plan using Hadoop.
 */
@Override
public int execute(DriverContext driverContext) {

  // map-side settings
  job.setMapperClass(ExecMapper.class);

  job.setMapOutputKeyClass(HiveKey.class);
  job.setMapOutputValueClass(BytesWritable.class);

  job.setPartitionerClass((Class<? extends Partitioner>)
      (Class.forName(HiveConf.getVar(job, HiveConf.ConfVars.HIVEPARTITIONER))));

  if (work.getNumMapTasks() != null) {
    job.setNumMapTasks(work.getNumMapTasks().intValue());
  }
  if (work.getMinSplitSize() != null) {
    HiveConf.setIntVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZE,
        work.getMinSplitSize().intValue());
  }

  // reduce-side settings
  job.setNumReduceTasks(work.getNumReduceTasks().intValue());
  job.setReducerClass(ExecReducer.class);

  if (work.getInputformat() != null) {
    HiveConf.setVar(job, HiveConf.ConfVars.HIVEINPUTFORMAT, work.getInputformat());
  }

  // heh, this comment is amusing:
  // No-Op - we don't really write anything here ..
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);

  // add the input paths and ship the serialized plan
  addInputPaths(job, work, emptyScratchDirStr);
  Utilities.setMapRedWork(job, work, hiveScratchDir);

  // submit the job
  orig_rj = rj = jc.submitJob(job);
  // ...
}

 

At this point I see that Hive's execution model is very similar to the hadoopWrapper I built before. Next, I want to take a closer look at how the plan is produced.

In my experience the most commonly used operators are aggregation and join.

I also suspect that Hive can do more than just cut down on job coding, for example incremental processing; I will keep reading. Back to 4.1.3 Driver.

 

 

4.2.4  ExecMapper

 

public class ExecMapper extends MapReduceBase implements Mapper {

  private MapOperator mo;
  // ... deserializes the plan and pushes each input row through the operator tree

 

 

 

 

 

 

 

 

5       SerDe

 

 

 

 

 

 

6       MetaStore

 

 

 

7       Shim

7.1        HadoopShims

 

// hides the differences between Hadoop versions

/**
 * In order to be compatible with multiple versions of Hadoop, all parts
 * of the Hadoop interface that are not cross-version compatible are
 * encapsulated in an implementation of this class. Users should use
 * the ShimLoader class as a factory to obtain an implementation of
 * HadoopShims corresponding to the version of Hadoop currently on the
 * classpath.
 */
public interface HadoopShims {
  // ...
}
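
A minimal usage sketch (the class name is mine; ShimLoader and getHadoopShims() are the factory described in the comment above):

import org.apache.hadoop.hive.shims.HadoopShims;
import org.apache.hadoop.hive.shims.ShimLoader;

public class ShimDemo {
  public static void main(String[] args) {
    // picks the HadoopShims implementation that matches the Hadoop version on the classpath
    HadoopShims shims = ShimLoader.getHadoopShims();
    System.out.println("Loaded shim: " + shims.getClass().getName());
  }
}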

 

 

 

 

 

8       Ref

 

 

 

 

<http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/>

<http://wiki.apache.org/hadoop/Hive>

<http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010>

 

 

 

 
