Getting Started with ORC and HCatalog

最新推荐文章于 2021-01-18 09:40:57 发布

rocksword

最新推荐文章于 2021-01-18 09:40:57 发布

阅读量448

点赞数

分类专栏： Hadoop 文章标签： Hadoop Bigdata

Hadoop 专栏收录该内容

26 篇文章 0 订阅

订阅专栏

ORC (Optimized Row Columnar) is a columnar file format optimized to improve performance of Hive. Through the Hive metastore and HCatalog reading, writing, and processing can also be accomplished by MapReduce, Pig, Cascading, and so on. It is very similar to Parquet which is being developed by Cloudera and Twitter. Both are part of the most current Hive release and available to be used immediately. In this post I would like to describe some of the key concepts of ORC and demonstrate how to get started quickly using HCatalog.

The key concepts of ORC files are:

Split the Rows Into Strips (default size 250MB)
Splitting the data into independent Strips allows to access the data concurrently using separate RecordReaders. Strips can be split without scanning for markers.
Use Lightweight Indexes
Strips are surrounded be an lightweight index and footer. The index, so called strip index, is applied to the columns contained within the Strip. This makes for really fast filtering as records can be skipped not contained in the query.
Compress Column Values
Column values are compressed so that not the complete row or Strip has to be decompressed prior to reading but only the values that are potential of interest. Supported codecs are Snappy, Zlib, or none.
Support for Hive’s Data Types
ORC supports Hive types like datetime, decimals, and so on. For Hive data types you should check which are supported by you current version and also if they are supported by HCatalog if you plan to use it outside of Hive!

There are some more aspects to ORC files, most notable a custom serialization that helps on speed and space. This includes ZigZag, dictionary, and variable-length encoding. All of that is very similar to what Parquet uses and the most notable differences seems to be the Parquet’s emphasis on deeply nested data as well as the use of Thrift in Parquet overProtocol Buffers in ORC.

Here is a visiualization provided be Hortonworks which shows some of the compression benefits you could gain from using ORC files.

Source: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

In general to achieve a good compression using Zlib, which is the default for ORC, over Snappy is a good choice. This comes likely at sacrificing speed.

Creating a first ORC Backed Table in Hive

As ORC (Ser/Deser) is included with the most recent release of Hive using it is very simple. As a quick example we are going to model crawled Blog posts to be stored in our database web. This could be modeled for example like this:

 
PgSQL
 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
         8 
       
         9 
       
         10 
       
         11 
       
         12 
       
         13 
       
        CREATE 
          
        TABLE 
         web.blogs( 
       
          uri 
         STRING, 
       
        host 
         STRING, 
       
        type 
          
        INT, 
       
          publish_date 
          
        DATE, 
       
          crawled 
          
        TIMESTAMP, 
       
          title 
         STRING, 
       
        text 
         STRING, 
       
          comments 
          
        ARRAY 
        <STRUCT 
        <author:STRING, 
          
        text:STRING, 
         published: 
          
        DATE 
        > 
        > 
       
         ) 
       
         PARTITIONED 
          
        BY 
         (crawl_day 
         STRING) 
       
         STORED 
          
        AS 
         ORC 
       
         TBLPROPERTIES 
         ( 
        "orc.compress"= 
        "SNAPPY");

This creates us a ORC backed table in Hive which uses Snappy compression instead of Zlib. The table will be partitioned by the day each blog post was crawled (Y-m-d). HCatalog only supports partitions being of type string. Partitions are basically folders in HDFS and and the column equals the partition values. That’s why it needs to be a string here and we use two columns to store the crawl date/day. There are some other table options (TBLEPROPERTIES) specific to ORC that might interest you:

orc.compress (default: zlib)
orc.compress.size (default: 262144 Bytes)
orc.stripe.size (default: 256 MB)
orc.row.index.stride (default: 10,000)
orc.create.index (default: true)

A blog post as consists of a URI, text, author and so on as well as the comments to the each post. We use a STRUCT to include the comments within each post. To create the table in our database we can execute the following command using hcat :

 
         1 
       
        hcat 
          
        - 
        e 
          
        'CREATE TABLE ...' 
          
        || 
          
        hcat 
          
        - 
        f 
          
        create_blog 
        . 
        hcat

We include here the data definition from above to achieve the desired result of creating our table in Hive/HCatalog by using -e or we put the above statement into a file and use -f . Refer to hcat --help for more information.

From here we can use either Pig, Sqoop, Hive, MapReduce, or Cascading to update and integrate data to our table. Let’s have a quick look at how we can use either Pig to insert and query some data. For this we’ll use a small set of data we can experiment with. Here are the blog posts and comments we are going to use for this example.

Posts (post.tsv):

 
    
 
    
 
    
 
    
 
    
 
    
 
   
 
 
  
         1 
       

         2 
       

         3 
       
 
        / 
        hadoop 
        / 
        mr  
        hortonworks 
        . 
        com 
          
        2 
          
        2014 
        - 
        04 
        - 
        01T10 
        : 
        33 
        : 
        24.123 
        - 
        0800 
          
        2014 
        - 
        04 
        - 
        01T01 
        : 
        03 
        : 
        44.123 
        + 
        0000 
          
        MR  
        Title  
        MR  
        Text 
       
 
        / 
        hadoop 
        / 
        cdh54  
        cloudera 
        . 
        com 
          
        3 
          
        2014 
        - 
        04 
        - 
        11T12 
        : 
        13 
        : 
        24.125 
        - 
        0800 
          
        2014 
        - 
        04 
        - 
        01T01 
        : 
        03 
        : 
        44.523 
        + 
        0000 
          
        CDH 
          
        54 
          
        CDH  
        Text 
       
 
        / 
        ml 
        / 
        train  
        mapr 
        . 
        com 
          
        3 
          
        2014 
        - 
        04 
        - 
        09T17 
        : 
        53 
        : 
        44.123 
        - 
        0800 
          
        2014 
        - 
        04 
        - 
        01T01 
        : 
        03 
        : 
        45.003 
        + 
        0000 
          
        SVM  
        SVM  
        Train 
       
 
 

Comments (comments.tsv):

 
   
 
 
  
         1 
       

         2 
       
 
        / 
        hadoop 
        / 
        mr  
        hortonworks 
        . 
        com  
        hue  
        nice 
        ! 
          
        2014 
        - 
        04 
        - 
        02T11 
        : 
        33 
        : 
        24.567 
        - 
        0800 
       
 
        / 
        hadoop 
        / 
        mr  
        hortonworks 
        . 
        com  
        sqoop  
        nice 
        ! 
          
        2014 
        - 
        04 
        - 
        02T12 
        : 
        33 
        : 
        24.765 
        - 
        0800 
       
 
 

We have posts and comments in separate files. For simplicity the values are tab separated so we can use the default PigLoader. Please keep in mind that the defaultPigLoader does not supports proper escaping and would in such a case be a bad choice as title and text could likely contain tabs as well as line breaks.

Since posts and comments are in separated files we have to group them together. This sometimes does not very look nice in Pig and requires some re-naming. Pig appends in some cases the name of a tuple to the values to make them unique. This is often ignored so that posts::uri and uri are the same, but here unfortunately not. But a re-naming doesn’t hurt either as it will be part of any appropriate map or reduce step and doesn’t require a separated step. Let’s have a look on how we could load the posts and comments and then store them into Hive:

 
PgSQL
 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
         8 
       
         9 
       
         10 
       
         11 
       
         12 
       
         13 
       
         14 
       
         15 
       
         16 
       
         17 
       
         18 
       
         19 
       
         20 
       
         21 
       
         22 
       
         23 
       
         24 
       
         25 
       
         26 
       
         27 
       
         28 
       
         29 
       
         30 
       
         31 
       
         32 
       
         33 
       
        -- load posts 
       
         posts 
         = 
          
        LOAD 
          
        'posts.tsv' 
          
        AS 
         ( 
       
           uri: 
         chararray, 
          
        host: 
         chararray, 
          
        type: 
          
        int, 
          
           publish_date: 
         datetime, 
          
           crawled: 
         datetime, 
          
           title: 
         chararray, 
          
        text: 
         chararray); 
       
        -- load comments 
       
         comments 
         = 
          
        LOAD 
          
        'comments.tsv' 
          
        AS 
         ( 
       
           uri: 
         chararray, 
          
        host: 
         chararray, 
          
           author: 
         chararray, 
          
        text: 
         chararray, 
          
           published: 
         datetime); 
       
        -- group posts and comments together 
       
         posts_comments 
         = 
         COGROUP 
          
           posts 
          
        BY 
         (uri, 
          
        host), 
          
           comments 
          
        BY 
         (uri, 
          
        host); 
       
        -- prepare expected output 
       
         blog_posts 
         = 
         FOREACH 
         posts_comments 
         { 
       
           coms 
         = 
         FOREACH 
         comments 
         GENERATE 
         author, 
          
        text, 
         published; 
       
           post 
         = 
          
        LIMIT 
         posts 
         1; 
                
           GENERATE 
         FLATTEN(post), 
         coms 
          
        AS 
         comments; 
       
         } 
       
         blog_posts 
         = 
         FOREACH 
         blog_posts 
         GENERATE 
         post::uri 
          
        AS 
         uri, 
         post:: 
        host 
          
        AS 
          
        host, 
         post:: 
        type 
          
        AS 
          
        type, 
          
         post::publish_date 
          
        AS 
         publish_date, 
         post::crawled 
          
        AS 
         crawled, 
         post::title 
          
        AS 
         title, 
         post:: 
        text 
          
        AS 
          
        text, 
         comments; 
       
         STORE 
         blog_posts 
          
        INTO 
          
        'web.blogs' 
          
        USING 
         org.apache.hive.hcatalog.pig.HCatStorer( 
        'crawl_day=2014-05-05');

To execute the above you have to run Pig with HCatalog support. It does not by default include the required libraries. In addition depending on your setup it might be required to point to the metastore of Hive. You can do that with the following command:

 
         1 
       
        pig 
          
        - 
        useHCatalog 
          
        [ 
        - 
        Dhive 
        . 
        metastore 
        . 
        uris 
        = 
        thrift 
        : 
        //metastore.server:9083]

With -useHCatolog Pig can make use of HCatalog and here in addition we also point to the location of the Hive metastore.

That’s it! And you should be able to execute the following:

 
    
 
    
 
    
 
    
 
    
 
    
 
   
 
 
  
         1 
       

         2 
       

         3 
       

         4 
       

         5 
       

         6 
       

         7 
       

         8 
       

         9 
       
 
        hive 
        > 
          
        use 
          
        web 
        ; 
       
 
        OK 
       
 
        Time  
        taken 
        : 
          
        2.936 
          
        seconds 
       
 
        hive 
        > 
          
        select * 
          
        from  
        blogs 
        ; 
       
 
        OK 
       
 
        / 
        ml 
        / 
        train  
        mapr 
        . 
        com 
          
        3 
          
        NULL 
          
        2014 
        - 
        03 
        - 
        31 
          
        18 
        : 
        03 
        : 
        45.003 
          
        SVM  
        SVM  
        Train 
          
        [ 
        ] 
          
        2014 
        - 
        05 
        - 
        05 
       
 
        / 
        hadoop 
        / 
        mr  
        hortonworks 
        . 
        com 
          
        2 
          
        NULL 
          
        2014 
        - 
        03 
        - 
        31 
          
        18 
        : 
        03 
        : 
        44.123 
          
        MR 
          
        Title 
          
        MR 
          
        Text 
          
        [ 
        { 
        "author" 
        : 
        "sqoop" 
        , 
        "text" 
        : 
        "nice!" 
        , 
        "published" 
        : 
        null 
        } 
        , 
        { 
        "author" 
        : 
        "hue" 
        , 
        "text" 
        : 
        "nice!" 
        , 
        "published" 
        : 
        null 
        } 
        ] 
          
        2014 
        - 
        05 
        - 
        05 
       
 
        / 
        hadoop 
        / 
        cdh54  
        cloudera 
        . 
        com 
          
        3 
          
        NULL 
          
        2014 
        - 
        03 
        - 
        31 
          
        18 
        : 
        03 
        : 
        44.523 
          
        CDH 
          
        54 
          
        CDH  
        Text 
          
        [ 
        ] 
          
        2014 
        - 
        05 
        - 
        05 
       
 
        Time  
        taken 
        : 
          
        3.124 
          
        seconds 
        , 
          
        Fetched 
        : 
          
        3 
          
        row 
        ( 
        s 
        )