How-to: Use a SerDe in Apache Hive

Reposted from: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semi-structured and unstructured data can be queried gracefully via Hive, thanks to two core features: the first is Hive’s support for complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases; the second is the SerDe.

What is a SerDe?

The SerDe interface allows you to instruct Hive how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. The Serializer, conversely, takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

In this article, we will examine a SerDe for processing JSON data, which can be used to transform a JSON record into something that Hive can process.

Developing a SerDe

To start, we can write a basic template for a SerDe, which utilizes the Hive serde2 API (org.apache.hadoop.hive.serde2). This API should be used in favor of the older serde API, which has been deprecated:

package com.cloudera.hive.serde;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/**
* A template for a custom Hive SerDe
*/
public class BoilerplateSerDe implements SerDe {
 
 private StructTypeInfo rowTypeInfo;
 private ObjectInspector rowOI;
 private List<String> colNames;
 private List<Object> row = new ArrayList<Object>();

 /**
  * An initialization function used to gather information about the table.
  * Typically, a SerDe implementation will be interested in the list of
  * column names and their types. That information will be used to help 
  * perform actual serialization and deserialization of data.
  */
 @Override
 public void initialize(Configuration conf, Properties tbl)
     throws SerDeException {
   // Get a list of the table's column names.
   String colNamesStr = tbl.getProperty(Constants.LIST_COLUMNS);
   colNames = Arrays.asList(colNamesStr.split(","));
  
   // Get a list of TypeInfos for the columns. This list lines up with
   // the list of column names.
   String colTypesStr = tbl.getProperty(Constants.LIST_COLUMN_TYPES);
   List<TypeInfo> colTypes =
       TypeInfoUtils.getTypeInfosFromTypeString(colTypesStr);
  
   rowTypeInfo =
       (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(colNames, colTypes);
   rowOI =
       TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(rowTypeInfo);
 }

 /**
  * This method does the work of deserializing a record into Java objects
  * that Hive can work with via the ObjectInspector interface.
  */
 @Override
 public Object deserialize(Writable blob) throws SerDeException {
   row.clear();
   // Do work to turn the fields in the blob into a set of row fields
   return row;
 }

 /**
  * Return an ObjectInspector for the row of data
  */
 @Override
 public ObjectInspector getObjectInspector() throws SerDeException {
   return rowOI;
 }

 /**
  * Unimplemented
  */
 @Override
 public SerDeStats getSerDeStats() {
   return null;
 }

 /**
  * Return the class that stores the serialized data representation.
  */
 @Override
 public Class<? extends Writable> getSerializedClass() {
   return Text.class;
 }

 /**
  * This method takes an object representing a row of data from Hive, and
  * uses the ObjectInspector to get the data for each column and serialize
  * it.
  */
 @Override
 public Writable serialize(Object obj, ObjectInspector oi)
     throws SerDeException {
   // Take the object and transform it into a serialized representation
   return new Text();
 }
}

Breaking this down a bit, the initialize() method is called only once and gathers some commonly used pieces of information from the table properties, such as the column names and types. Using the type info of the row, you can instantiate an ObjectInspector for the row (ObjectInspectors are Hive objects used to describe and examine complex type hierarchies). The two other important methods are serialize() and deserialize(), which do the namesake work of the SerDe.
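
To make this concrete (an illustration, not from the original post): for a table declared as CREATE TABLE t (id BIGINT, name STRING, tags ARRAY<STRING>), initialize() would receive table properties along the lines of the following, where Constants.LIST_COLUMNS and Constants.LIST_COLUMN_TYPES are the property keys "columns" and "columns.types" (the exact type-list separator can vary by Hive version):

columns=id,name,tags
columns.types=bigint:string:array<string>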

In a SerDe, the serialize() method takes a Java object representing a row of data, and converts that object into a serialized representation of the row. The serialized class is determined by the return type of getSerializedClass(). In the JSONSerDe, the serialize() method converts the object into a JSON string represented by a Text object. To do the serialization from Java into JSON, I’ve opted to use the Jackson JSON library, which allows me to convert a Java object to a JSON string with just a small amount of code:

ObjectMapper mapper = new ObjectMapper();
// Let Jackson do the work of serializing the object
return new Text(mapper.writeValueAsString(deparsedObj));
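
For context, here is a sketch of how serialize() might be wired up end to end (an illustration consistent with the template above, not the verbatim JSONSerDe code; ObjectMapper is org.codehaus.jackson.map.ObjectMapper in Jackson 1.x, or com.fasterxml.jackson.databind.ObjectMapper in Jackson 2.x, and deparseObject() is examined below):

@Override
public Writable serialize(Object obj, ObjectInspector oi)
    throws SerDeException {
  // Convert Hive's internal row representation into plain Java
  // objects (Maps, Lists, primitives) that Jackson understands.
  Object deparsed = deparseObject(obj, oi);
  ObjectMapper mapper = new ObjectMapper();
  try {
    // Let Jackson render the plain Java object graph as a JSON string.
    return new Text(mapper.writeValueAsString(deparsed));
  } catch (Exception e) {
    throw new SerDeException(e);
  }
}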

Jackson understands how to convert basic Java objects like Maps, Lists, and primitives into JSON strings. However, the Java object that is passed into the serialize() method is an internal Hive representation of a row, which Jackson can’t work with. The goal here is to use the ObjectInspector to interpret the Hive object, and convert it into a more basic Java representation.

In the JSONSerDe code, this process is broken up into a number of methods. The control flow is fairly simple, so let’s just examine some of the interesting pieces:

private Object deparseObject(Object obj, ObjectInspector oi) {
 switch (oi.getCategory()) {
 case LIST:
   return deparseList(obj, (ListObjectInspector)oi);
 case MAP:
   return deparseMap(obj, (MapObjectInspector)oi);
 case PRIMITIVE:
   return deparsePrimitive(obj, (PrimitiveObjectInspector)oi);
 case STRUCT:
   return deparseStruct(obj, (StructObjectInspector)oi, false);
 case UNION:
   // Unsupported by JSON
 default:
   return null;
 }
}

The deparseObject() method is nothing more than a fork in the road. ObjectInspectors have a category, which identifies the underlying subtype of the inspector.

private Object deparseList(Object obj, ListObjectInspector listOI) {
 List<Object> list = new ArrayList<Object>();
 List<?> field = listOI.getList(obj);
 ObjectInspector elemOI = listOI.getListElementObjectInspector();
 for (Object elem : field) {
   list.add(deparseObject(elem, elemOI));
 }
 return list;
}

In the deparseList() method, your goal is to translate a Hive list field into a Java List, which Jackson will render as a JSON array. In order to do this properly, you also need to deparse each of the list elements. Fortunately, you can obtain an ObjectInspector specifically for the list elements from the ListObjectInspector. You can follow this same pattern with all the other Hive data types to fully translate the object, and then let Jackson do the work of writing out a JSON object.
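
To illustrate the same pattern for structs, here is a sketch of a deparseStruct() method (an approximation, not the verbatim JSONSerDe code; the boolean flag passed from deparseObject() distinguishes the top-level row in the real implementation and is ignored here). It renders the struct as a Map from field names to deparsed values, which Jackson will write out as a JSON object:

private Object deparseStruct(Object obj, StructObjectInspector structOI,
    boolean isRow) {
  Map<String, Object> struct = new HashMap<String, Object>();
  // Deparse each struct field recursively, using the field's own
  // ObjectInspector, and key it by the field name.
  for (StructField field : structOI.getAllStructFieldRefs()) {
    Object fieldObj = structOI.getStructFieldData(obj, field);
    struct.put(field.getFieldName(),
        deparseObject(fieldObj, field.getFieldObjectInspector()));
  }
  return struct;
}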

The opposite of serialize() is deserialize(). The deserialize() method takes a JSON string, and converts it into a Java object that Hive can process. Again, you can use Jackson to do most of the heavy lifting. Jackson will convert a JSON record into a Java Map with just a couple lines of code:

ObjectMapper mapper = new ObjectMapper();
// This is really a Map<String, Object>. For more information about how
// Jackson parses JSON in this example, see
// http://wiki.fasterxml.com/JacksonDataBinding
Map<?, ?> root = mapper.readValue(blob.toString(), Map.class);

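Tying this together, deserialize() might look like the following sketch (an illustration, not the verbatim JSONSerDe code; parseField() is a type dispatcher analogous to deparseObject(), one case of which is shown below):

@Override
public Object deserialize(Writable blob) throws SerDeException {
  Map<?, ?> root;
  try {
    ObjectMapper mapper = new ObjectMapper();
    root = mapper.readValue(blob.toString(), Map.class);
  } catch (Exception e) {
    throw new SerDeException(e);
  }
  row.clear();
  // Populate the row in column order; JSON fields missing from the
  // record simply come back as null.
  for (String colName : colNames) {
    row.add(parseField(root.get(colName),
        rowTypeInfo.getStructFieldTypeInfo(colName)));
  }
  return row;
}
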
When deserializing, you need information from Hive about what type of data each field contains. You can use the TypeInfo API similarly to how we used ObjectInspectors while serializing. Looking again at handling a list type:

private Object parseList(Object field, ListTypeInfo fieldTypeInfo) {
  ArrayList<Object> list = (ArrayList<Object>) field;
  TypeInfo elemTypeInfo = fieldTypeInfo.getListElementTypeInfo();

  for (int i = 0; i < list.size(); i++) {
    list.set(i, parseField(list.get(i), elemTypeInfo));
  }

  return list.toArray();
}

Also like ObjectInspectors, TypeInfos have subtypes. For a Hive list field, the TypeInfo is actually a ListTypeInfo, which you can also use to determine the type of the list elements. You can parse each list element one by one, and return the necessary array.
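
Structs follow the same pattern. Here is a sketch of a parseStruct() helper (again an approximation consistent with the parseField() dispatcher, not the verbatim JSONSerDe code). Note how a field that is absent from the JSON map simply parses to null, which is where the NULLs for missing fields come from:

private Object parseStruct(Object field, StructTypeInfo typeInfo) {
  Map<?, ?> map = (Map<?, ?>) field;
  List<String> fieldNames = typeInfo.getAllStructFieldNames();
  // Build the struct's values in declared-field order; fields missing
  // from the JSON object come back as null.
  List<Object> values = new ArrayList<Object>(fieldNames.size());
  for (String fieldName : fieldNames) {
    values.add(parseField(map.get(fieldName),
        typeInfo.getStructFieldTypeInfo(fieldName)));
  }
  return values;
}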

Using the SerDe

Tables can be configured to process data using a SerDe by specifying the SerDe to use at table creation time, or through the use of an ALTER TABLE statement. For example:

ADD JAR /tmp/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE tweets (
   ...
    retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
    entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
    text STRING,
   ...
  )
  PARTITIONED BY (datehour INT)
  ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
  LOCATION '/user/flume/tweets';

The ROW FORMAT SERDE line of the above CREATE TABLE statement shows how a table is configured to use a SerDe. If the SerDe is not on the Hive classpath, it must be added at runtime using the ADD JAR command. It should be noted that one limitation of the JSONSerDe is that the Hive column names must match the JSON field names. JSON fields that are not present in the table definition will be ignored, and records that lack certain fields will return NULLs for those fields. As an example, the raw data that the above fields refer to looks like this:

{
     "retweeted_status": {
        "contributors": null,
        "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
        "geo": null,
        "retweeted": false,
        "in_reply_to_screen_name": null,
        "truncated": false,
        "entities": {
           "urls": [],
           "hashtags": [
              {
                 "text": "Crowdsourcing",
                 "indices": [
                    0,
                    14
                 ]
              },
              {
                 "text": "bigdata",
                 "indices": [
                    129,
                    137
                 ]
              }
           ],
           "user_mentions": []
        },
        "in_reply_to_status_id_str": null,
        "id": 245255511388336128,
        "in_reply_to_user_id_str": null,
        "source": "SocialOomph",
        "favorited": false,
        "in_reply_to_status_id": null,
        "in_reply_to_user_id": null,
        "retweet_count": 0,
        "created_at": "Mon Sep 10 20:20:45 +0000 2012",
        "id_str": "245255511388336128",
        "place": null,
        "user": {
           "location": "Oregon, ",
           "default_profile": false,
           "statuses_count": 5289,
           "profile_background_tile": false,
           "lang": "en",
           "profile_link_color": "627E91",
           "id": 347471575,
           "following": null,
           "protected": false,
           "favourites_count": 17,
           "profile_text_color": "D4B020",
           "verified": false,
           "description": "Dad, Innovator, Sales Professional. Project Management Professional (PMP).  Soccer Coach,  Little League Coach  #Agile #PMOT - views are my own -",
           "contributors_enabled": false,
           "name": "Scott Ostby",
           "profile_sidebar_border_color": "404040",
           "profile_background_color": "0F0F0F",
           "created_at": "Tue Aug 02 21:10:39 +0000 2011",
           "default_profile_image": false,
           "followers_count": 19005,
           "profile_image_url_https": "https://si0.twimg.com/profile_images/1928022765/scott_normal.jpg",
           "geo_enabled": true,
           "profile_background_image_url": "http://a0.twimg.com/profile_background_images/327807929/xce5b8c5dfff3dc3bbfbdef5ca2a62b4.jpg",
           "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/327807929/xce5b8c5dfff3dc3bbfbdef5ca2a62b4.jpg",
           "follow_request_sent": null,
           "url": "http://facebook.com/ostby",
           "utc_offset": -28800,
           "time_zone": "Pacific Time (US & Canada)",
           "notifications": null,
           "friends_count": 13172,
           "profile_use_background_image": true,
           "profile_sidebar_fill_color": "1C1C1C",
           "screen_name": "ScottOstby",
           "id_str": "347471575",
           "profile_image_url": "http://a0.twimg.com/profile_images/1928022765/scott_normal.jpg",
           "show_all_inline_media": true,
           "is_translator": false,
           "listed_count": 45
        },
        "coordinates": null
     },
     "contributors": null,
     "text": "RT @ScottOstby: #Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is  ...",
     "geo": null,
     "retweeted": false,
     "in_reply_to_screen_name": null,
     "truncated": false,
     "entities": {
        "urls": [],
        "hashtags": [
           {
              "text": "Crowdsourcing",
              "indices": [
                 16,
                 30
              ]
           }
        ],
        "user_mentions": [
           {
              "id": 347471575,
              "name": "Scott Ostby",
              "indices": [
                 3,
                 14
              ],
              "screen_name": "ScottOstby",
              "id_str": "347471575"
           }
        ]
     },
     "in_reply_to_status_id_str": null,
     "id": 245270269525123072,
     "in_reply_to_user_id_str": null,
     "source": "web",
     "favorited": false,
     "in_reply_to_status_id": null,
     "in_reply_to_user_id": null,
     "retweet_count": 0,
     "created_at": "Mon Sep 10 21:19:23 +0000 2012",
     "id_str": "245270269525123072",
     "place": null,
     "user": {
        "location": "",
        "default_profile": true,
        "statuses_count": 1294,
        "profile_background_tile": false,
        "lang": "en",
        "profile_link_color": "0084B4",
        "id": 21804678,
        "following": null,
        "protected": false,
        "favourites_count": 11,
        "profile_text_color": "333333",
        "verified": false,
        "description": "",
        "contributors_enabled": false,
        "name": "Parvez Jugon",
        "profile_sidebar_border_color": "C0DEED",
        "profile_background_color": "C0DEED",
        "created_at": "Tue Feb 24 22:10:43 +0000 2009",
        "default_profile_image": false,
        "followers_count": 70,
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2280737846/ni91dkogtgwp1or5rwp4_normal.gif",
        "geo_enabled": false,
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme1/bg.png",
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme1/bg.png",
        "follow_request_sent": null,
        "url": null,
        "utc_offset": null,
        "time_zone": null,
        "notifications": null,
        "friends_count": 299,
        "profile_use_background_image": true,
        "profile_sidebar_fill_color": "DDEEF6",
        "screen_name": "ParvezJugon",
        "id_str": "21804678",
        "profile_image_url": "http://a0.twimg.com/profile_images/2280737846/ni91dkogtgwp1or5rwp4_normal.gif",
        "show_all_inline_media": false,
        "is_translator": false,
        "listed_count": 7
     },
     "coordinates": null
  }
 

Once the table is set up, querying complex data is as simple as SQL:

hive> SELECT * FROM tweets;
OK
id	created_at	source	favorited	retweet_count	retweeted_status	entities	text	user	in_reply_to_screen_name
245270269525123072	Mon Sep 10 21:19:23 +0000 2012	web	false	0	{"text":"#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata","user":{"screen_name":"ScottOstby","name":"Scott Ostby"}}	{"urls":[],"user_mentions":[{"screen_name":"ScottOstby","name":"Scott Ostby"}],"hashtags":[{"text":"Crowdsourcing"}]}	RT @ScottOstby: #Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is  ...	{"screen_name":"ParvezJugon","name":"Parvez Jugon","friends_count":299,"followers_count":70,"statuses_count":1294,"verified":false,"utc_offset":null,"time_zone":null}	NULL
Time taken: 3.07 seconds
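
Individual fields within the complex columns can be addressed with dot notation for structs and bracket indexing for arrays. For example (a hypothetical query using the column names from the table above):

hive> SELECT user.screen_name,
             retweeted_status.user.name,
             entities.hashtags[0].text
      FROM tweets;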

If you’re interested in getting the code to use the JSONSerDe in a real environment, see the CDH Twitter Example on the Cloudera Github.

Conclusion

The SerDe interface is extremely powerful for dealing with data with a complex schema. By utilizing SerDes, any dataset can be made queryable through Hive. For a full use case involving the JSONSerDe discussed in this article, see Analyzing Twitter with Apache Hadoop.

Jon Natkins is a Software Engineer at Cloudera, working on the Enterprise team.
