How-to: Use a SerDe in Apache Hive

Reposted from: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semi-structured and unstructured data can be queried gracefully via Hive, thanks to two core features: the first is Hive’s support for complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases; the second is the SerDe.

What is a SerDe?

The SerDe interface allows you to instruct Hive how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. The Serializer, conversely, takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

In this article, we will examine a SerDe for processing JSON data, which can be used to transform a JSON record into something that Hive can process.

Developing a SerDe

To start, we can write a basic template for a SerDe, which utilizes the Hive serde2 API (org.apache.hadoop.hive.serde2). This API should be used in favor of the older serde API, which has been deprecated:

package com.cloudera.hive.serde;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/**
* A template for a custom Hive SerDe
*/
public class BoilerplateSerDe implements SerDe {
 
 private StructTypeInfo rowTypeInfo;
 private ObjectInspector rowOI;
 private List<String> colNames;
 private List<Object> row = new ArrayList<Object>();

 /**
  * An initialization function used to gather information about the table.
  * Typically, a SerDe implementation will be interested in the list of
  * column names and their types. That information will be used to help 
  * perform actual serialization and deserialization of data.
  */
 @Override
 public void initialize(Configuration conf, Properties tbl)
     throws SerDeException {
   // Get a list of the table's column names.
   String colNamesStr = tbl.getProperty(Constants.LIST_COLUMNS);
   colNames = Arrays.asList(colNamesStr.split(","));
  
   // Get a list of TypeInfos for the columns. This list lines up with
   // the list of column names.
   String colTypesStr = tbl.getProperty(Constants.LIST_COLUMN_TYPES);
   List<TypeInfo> colTypes =
       TypeInfoUtils.getTypeInfosFromTypeString(colTypesStr);
  
   rowTypeInfo =
       (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(colNames, colTypes);
   rowOI =
       TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(rowTypeInfo);
 }

 /**
  * This method does the work of deserializing a record into Java objects
  * that Hive can work with via the ObjectInspector interface.
  */
 @Override
 public Object deserialize(Writable blob) throws SerDeException {
   row.clear();
   // Do work to turn the fields in the blob into a set of row fields
   return row;
 }

 /**
  * Return an ObjectInspector for the row of data
  */
 @Override
 public ObjectInspector getObjectInspector() throws SerDeException {
   return rowOI;
 }

 /**
  * Unimplemented
  */
 @Override
 public SerDeStats getSerDeStats() {
   return null;
 }

 /**
  * Return the class that stores the serialized data representation.
  */
 @Override
 public Class<? extends Writable> getSerializedClass() {
   return Text.class;
 }

 /**
  * This method takes an object representing a row of data from Hive, and
  * uses the ObjectInspector to get the data for each column and serialize
  * it.
  */
 @Override
 public Writable serialize(Object obj, ObjectInspector oi)
     throws SerDeException {
   // Take the object and transform it into a serialized representation
   return new Text();
 }
}

Breaking this down a bit, the initialize() method is called only once and gathers some commonly used pieces of information from the table properties, such as the column names and types. Using the type info of the row, you can instantiate an ObjectInspector for the row (ObjectInspectors are Hive objects used to describe and examine complex type hierarchies). The two other important methods are serialize() and deserialize(), which do the namesake work of the SerDe.
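
To make this concrete (an illustration, not from the original post): for a table declared as CREATE TABLE t (id BIGINT, name STRING, tags ARRAY<STRING>), initialize() would receive table properties along the lines of the following, where Constants.LIST_COLUMNS and Constants.LIST_COLUMN_TYPES are the property keys "columns" and "columns.types" (the exact type-list separator can vary by Hive version):

columns=id,name,tags
columns.types=bigint:string:array<string>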

In a SerDe, the serialize() method takes a Java object representing a row of data, and converts that object into a serialized representation of the row. The serialized class is determined by the return type of getSerializedClass(). In the JSONSerDe, the serialize() method converts the object into a JSON string represented by a Text object. To do the serialization from Java into JSON, I’ve opted to use the Jackson JSON library, which allows me to convert a Java object to a JSON string with just a small amount of code:

ObjectMapper mapper = new ObjectMapper();
// Let Jackson do the work of serializing the object
return new Text(mapper.writeValueAsString(deparsedObj));
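
For context, here is a sketch of how serialize() might be wired up end to end (an illustration consistent with the template above, not the verbatim JSONSerDe code; ObjectMapper is org.codehaus.jackson.map.ObjectMapper in Jackson 1.x, or com.fasterxml.jackson.databind.ObjectMapper in Jackson 2.x, and deparseObject() is examined below):

@Override
public Writable serialize(Object obj, ObjectInspector oi)
    throws SerDeException {
  // Convert Hive's internal row representation into plain Java
  // objects (Maps, Lists, primitives) that Jackson understands.
  Object deparsed = deparseObject(obj, oi);
  ObjectMapper mapper = new ObjectMapper();
  try {
    // Let Jackson render the plain Java object graph as a JSON string.
    return new Text(mapper.writeValueAsString(deparsed));
  } catch (Exception e) {
    throw new SerDeException(e);
  }
}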

Jackson understands how to convert basic Java objects like Maps, Lists, and primitives into JSON strings. However, the Java object that is passed into the serialize() method is an internal Hive representation of a row, which Jackson can’t work with. The goal here is to use the ObjectInspector to interpret the Hive object, and convert it into a more basic Java representation.

In the JSONSerDe code, this process is broken up into a number of methods. The control flow is fairly simple, so let’s just examine some of the interesting pieces:

private Object deparseObject(Object obj, ObjectInspector oi) {
 switch (oi.getCategory()) {
 case LIST:
   return deparseList(obj, (ListObjectInspector)oi);
 case MAP:
   return deparseMap(obj, (MapObjectInspector)oi);
 case PRIMITIVE:
   return deparsePrimitive(obj, (PrimitiveObjectInspector)oi);
 case STRUCT:
   return deparseStruct(obj, (StructObjectInspector)oi, false);
 case UNION:
   // Unsupported by JSON
 default:
   return null;
 }
}

The deparseObject() method is nothing more than a fork in the road. ObjectInspectors have a category, which identifies the underlying subtype of the inspector.

private Object deparseList(Object obj, ListObjectInspector listOI) {
 List<Object> list = new ArrayList<Object>();
 List<?> field = listOI.getList(obj);
 ObjectInspector elemOI = listOI.getListElementObjectInspector();
 for (Object elem : field) {
   list.add(deparseObject(elem, elemOI));
 }
 return list;
}

In the deparseList() method, your goal is to translate a Hive list field into a Java List, which Jackson will render as a JSON array. In order to do this properly, you also need to deparse each of the list elements. Fortunately, you can obtain an ObjectInspector specifically for the list elements from the ListObjectInspector. You can follow this same pattern with all the other Hive data types to fully translate the object, and then let Jackson do the work of writing out a JSON object.
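
To illustrate the same pattern for structs, here is a sketch of a deparseStruct() method (an approximation, not the verbatim JSONSerDe code; the boolean flag passed from deparseObject() distinguishes the top-level row in the real implementation and is ignored here). It renders the struct as a Map from field names to deparsed values, which Jackson will write out as a JSON object:

private Object deparseStruct(Object obj, StructObjectInspector structOI,
    boolean isRow) {
  Map<String, Object> struct = new HashMap<String, Object>();
  // Deparse each struct field recursively, using the field's own
  // ObjectInspector, and key it by the field name.
  for (StructField field : structOI.getAllStructFieldRefs()) {
    Object fieldObj = structOI.getStructFieldData(obj, field);
    struct.put(field.getFieldName(),
        deparseObject(fieldObj, field.getFieldObjectInspector()));
  }
  return struct;
}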

The opposite of serialize() is deserialize(). The deserialize() method takes a JSON string, and converts it into a Java object that Hive can process. Again, you can use Jackson to do most of the heavy lifting. Jackson will convert a JSON record into a Java Map with just a couple lines of code:

ObjectMapper mapper = new ObjectMapper();
// This is really a Map<String, Object>. For more information about how
// Jackson parses JSON in this example, see
// http://wiki.fasterxml.com/JacksonDataBinding
Map<?, ?> root = mapper.readValue(blob.toString(), Map.class);

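Tying this together, deserialize() might look like the following sketch (an illustration, not the verbatim JSONSerDe code; parseField() is a type dispatcher analogous to deparseObject(), one case of which is shown below):

@Override
public Object deserialize(Writable blob) throws SerDeException {
  Map<?, ?> root;
  try {
    ObjectMapper mapper = new ObjectMapper();
    root = mapper.readValue(blob.toString(), Map.class);
  } catch (Exception e) {
    throw new SerDeException(e);
  }
  row.clear();
  // Populate the row in column order; JSON fields missing from the
  // record simply come back as null.
  for (String colName : colNames) {
    row.add(parseField(root.get(colName),
        rowTypeInfo.getStructFieldTypeInfo(colName)));
  }
  return row;
}
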
When deserializing, you need information from Hive about what type of data each field contains. You can use the TypeInfo API similarly to how we used ObjectInspectors while serializing. Looking again at handling a list type:

private Object parseList(Object field, ListTypeInfo fieldTypeInfo) {
  ArrayList<Object> list = (ArrayList<Object>) field;
  TypeInfo elemTypeInfo = fieldTypeInfo.getListElementTypeInfo();

  for (int i = 0; i < list.size(); i++) {
    list.set(i, parseField(list.get(i), elemTypeInfo));
  }

  return list.toArray();
}

Also like ObjectInspectors, TypeInfos have subtypes. For a Hive list field, the TypeInfo is actually a ListTypeInfo, which you can also use to determine the type of the list elements. You can parse each list element one by one, and return the necessary array.
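
Structs follow the same pattern. Here is a sketch of a parseStruct() helper (again an approximation consistent with the parseField() dispatcher, not the verbatim JSONSerDe code). Note how a field that is absent from the JSON map simply parses to null, which is where the NULLs for missing fields come from:

private Object parseStruct(Object field, StructTypeInfo typeInfo) {
  Map<?, ?> map = (Map<?, ?>) field;
  List<String> fieldNames = typeInfo.getAllStructFieldNames();
  // Build the struct's values in declared-field order; fields missing
  // from the JSON object come back as null.
  List<Object> values = new ArrayList<Object>(fieldNames.size());
  for (String fieldName : fieldNames) {
    values.add(parseField(map.get(fieldName),
        typeInfo.getStructFieldTypeInfo(fieldName)));
  }
  return values;
}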

Using the SerDe

Tables can be configured to process data using a SerDe by specifying the SerDe to use at table creation time, or through the use of an ALTER TABLE statement. For example:

ADD JAR /tmp/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE tweets (
   ...
    retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
    entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
    text STRING,
   ...
  )
  PARTITIONED BY (datehour INT)
  ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
  LOCATION '/user/flume/tweets';

The ROW FORMAT SERDE line of the above CREATE TABLE statement shows how a table is configured to use a SerDe. If the SerDe is not on the Hive classpath, it must be added at runtime using the ADD JAR command. It should be noted that one limitation of the JSONSerDe is that the Hive column names must match the JSON field names. JSON fields that are not present in the table definition will be ignored, and records that lack certain fields will return NULLs for those fields. As an example, the raw data that the above fields refer to looks like this:

{
     "retweeted_status": {
        "contributors": null,
        "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
        "geo": null,
        "retweeted": false,
        "in_reply_to_screen_name": null,
        "truncated": false,
        "entities": {
           "urls": [],
           "hashtags": [
              {
                 "text": "Crowdsourcing",
                 "indices": [
                    0,
                    14
                 ]
              },
              {
                 "text": "bigdata",
                 "indices": [
                    129,
                    137
                 ]
              }
           ],
           "user_mentions": []
        },
        "in_reply_to_status_id_str": null,
        "id": 245255511388336128,
        "in_reply_to_user_id_str": null,
        "source": "SocialOomph",
        "favorited": false,
        "in_reply_to_status_id": null,
        "in_reply_to_user_id": null,
        "retweet_count": 0,
        "created_at": "Mon Sep 10 20:20:45 +0000 2012",
        "id_str": "245255511388336128",
        "place": null,
        "user": {
           "location": "Oregon, ",
           "default_profile": false,
           "statuses_count": 5289,
           "profile_background_tile": false,
           "lang": "en",
           "profile_link_color": "627E91",
           "id": 347471575,
           "following": null,
           "protected": false,
           "favourites_count": 17,
           "profile_text_color": "D4B020",
           "verified": false,
           "description": "Dad, Innovator, Sales Professional. Project Management Professional (PMP).  Soccer Coach,  Little League Coach  #Agile #PMOT - views are my own -",
           "contributors_enabled": false,
           "name": "Scott Ostby",
           "profile_sidebar_border_color": "404040",
           "profile_background_color": "0F0F0F",
           "created_at": "Tue Aug 02 21:10:39 +0000 2011",
           "default_profile_image": false,
           "followers_count": 19005,
           "profile_image_url_https": "https://si0.twimg.com/profile_images/1928022765/scott_normal.jpg",
           "geo_enabled": true,
           "profile_background_image_url": "http://a0.twimg.com/profile_background_images/327807929/xce5b8c5dfff3dc3bbfbdef5ca2a62b4.jpg",
           "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/327807929/xce5b8c5dfff3dc3bbfbdef5ca2a62b4.jpg",
           "follow_request_sent": null,
           "url": "http://facebook.com/ostby",
           "utc_offset": -28800,
           "time_zone": "Pacific Time (US & Canada)",
           "notifications": null,
           "friends_count": 13172,
           "profile_use_background_image": true,
           "profile_sidebar_fill_color": "1C1C1C",
           "screen_name": "ScottOstby",
           "id_str": "347471575",
           "profile_image_url": "http://a0.twimg.com/profile_images/1928022765/scott_normal.jpg",
           "show_all_inline_media": true,
           "is_translator": false,
           "listed_count": 45
        },
        "coordinates": null
     },
     "contributors": null,
     "text": "RT @ScottOstby: #Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is  ...",
     "geo": null,
     "retweeted": false,
     "in_reply_to_screen_name": null,
     "truncated": false,
     "entities": {
        "urls": [],
        "hashtags": [
           {
              "text": "Crowdsourcing",
              "indices": [
                 16,
                 30
              ]
           }
        ],
        "user_mentions": [
           {
              "id": 347471575,
              "name": "Scott Ostby",
              "indices": [
                 3,
                 14
              ],
              "screen_name": "ScottOstby",
              "id_str": "347471575"
           }
        ]
     },
     "in_reply_to_status_id_str": null,
     "id": 245270269525123072,
     "in_reply_to_user_id_str": null,
     "source": "web",
     "favorited": false,
     "in_reply_to_status_id": null,
     "in_reply_to_user_id": null,
     "retweet_count": 0,
     "created_at": "Mon Sep 10 21:19:23 +0000 2012",
     "id_str": "245270269525123072",
     "place": null,
     "user": {
        "location": "",
        "default_profile": true,
        "statuses_count": 1294,
        "profile_background_tile": false,
        "lang": "en",
        "profile_link_color": "0084B4",
        "id": 21804678,
        "following": null,
        "protected": false,
        "favourites_count": 11,
        "profile_text_color": "333333",
        "verified": false,
        "description": "",
        "contributors_enabled": false,
        "name": "Parvez Jugon",
        "profile_sidebar_border_color": "C0DEED",
        "profile_background_color": "C0DEED",
        "created_at": "Tue Feb 24 22:10:43 +0000 2009",
        "default_profile_image": false,
        "followers_count": 70,
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2280737846/ni91dkogtgwp1or5rwp4_normal.gif",
        "geo_enabled": false,
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme1/bg.png",
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme1/bg.png",
        "follow_request_sent": null,
        "url": null,
        "utc_offset": null,
        "time_zone": null,
        "notifications": null,
        "friends_count": 299,
        "profile_use_background_image": true,
        "profile_sidebar_fill_color": "DDEEF6",
        "screen_name": "ParvezJugon",
        "id_str": "21804678",
        "profile_image_url": "http://a0.twimg.com/profile_images/2280737846/ni91dkogtgwp1or5rwp4_normal.gif",
        "show_all_inline_media": false,
        "is_translator": false,
        "listed_count": 7
     },
     "coordinates": null
  }
 

Once the table is set up, querying complex data is as simple as SQL:

hive> SELECT * FROM tweets;
OK
id	created_at	source	favorited	retweet_count	retweeted_status	entities	text	user	in_reply_to_screen_name
245270269525123072	Mon Sep 10 21:19:23 +0000 2012	web	false	0	{"text":"#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata","user":{"screen_name":"ScottOstby","name":"Scott Ostby"}}	{"urls":[],"user_mentions":[{"screen_name":"ScottOstby","name":"Scott Ostby"}],"hashtags":[{"text":"Crowdsourcing"}]}	RT @ScottOstby: #Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is  ...	{"screen_name":"ParvezJugon","name":"Parvez Jugon","friends_count":299,"followers_count":70,"statuses_count":1294,"verified":false,"utc_offset":null,"time_zone":null}	NULL
Time taken: 3.07 seconds
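
Individual fields within the complex columns can be addressed with dot notation for structs and bracket indexing for arrays. For example (a hypothetical query using the column names from the table above):

hive> SELECT user.screen_name,
             retweeted_status.user.name,
             entities.hashtags[0].text
      FROM tweets;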

If you’re interested in getting the code to use the JSONSerDe in a real environment, see the CDH Twitter Example on the Cloudera Github.

Conclusion

The SerDe interface is extremely powerful for dealing with data with a complex schema. By utilizing SerDes, any dataset can be made queryable through Hive. For a full use case involving the JSONSerDe discussed in this article, see Analyzing Twitter with Apache Hadoop.

Jon Natkins is a Software Engineer at Cloudera, working on the Enterprise team.
