Generating a unique ID for every line of a large data file

lingerlanlan

Category: Hadoop. Tags: hive, unique ID, auto-increment ID

Article link: https://blog.csdn.net/lingerlanlan/article/details/46430747

There are four main approaches:

1 Single-threaded processing

2 Plain multithreading

3 Hive

4 Hadoop
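As a baseline for the first approach, here is a minimal sketch of single-threaded numbering: read the file line by line and prefix each line with a counter. The class name `SequentialLineIds`, the method `numberLines`, and the tab-separated output format are my own choices for illustration, and a `StringReader` stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class SequentialLineIds {

    // Streams lines from `in` to `out`, prefixing each with a 1-based id and a tab.
    // Returns the number of lines written.
    static long numberLines(BufferedReader in, Writer out) {
        long id = 0; // long, so files with more than 2^31 - 1 lines still work
        try {
            String line;
            while ((line = in.readLine()) != null) {
                id++;
                out.write(id + "\t" + line + "\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return id;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a real input file.
        BufferedReader in = new BufferedReader(new StringReader("a\nb\nc"));
        StringWriter out = new StringWriter();
        long count = numberLines(in, out);
        System.out.print(out);
        System.out.println("lines: " + count);
    }
}
```

This is simple and gives contiguous ids, but it is bound by the read speed of one machine, which is why the other three approaches are worth considering for very large files.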

I found some reference material:

Notes on "Hadoop实战" (Hadoop in Action), part 2: Hadoop input and output

https://book.douban.com/annotation/17068812/

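TextInputFormat: the key is the file offset and the value is the whole line.

However, this offset appears to be the offset within a single file, not a global offset across all input files.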
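To illustrate why a per-file offset alone is not enough, and one way around it, the sketch below combines the file name with the offset. The file names and offsets are made up, and `compositeId` is a hypothetical helper; no Hadoop classes are used so that the example runs on its own.

```java
import java.util.HashSet;
import java.util.Set;

public class FileOffsetIds {

    // Builds an id from a file name and the byte offset of a line in that file.
    static String compositeId(String fileName, long byteOffset) {
        return fileName + ":" + byteOffset;
    }

    public static void main(String[] args) {
        // Two hypothetical input files whose lines happen to start at the same offsets.
        String[] files = {"part-00000", "part-00001"};
        long[] offsets = {0L, 12L, 31L};

        Set<Long> offsetOnly = new HashSet<>();
        Set<String> composite = new HashSet<>();
        for (String file : files) {
            for (long offset : offsets) {
                offsetOnly.add(offset);                    // collides across files
                composite.add(compositeId(file, offset));  // unique per (file, offset)
            }
        }
        System.out.println("distinct offsets: " + offsetOnly.size());
        System.out.println("distinct composite ids: " + composite.size());
    }
}
```

The resulting ids are unique but they are strings, not consecutive integers, so this only helps when uniqueness is all that is required. In Hive, the virtual columns `INPUT__FILE__NAME` and `BLOCK__OFFSET__INSIDE__FILE` should provide the same two ingredients, although I have not checked how they behave for every file format.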

Generate Auto-increment Id in Map-reduce Job

http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/
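One common way to get unique integer ids from parallel tasks, whether threads or map tasks, is to give each task a distinct starting value and let it advance by the total number of tasks. The sketch below shows the idea in plain Java; it is not taken from the article linked above, and `StridedIdGenerator`, `taskIndex`, and `taskCount` are names I made up. It assumes the number of tasks is known before any id is generated.

```java
import java.util.HashSet;
import java.util.Set;

public class StridedIdGenerator {
    private final int stride; // total number of tasks
    private long next;        // next id this task will hand out

    // taskIndex must be in [0, taskCount); both are assumed to be known up front.
    StridedIdGenerator(int taskIndex, int taskCount) {
        if (taskCount <= 0 || taskIndex < 0 || taskIndex >= taskCount) {
            throw new IllegalArgumentException("need 0 <= taskIndex < taskCount");
        }
        this.stride = taskCount;
        this.next = taskIndex;
    }

    // Returns taskIndex, taskIndex + taskCount, taskIndex + 2 * taskCount, ...
    long nextId() {
        long id = next;
        next += stride;
        return id;
    }

    public static void main(String[] args) {
        int taskCount = 3;
        Set<Long> seen = new HashSet<>();
        for (int task = 0; task < taskCount; task++) {
            StridedIdGenerator gen = new StridedIdGenerator(task, taskCount);
            for (int row = 0; row < 4; row++) {
                seen.add(gen.nextId());
            }
        }
        // 3 tasks x 4 rows, with no id handed out twice.
        System.out.println("distinct ids: " + seen.size());
    }
}
```

The ids are unique across tasks without any coordination, but they are only contiguous if every task processes the same number of rows; otherwise there will be gaps.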

Generate unique customer id / insert unique rows in hive

http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive

Need to add auto increment column in a table using hive

http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive

https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/

Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.

In the end I went with writing a UDF in Hive.

package hive.udf;
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful flag is required
public class UDFRowSequence extends UDF
{
  private int result;

  public UDFRowSequence() {
    result=0;
  }

  public int evaluate() {
    result++;
    return result;
  }
}

// End UDFRowSequence.java
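One caveat worth keeping in mind: the counter is a field of the UDF object, so as far as I understand, every task that creates its own instance of the UDF starts counting from 1 again. If the query runs with more than one map task, the generated values would then be unique only within each task. The plain-Java stand-in below shows the effect; `RowCounter` is a hypothetical class that copies the counter logic without depending on Hive.

```java
public class PerInstanceCounterDemo {

    // Hypothetical stand-in for the UDF: same counter logic, no Hive classes.
    static class RowCounter {
        private int result = 0;

        int evaluate() {
            result++;
            return result;
        }
    }

    public static void main(String[] args) {
        // Imagine each instance living in a different map task.
        RowCounter taskA = new RowCounter();
        RowCounter taskB = new RowCounter();

        int a1 = taskA.evaluate();
        int a2 = taskA.evaluate();
        int b1 = taskB.evaluate();

        System.out.println("task A: " + a1 + ", " + a2);
        System.out.println("task B: " + b1);
        System.out.println("collision: " + (a1 == b1));
    }
}
```

If that is a concern, it may be worth either forcing the numbering step to run in a single task or combining the counter with something that identifies the task. Note also that an `int` counter wraps around after 2^31 - 1 rows, which matters for very large inputs.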

Author: linger
