通过写文件方式写数据到hive库中

最新推荐文章于 2023-06-12 11:39:11 发布

伟衙内

最新推荐文章于 2023-06-12 11:39:11 发布

阅读量836

点赞数 1

分类专栏： kettle 文章标签： hive hadoop

本文链接：https://blog.csdn.net/lw18751836671/article/details/119911638

版权

kettle 专栏收录该内容

25 篇文章 7 订阅

订阅专栏

前言

还是之前kettle中那个hive环境，这次不用kettle将数据加载到hive中去，而是用写文件的方式。

环境介绍：

三个hadoop节点

192.168.10.211 node01 （master）

192.168.10.212 node02 (slave)

192.168.10.213 node03 (slave)

hive安装在 192.168.10.212 上。

hive上的user_info_dest表的DDL语句，

CREATE TABLE `ntzw_dev_64.user_info_dest`(
  `id` string, 
  `name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://node01:9000/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='7', 
  'numRows'='6', 
  'rawDataSize'='231', 
  'totalSize'='237', 
  'transient_lastDdlTime'='1629798016')

hadoop的命令： http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html

环境查看

首先去hadoop的安装目录，查看文件系统中的文件，

./hadoop fs -ls -R /user/hive_remote/warehouse/ntzw_dev_64.db

/user/hive_remote/warehouse/ntzw_dev_64.db 这个是上面DDL语句出location的路径。

如下图所示，这是当初在搞kettle的时候建立的两个表，这是以目录形式存在，然后表数据以文件形式存在，

使用如下语句将表清空，然后重新造了两条数据，

insert overwrite table user_info_dest select * from user_info_dest where 1=0;

继续看下文件系统中的样子，有3个user_info_dest目录下的文件，看下这三个文件内容

可以看到copy_1和copy_2中分散放了这两条数据

那么如果对这些文件操作，是否就可以在Hive中看到了呢？这就要用到hive的一个SQL，load data

编写代码

pom中引入包

<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-common</artifactId>
   <version>2.7.3</version>
</dependency>
<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs</artifactId>
   <version>2.7.3</version>
</dependency>

      <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-jdbc</artifactId>
          <version>1.2.1</version>
      </dependency>

写一个junit测试类，这个是主要的内容，先往hadoop中写入，然后利用load data这个方法来加载到hive库中。完整的代码会在最后贴出来。

/**
     * 通过写hive的文件系统，将数据录入进去
     */
    @Test
    public void writeHiveFile() throws SQLException, IOException {
        String randomId = UUID.randomUUID().toString().replace("-","");

        String tableName= "user_info_dest";

        String hdfsIpAndPort = "hdfs://192.168.10.212:9000";
        String dbPath = "/user/hive_remote/warehouse/ntzw_dev_64.db/";

        //建表语句后面的  location
        String filePath = hdfsIpAndPort+dbPath+tableName+"_"+randomId;

        //将数据写入Hadoop中
        this.writeData2Hadoop(filePath);

        //加载到hive表的语句  从hadoop文件中加载到hive库中，这样 select 语句才能看见
        String loadDataSql = "load data inpath '"+filePath+"' "+"into table "+tableName;

        System.out.println(loadDataSql);

        //获取连接，并执行加载语句
        Connection connection = this.getHiveConnection();
        Statement stmt = connection.createStatement();
        stmt.execute(loadDataSql);

        stmt.close();
        connection.close();

        //加载完成后需要删除临时文件
        this.deleteHiveTmpFile(filePath);

    }

/**
     * 往hadoop文件系统里面写文件
     * @param filePath
     * @throws IOException
     */
    private void writeData2Hadoop(String filePath) throws IOException {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.support.append", true);
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        //获取文件系统对象
        FileSystem fs = FileSystem.get(URI.create(filePath),conf);

        OutputStream output = null;

        //文件检查
        Path path = new Path(filePath);
        if (!(fs.exists(path)&&fs.isFile(path))) {
            log.info("不存在文件[{}],开始创建文件.",filePath);
            fs.createNewFile(path);
            output = fs.create(path);
        } else {
            log.info("文件[{}]已存在,追加内容到文件",filePath);
            output = fs.append(path);
        }

        //往文件里面写数据
        List<UserInfoDest> dataList = this.initData();
        for (UserInfoDest data : dataList) {
            //目标表只有两列  id,name

            StringBuilder sb = new StringBuilder();
            sb.append(data.id);//表的id
            //hive的表的默认分割符，\t，如果指定了其他列分割符，那么此处也要改变
            sb.append("\001");
            sb.append(data.name);//表的name
            sb.append("\n");//一行内容结束后换行
            output.write(sb.toString().getBytes());
        }
        output.flush();
        output.close();

        fs.close();
    }

错误解决

连接拒绝

连接拒绝，咋还能拒绝呢，幸好报错后面有链接。

报错后面的链接，https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused

打开以后，看到如下图，上面说可能是/etc/hosts文件中有127.0.0.1或者localhost这种映射。

查看192.168.10.212，安装了hadoop一个节点的敌方，也是安装hive的机器，好家伙，还真是，立刻注释掉。

然后重启了hadoop(其实压根不需要重启，重启毛线啊)

重新运行代码，发现还是连接拒绝，然后检查一下代码，突然灵光一闪，我看到了hdfs写的是192.168.10.212，这个是hive的安装路径。

然后我看了一下user_info_dest的DDL语句中location，上面是node01，node01不是192.168.10.211的IP么？

将代码中filePath的改为IP改为192.168.10.211

//建表语句后面的  location
String filePath = "hdfs://192.168.10.211:9000/user/hive_remote/warehouse/ntzw_dev_64.db/"+tableName+"_"+randomId;

Permission denied

继续运行，新的报错，权限问题。

完整点的报错如下所示，

org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest_0730452cf2c94db38a32a0f32b499646":hadoop:supergroup:drwxr-xr-x

at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)

at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)

at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)

at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)

at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)

at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)

at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1695)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2515)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624)

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)

at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)

at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)

at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1653)

at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)

at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)

at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)

at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)

at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)

at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:854)

at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1154)

at com.lw.kettle.HiveOper.writeData2Hadoop(HiveOper.java:76)

分析上述报错，使用用户Administrator，没有写权限，这个Administrator不是我windows本地的用户么？

在hadoop上安装程序使用的用户是新建的hadoop/hadoop用户，就算是root用户我也认啊，怎么是windows用户呢？

然后我百度了一下，https://blog.csdn.net/zhangjunli/article/details/106321516

上面这个博客告诉了我原因，原来会先登录，登录使用的是环境变量的配置，如果没有配置获取windows用户。

查看了权限，果然是 r_x，那么此处或许可以如下命令，改变权限，但我觉得还是用HADOOP_USER_NAME好一点。

修改权限（不推荐）：

hadoop fs -chmod 777 /user/hive_remote/warehouse/ntzw_dev_64.db

决定采用配置环境变量的方式，

还是不行，调试代码发现获取为空，

然后自己写了一个Test，还是为空，

@Test
public void test(){
    String name = "HADOOP_USER_NAME";
    String hadNameEnv = System.getenv(name);
    String hadNamePro = System.getProperty(name);
    System.out.println(hadNameEnv+"   "+hadNamePro);
}

如上test方法一直是空，估摸着是环境变量没有生效，是不是重启电脑？那还是算了吧。

直接自己设置下即可，

@Before
public void init(){
    System.setProperty("HADOOP_USER_NAME","hadoop");
}

问题解决，然后新的错误在等待。

路径不对

继续运行代码，报如下错误，

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10028]: Line 1:17 Path is not legal ''hdfs://192.168.10.211:9000/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest_93fddf2f22554d4aadb16ed38788664d'': Move from: hdfs://192.168.10.211:9000/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest_93fddf2f22554d4aadb16ed38788664d to: hdfs://node01:9000/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest is not valid. Please check that values for params "default.fs.name" and "hive.metastore.warehouse.dir" do not conflict.

at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)

at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)

at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)

at com.lw.kettle.HiveOper.writeHiveFile(HiveOper.java:53)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)

at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)

at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)

at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)

at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)

at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)

at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)

at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)

at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)

at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)

at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)

at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)

at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)

at org.junit.runners.ParentRunner.run(ParentRunner.java:363)

at org.junit.runner.JUnitCore.run(JUnitCore.java:137)

at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)

at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)

at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)

at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

看一眼Hadoop上，确实有我们的文件了，那么就是还没有加载到hive中去，

查看下文件数据，数据正常，是我们生成数据，

将直接在212上手工执行，结果发现是可以通过load data inpath加载到Hive库，那说明这个方法没问题啊！

数据添加进去了，但是为何代码执行不可？

那猜测一点，Hive是安装在212，也就是子节点上，而连接是连接的主节点211，所以无法将211的文件放入212的hive表中，可为何ddl语句是node01，而不是node02呢？

在看文件，加载后路径改变了，

从/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest_93fddf2f22554d4aadb16ed38788664d

到/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest/user_info_dest_93fddf2f22554d4aadb16ed38788664d

重新运行一遍，打印了SQL，

load data inpath 'hdfs://192.168.10.211:9000/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest_3d205d05ae6047eea05a27ed6089d0c1' into table user_info_dest

对比刚才手工执行的，很明显是多了hdfs://192.168.10.211:9000，我靠。

load data inpath "/user/hive_remote/warehouse/ntzw_dev_64.db/user_info_dest_93fddf2f22554d4aadb16ed38788664d" into table user_info_dest;

最终将代码路径改为如下，

看到这个绿色是多么令人兴奋啊，

查看表数据也有了，

完整代码

package com.lw.kettle;

import lombok.extern.slf4j.Slf4j;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.*;

/**
 * @author lw
 * @date 2021/8/24 0024
 * @description
 */
@Slf4j
public class HiveOper {

    @Before
    public void init(){
        System.setProperty("HADOOP_USER_NAME","hadoop");
    }

    /**
     * 通过写hive的文件系统，将数据录入进去
     */
    @Test
    public void writeHiveFile() throws SQLException, IOException {
        String randomId = UUID.randomUUID().toString().replace("-","");

        String tableName= "user_info_dest";

        String hdfsIpAndPort = "hdfs://192.168.10.211:9000";
        String dbPath = "/user/hive_remote/warehouse/ntzw_dev_64.db/";

        //建表语句后面的  location
        String filePath = hdfsIpAndPort+dbPath+tableName+"_"+randomId;

        //将数据写入Hadoop中
        this.writeData2Hadoop(filePath);

        //加载到hive表的语句  从hadoop文件中加载到hive库中，这样 select 语句才能看见
        String loadDataSql = "load data inpath '"+dbPath+tableName+"_"+randomId+"' "+"into table "+tableName;

        System.out.println(loadDataSql);

        //获取连接，并执行加载语句
        Connection connection = this.getHiveConnection();
        Statement stmt = connection.createStatement();
        stmt.execute(loadDataSql);

        stmt.close();
        connection.close();

        //加载完成后需要删除临时文件
        this.deleteHiveTmpFile(filePath);

    }

    /**
     * 往hadoop文件系统里面写文件
     * @param filePath
     * @throws IOException
     */
    private void writeData2Hadoop(String filePath) throws IOException {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.support.append", true);
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        //获取文件系统对象
        FileSystem fs = FileSystem.get(URI.create(filePath),conf);

        OutputStream output = null;

        //文件检查
        Path path = new Path(filePath);
        if (!(fs.exists(path)&&fs.isFile(path))) {
            log.info("不存在文件[{}],开始创建文件.",filePath);
            fs.createNewFile(path);
            output = fs.create(path);
        } else {
            log.info("文件[{}]已存在,追加内容到文件",filePath);
            output = fs.append(path);
        }

        //往文件里面写数据
        List<UserInfoDest> dataList = this.initData();
        for (UserInfoDest data : dataList) {
            //目标表只有两列  id,name

            StringBuilder sb = new StringBuilder();
            sb.append(data.id);//表的id
            //hive的表的默认分割符，\t，如果指定了其他列分割符，那么此处也要改变
            sb.append("\001");
            sb.append(data.name);//表的name
            sb.append("\n");//一行内容结束后换行
            output.write(sb.toString().getBytes());
        }
        output.flush();
        output.close();

        fs.close();
    }

    /**
     * 删除Hive临时文件
     * @param filePath
     */
    private void deleteHiveTmpFile(String filePath){
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.support.append", true);
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        FileSystem fs = null;
        Path path = new Path(filePath);

        try{
            fs = FileSystem.get(URI.create(filePath), conf);

            if ((fs.exists(path)&&fs.isFile(path))) {
                fs.delete(path,false);
            }
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    /**
     * 获取hive的连接
     * @return
     */
    private Connection getHiveConnection(){
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String url = "jdbc:hive2://192.168.10.212:10000/ntzw_dev_64";
        try {
            Class.forName(driver);
            Connection conn = DriverManager.getConnection(url,"hadoop","hadoop");
            return conn;
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * 初始化数据
     * @return
     */
    private List<UserInfoDest> initData(){
        String[] cols = new String[]{"id","name"};

        List<UserInfoDest> retList = new ArrayList<UserInfoDest>();

        //初始化两条数据
        for (int i = 0; i < 2; i++) {
            UserInfoDest dest = new UserInfoDest();
            dest.id = cols[0]+"_"+i;
            dest.name = cols[1]+"_"+i;
            retList.add(dest);
        }
        return retList;
    }
}

/**
 * hive的表对象
 */
class UserInfoDest{
    public String id;
    public String name;
}

当然，此处依旧是一个简单模型，在项目中真正用的时候比这个复杂很多，不过简单模型能帮助自己看懂和扩展。

伟衙内

关注

1
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
通过写文件方式写数据到hive库中

目录前言环境查看编写代码错误解决连接拒绝 Permission denied路径不对完整代码前言还是之前kettle中那个hive环境，这次不用kettle将数据加载到hive中去，而是用写文件的方式。环境介绍：三个hadoop节点192.168.10.211 node01 （master）192.168.10.212 node02 (slave)192...
复制链接

扫一扫

专栏目录