Connecting to HDFS and Creating Parquet Files with the Java API

1. Set up the connection (see the earlier article: "Operating Hadoop in HA Mode with the Java API"):

    static String ClusterName = "nsstargate";
    private static final String HADOOP_URL = "hdfs://" + ClusterName;
    public static Configuration conf;

    static {
        conf = new Configuration();
        conf.set("fs.defaultFS", HADOOP_URL);
        conf.set("dfs.nameservices", ClusterName);
        conf.set("dfs.ha.namenodes." + ClusterName, "nn1,nn2");
        conf.set("dfs.namenode.rpc-address." + ClusterName + ".nn1", "172.16.50.24:8020");
        conf.set("dfs.namenode.rpc-address." + ClusterName + ".nn2", "172.16.50.21:8020");
        conf.set("dfs.client.failover.proxy.provider." + ClusterName,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
    }

Note: if you only write `Configuration conf = new Configuration();` without setting the HDFS connection information, the file will be written to the local disk instead (which requires a local Hadoop environment to be configured).
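Before writing any Parquet files, it can help to confirm that the HA configuration actually reaches the cluster. The sketch below is a minimal, self-contained check under the same settings as the static block above; the `/user/test` directory is only an example:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws IOException {
        // Same HA settings as the static block above
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nsstargate");
        conf.set("dfs.nameservices", "nsstargate");
        conf.set("dfs.ha.namenodes.nsstargate", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.nsstargate.nn1", "172.16.50.24:8020");
        conf.set("dfs.namenode.rpc-address.nsstargate.nn2", "172.16.50.21:8020");
        conf.set("dfs.client.failover.proxy.provider.nsstargate",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Listing a directory confirms the client can reach the active NameNode
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/user/test"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```

If the failover proxy provider is misconfigured, this check fails immediately rather than at write time, which makes it easier to isolate connection problems from Parquet problems.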

2. Define the Parquet file schema

public static final MessageType FILE_SCHEMA = Types.buildMessage()
        .required(PrimitiveTypeName.INT32).named("id")
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("name")
        .named("test");
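The same schema can also be expressed as a schema string and parsed with `MessageTypeParser`, which is convenient when the schema comes from a file or configuration rather than code. A sketch, assuming the `org.apache.parquet.*` coordinates of parquet-mr 1.8+:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SchemaFromString {
    // Equivalent to the builder-based FILE_SCHEMA above
    public static final MessageType PARSED_SCHEMA = MessageTypeParser.parseMessageType(
            "message test {\n"
          + "  required int32 id;\n"
          + "  required binary name (UTF8);\n"
          + "}");

    public static void main(String[] args) {
        System.out.println(PARSED_SCHEMA);
    }
}
```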

3. Create the Parquet file and write data

Two methods have been tested successfully:

	public static void createFile1() throws Exception {
		String file = "/user/test/test_parquet_file7.parquet";
		Path path = new Path(file);
		FileSystem fs = path.getFileSystem(conf);
		if (fs.exists(path)) {
			fs.delete(path, true);
		}
		GroupWriteSupport.setSchema(FILE_SCHEMA, conf);
		SimpleGroupFactory f = new SimpleGroupFactory(FILE_SCHEMA);
		// CODEC is a CompressionCodecName constant (e.g. CompressionCodecName.SNAPPY);
		// the int arguments are block size, page size, and dictionary page size (bytes)
		ParquetWriter<Group> writer = new ParquetWriter<Group>(path, new GroupWriteSupport(),
				CODEC, 1024, 1024, 512, true, false, WriterVersion.PARQUET_2_0, conf);
		for (int i = 0; i < 100; i++) {
			writer.write(f.newGroup()
					.append("id", i)
					.append("name", UUID.randomUUID().toString()));
		}
		writer.close();
	}

	public static void createFile2() throws Exception {
		String file = "/user/test/test_parquet_file8.parquet";
		Path path = new Path(file);
		FileSystem fs = path.getFileSystem(conf);
		if (fs.exists(path)) {
			fs.delete(path, true);
		}
		SimpleGroupFactory f = new SimpleGroupFactory(FILE_SCHEMA);
		ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
				.withConf(conf)
				.withType(FILE_SCHEMA)
				.build();
		for (int i = 0; i < 100; i++) {
			Group group = f.newGroup();
			group.add("id", i);
			group.add("name", UUID.randomUUID().toString());
			writer.write(group);
		}
		writer.close();
	}

The two classes used above, GroupWriteSupport and ExampleParquetWriter, come from the parquet-mr project.
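To verify the written data, the file can be read back with `ParquetReader` and `GroupReadSupport` from the same parquet-mr example package. A minimal sketch, assuming the file path from `createFile2` and the HA settings from step 1:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ReadParquetFile {
    public static void main(String[] args) throws Exception {
        // conf should carry the same HA settings as in step 1
        Configuration conf = new Configuration();
        Path path = new Path("/user/test/test_parquet_file8.parquet");
        ParquetReader<Group> reader = ParquetReader
                .builder(new GroupReadSupport(), path)
                .withConf(conf)
                .build();
        Group group;
        while ((group = reader.read()) != null) {
            // Index 0: first (and only) value of each field in this record
            System.out.println(group.getInteger("id", 0) + "\t" + group.getString("name", 0));
        }
        reader.close();
    }
}
```

`reader.read()` returns null at end of file, so the loop prints all 100 records written above.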

4. Create the Hive table and load the Parquet file

create table test_parquet_table(id int, name string) stored as parquet;

load data inpath '/user/test/test_parquet_file8.parquet' overwrite into table test_parquet_table;

When creating the file and loading it into a Hive table, note the following:

When a field is of boolean type, the schema type, the written value, and the Hive column type must all be boolean. The Hive table's columns do not have to be in the same order as the schema's fields, but the names must match.
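To illustrate the boolean note, here is a sketch of a schema with a boolean field and a matching Hive DDL; the field and table names are hypothetical:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class BooleanSchemaExample {
    // Hypothetical schema with a boolean field "active"
    public static final MessageType SCHEMA = Types.buildMessage()
            .required(PrimitiveTypeName.INT32).named("id")
            .required(PrimitiveTypeName.BOOLEAN).named("active")
            .named("test_bool");

    // Matching Hive DDL: column order may differ from the schema,
    // but the names must match:
    //   create table test_bool_table(active boolean, id int) stored as parquet;

    public static void main(String[] args) {
        System.out.println(SCHEMA);
    }
}
```

When writing, the value passed for such a field must be a Java boolean, e.g. `group.add("active", true)`.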

Reposted from: https://my.oschina.net/nivalsoul/blog/776713
