Delaying SparkContext Initialization

The author's blog has moved to cnblogs: http://www.cnblogs.com/xiaodf/

In some applications you may want to run a piece of standalone Java code on the driver first, and only initialize the SparkContext afterwards to operate on that Java program's return value in cluster mode. This avoids creating the SparkContext object and allocating cluster resources too early, which would leave those resources idle for a long time.
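
A minimal sketch of this pattern (the class and method names below are illustrative, not from the original job; the full test case is shown further down):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DelayedInitSketch {
	public static void main(String[] args) throws Exception {
		// Step 1: run the standalone (driver-local) Java work first; no cluster resources are held yet.
		long localResult = doDriverLocalWork();

		// Step 2: only now create the SparkContext, so executors are requested as late as possible.
		SparkConf sparkConf = new SparkConf().setAppName("DelayedInitSketch");
		JavaSparkContext jsc = new JavaSparkContext(sparkConf);
		try {
			// Hand the driver-local result to the cluster, e.g. by parallelizing around it.
			long n = jsc.parallelize(java.util.Arrays.asList(localResult)).count();
			System.out.println("count = " + n);
		} finally {
			jsc.stop();
		}
	}

	private static long doDriverLocalWork() throws InterruptedException {
		Thread.sleep(60 * 1000); // placeholder for the long-running single-machine step
		return 42L;
	}
}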

Two YARN parameters are involved here:

  <property>
    <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
    <value>6000000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-retries</name>
    <value>10</value>
  </property>

YARN periodically iterates over all ApplicationMasters. If an ApplicationMaster has not reported a heartbeat within a certain period (configurable via yarn.am.liveness-monitor.expiry-interval-ms, default 10 minutes), it is considered dead, and all Containers running under it are marked as failed (the RM does not re-run these Containers itself; it only reports them to the corresponding AM through the heartbeat mechanism, and the AM decides whether to re-run them, requesting resources from the RM again if necessary). The AM itself is then re-launched on another node (an administrator can set the number of attempts per ApplicationMaster with yarn.resourcemanager.am.max-retries, default 1).
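
These two settings live in the ResourceManager's yarn-site.xml. If you want to check which values a client actually sees, a minimal sketch is below, assuming the client's yarn-site.xml mirrors the ResourceManager's (the class name is illustrative; the fallback defaults are the ones quoted above):

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CheckYarnAmSettings {
	public static void main(String[] args) {
		// YarnConfiguration loads yarn-default.xml and yarn-site.xml from the classpath.
		YarnConfiguration conf = new YarnConfiguration();
		long expiryMs = conf.getLong("yarn.am.liveness-monitor.expiry-interval-ms", 600000L);
		// Newer Hadoop versions rename this key to yarn.resourcemanager.am.max-attempts.
		int maxRetries = conf.getInt("yarn.resourcemanager.am.max-retries", 1);
		System.out.println("AM liveness expiry interval: " + expiryMs + " ms");
		System.out.println("AM max retries: " + maxRetries);
	}
}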

Two Spark parameters are also needed:

  <property>
    <name>spark.yarn.am.waitTime</name>
    <value>6000000</value>
  </property>
  <property>
    <name>spark.yarn.applicationMaster.waitTries</name>
    <value>200</value>
  </property>

Cluster management: Spark on YARN

Property                                    Default   Meaning
spark.yarn.scheduler.heartbeat.interval-ms  5000      Interval (ms) at which the Spark ApplicationMaster sends heartbeats to the YARN ResourceManager
spark.yarn.am.waitTime                      100000    How long (ms) to wait at startup
spark.yarn.applicationMaster.waitTries      10        Number of retries while waiting for the Spark ApplicationMaster to start, i.e. for the SparkContext to be initialized; if exceeded, the launch fails
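
In yarn-cluster mode the ApplicationMaster starts waiting for the SparkContext before any user code runs, so spark.yarn.am.waitTime and spark.yarn.applicationMaster.waitTries are normally passed at submit time with --conf (as the test case below does) rather than set in code. A minimal sketch, assuming you only want to print the effective values from inside the driver (the class name is illustrative):

import org.apache.spark.SparkConf;

public class CheckSparkAmSettings {
	public static void main(String[] args) {
		// A SparkConf created in the driver picks up values passed via spark-submit --conf
		// or spark-defaults.conf; the second argument is the fallback default.
		SparkConf conf = new SparkConf();
		System.out.println("spark.yarn.am.waitTime = "
				+ conf.get("spark.yarn.am.waitTime", "100000"));
		System.out.println("spark.yarn.applicationMaster.waitTries = "
				+ conf.get("spark.yarn.applicationMaster.waitTries", "10"));
		System.out.println("spark.yarn.scheduler.heartbeat.interval-ms = "
				+ conf.get("spark.yarn.scheduler.heartbeat.interval-ms", "5000"));
	}
}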

Below is a test case: the driver first prints messages on its own for 30 minutes, and only then initializes the SparkContext.

package iie.udps.example.spark;

import iie.udps.common.hcatalog.SerHCatInputFormat;
import iie.udps.common.hcatalog.SerHCatOutputFormat;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;
import org.apache.spark.SerializableWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;

/**
 * What this does: first prints data on the driver (standalone) for 30 minutes, then initializes the
 * SparkContext to switch to cluster mode, uses Spark + HCatalog to read a Hive table and implement a
 * GroupByAge count, writes the result to a Hive output table, and also writes an XML status file to HDFS.
 *
 * spark-submit --class iie.udps.example.spark.SparkTest --master yarn-cluster
 * --num-executors 2 --executor-memory 1g --executor-cores 1 --driver-memory 1g
 * --conf spark.yarn.applicationMaster.waitTries=200 --conf spark.yarn.am.waitTime=1800000
 * --jars /home/xdf/udps-sdk-0.3.jar,/home/xdf/udps-sdk-0.3.jar
 * /home/xdf/sparktest.jar -c /user/hdfs/TestStdin2.xml
 */
public class SparkTest {

	@SuppressWarnings("rawtypes")
	public static void main(String[] args) throws Exception {
		if (args.length < 2) {
			System.err.println("Usage: <-c> <stdin.xml>");
			System.exit(1);
		}
		
		String stdinXml = args[1];
		OperatorParamXml operXML = new OperatorParamXml();
		List<java.util.Map> stdinList = operXML.parseStdinXml(stdinXml); // parameter list

		// Read the input parameters
		String inputDBName = stdinList.get(0).get("inputDBName").toString();
		String inputTabName = stdinList.get(0).get("inputTabName").toString();
		String outputDBName = stdinList.get(0).get("outputDBName").toString();
		String outputTabName = stdinList.get(0).get("outputTabName").toString();
		String tempHdfsBasePath = stdinList.get(0).get("tempHdfsBasePath")
				.toString();
		String jobinstanceid = stdinList.get(0).get("jobinstanceid").toString();
		
		System.out.println(inputDBName+": "+ inputTabName +": "+outputDBName+": "+ outputTabName
				+": "+ tempHdfsBasePath+": "+ jobinstanceid);

		long begin = System.currentTimeMillis(); 
		int count = 600; // number of loop iterations: 600 x 3 s of sleep ≈ 30 minutes
		for (int i = 0; i < count; i++) {
			System.out.println("aaaaaaaaaaaaaaa"+i);
			Thread.sleep(3000);
		}
		long end = System.currentTimeMillis();
		System.out.println("Driver-local print loop took: " + (end - begin) + "ms");
		
		
		if (inputDBName.isEmpty() || inputTabName.isEmpty()
				|| outputDBName.isEmpty() || outputTabName.isEmpty()
				|| tempHdfsBasePath.isEmpty() || jobinstanceid.isEmpty()) {

			// Set the error output parameters
			java.util.Map<String, String> stderrMap = new HashMap<String, String>();
			String errorMessage = "Some operating parameters are empty!";
			String errotCode = "80001";
			stderrMap.put("errorMessage", errorMessage);
			stderrMap.put("errotCode", errotCode);
			stderrMap.put("jobinstanceid", jobinstanceid);
			String fileName = "";
			if (tempHdfsBasePath.endsWith("/")) {
				fileName = tempHdfsBasePath + "stderr.xml";
			} else {
				fileName = tempHdfsBasePath + "/stderr.xml";
			}
			
			// Generate the error output file
			operXML.genStderrXml(fileName, stderrMap);
		} else {			
			// Get the input table's schema, so an output table with the same structure can be created
			HCatSchema schema = operXML
					.getHCatSchema(inputDBName, inputTabName);

			// The first thing a Spark program does is create a JavaSparkContext, which tells Spark how to connect to the cluster
			SparkConf sparkConf = new SparkConf().setAppName("SparkExample");
			
			JavaSparkContext jsc = new JavaSparkContext(sparkConf);
			
			// Read the Hive table data, process it and return the resulting RDD
			JavaRDD<SerializableWritable<HCatRecord>> LastRDD = getProcessedData(
					jsc, inputDBName, inputTabName, schema);
			
			// Store the processed data into the Hive output table
			storeToTable(LastRDD, outputDBName, outputTabName);

			jsc.stop();

			// Set the normal output parameters
			java.util.Map<String, String> stdoutMap = new HashMap<String, String>();
			stdoutMap.put("outputDBName", outputDBName);
			stdoutMap.put("outputTabName", outputTabName);
			stdoutMap.put("jobinstanceid", jobinstanceid);
			String fileName = "";
			if (tempHdfsBasePath.endsWith("/")) {
				fileName = tempHdfsBasePath + "stdout.xml";
			} else {
				fileName = tempHdfsBasePath + "/stdout.xml";
			}
			
			// Generate the normal output file
			operXML.genStdoutXml(fileName, stdoutMap);
		}
		System.out.println(inputDBName+": "+ inputTabName +": "+outputDBName+": "+ outputTabName
				+": "+ tempHdfsBasePath+": "+ jobinstanceid);
		System.exit(0);
	}

	/**
	 * Read the Hive table via HCatalog, count the records grouped by the value of
	 * column 1 (age), and return the result as an RDD of HCatRecords.
	 * 
	 * @param jsc
	 * @param dbName
	 * @param inputTable
	 * @param schema
	 * @return
	 * @throws IOException
	 */
	@SuppressWarnings("rawtypes")
	public static JavaRDD<SerializableWritable<HCatRecord>> getProcessedData(
			JavaSparkContext jsc, String dbName, String inputTable,
			final HCatSchema schema) throws IOException {
		// Read the Hive table data via HCatalog
		Configuration inputConf = new Configuration();
		Job job = Job.getInstance(inputConf);
		SerHCatInputFormat.setInput(job.getConfiguration(), dbName, inputTable);
		JavaPairRDD<WritableComparable, SerializableWritable> rdd = jsc
				.newAPIHadoopRDD(job.getConfiguration(),
						SerHCatInputFormat.class, WritableComparable.class,
						SerializableWritable.class);

		// Get the table records and map each one to an (age, 1) pair
		JavaPairRDD<Integer, Integer> pairs = rdd
				.mapToPair(new PairFunction<Tuple2<WritableComparable, SerializableWritable>, Integer, Integer>() {
					private static final long serialVersionUID = 1L;

					@SuppressWarnings("unchecked")
					@Override
					public Tuple2<Integer, Integer> call(
							Tuple2<WritableComparable, SerializableWritable> value)
							throws Exception {
						HCatRecord record = (HCatRecord) value._2.value();
						return new Tuple2((Integer) record.get(1), 1);
					}
				});

		JavaPairRDD<Integer, Integer> counts = pairs
				.reduceByKey(new Function2<Integer, Integer, Integer>() {
					private static final long serialVersionUID = 1L;

					@Override
					public Integer call(Integer i1, Integer i2) {
						return i1 + i2;
					}
				});

		JavaRDD<SerializableWritable<HCatRecord>> messageRDD = counts
				.map(new Function<Tuple2<Integer, Integer>, SerializableWritable<HCatRecord>>() {
					private static final long serialVersionUID = 1L;

					@Override
					public SerializableWritable<HCatRecord> call(
							Tuple2<Integer, Integer> arg0) throws Exception {
						HCatRecord record = new DefaultHCatRecord(2);
						record.set(0, arg0._1);
						record.set(1, arg0._2);
						return new SerializableWritable<HCatRecord>(record);
					}
				});
		// Return the processed data
		return messageRDD;
	}

	/**
	 * Store the processed data into the output table.
	 * 
	 * @param rdd
	 * @param dbName
	 * @param tblName
	 */
	@SuppressWarnings("rawtypes")
	public static void storeToTable(
			JavaRDD<SerializableWritable<HCatRecord>> rdd, String dbName,
			String tblName) {
		Job outputJob = null;
		try {
			outputJob = Job.getInstance();
			outputJob.setJobName("SparkExample");
			outputJob.setOutputFormatClass(SerHCatOutputFormat.class);
			outputJob.setOutputKeyClass(WritableComparable.class);
			outputJob.setOutputValueClass(SerializableWritable.class);
			SerHCatOutputFormat.setOutput(outputJob,
					OutputJobInfo.create(dbName, tblName, null));
			HCatSchema schema = SerHCatOutputFormat
					.getTableSchemaWithPart(outputJob.getConfiguration());
			SerHCatOutputFormat.setSchema(outputJob, schema);
		} catch (IOException e) {
			e.printStackTrace();
		}

		// Save the RDD into the target table
		rdd.mapToPair(
				new PairFunction<SerializableWritable<HCatRecord>, WritableComparable, SerializableWritable<HCatRecord>>() {
					private static final long serialVersionUID = -4658431554556766962L;

					public Tuple2<WritableComparable, SerializableWritable<HCatRecord>> call(
							SerializableWritable<HCatRecord> record)
							throws Exception {
						return new Tuple2<WritableComparable, SerializableWritable<HCatRecord>>(
								NullWritable.get(), record);
					}
				}).saveAsNewAPIHadoopDataset(outputJob.getConfiguration());

	}
	

}

Below is the content of the stdin.xml parameter file:

<?xml version="1.0" encoding="UTF-8" ?>
<request>
	<jobinstanceid>SK9cohJD4yklcD8dJuZXDA</jobinstanceid>
	<context>
		<property name="userName" value="zhangsan" />
		<property name="queueName" value="queue1" />
		<property name="processId" value="dns" />
		<property name="jobId" value="jobID" />
		<property name="hiveServerAddress" value="IP:port " />
		<property name="tempDatabaseName" value="database1" />
		<property name="tempHdfsBasePath" value="/user/xdf/test/20141216/SK9cohJD4yklcD8dJuZXDA/spark_example_operator" />
		<property name="departmentId" value="xx" />
	</context>

	<operator name="spark_example_operator" alias="sparkExample" class="SK.I.SparkCountByAge">
	    <parameter name="parse_type">7</parameter>
		<parameter name="outputTableName">xdf.test_out</parameter>
	</operator>
	<datasets>
		<dataset name="inport1">
			<row>xdf.test_in</row>
		</dataset>
	</datasets>
</request>

The class used to parse the XML files is as follows:

package iie.udps.example.spark;

import iie.udps.common.hcatalog.SerHCatOutputFormat;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.SerDeInfo;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.hive.ql.io.RCFileOutputFormat;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Progressable;
import org.apache.hive.hcatalog.common.HCatUtil;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;
import org.apache.spark.SerializableWritable;
import org.apache.thrift.TException;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

/**
 * Generates and parses XML documents with dom4j.
 */
public class OperatorParamXml {
	
	@SuppressWarnings("rawtypes")
	public List<Map> parseStdinXml(String stdinXml) throws Exception {

		// Read the stdin.xml file from HDFS
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		FSDataInputStream dis = fs.open(new Path(stdinXml));
		InputStreamReader isr = new InputStreamReader(dis, "utf-8");
		BufferedReader read = new BufferedReader(isr);
		String tempString = "";
		String xmlParams = "";
		while ((tempString = read.readLine()) != null) {
			xmlParams += "\n" + tempString;
		}
		read.close();
		xmlParams = xmlParams.substring(1);

		String userName = null;
		String operatorName = null;
		String inputDBName = null;
		String outputDBName = null;
		String inputTabName = null;
		String outputTabName = null;
		String strs = null;
		String fieldName = null;
		String inputFilePath = null;
		String schemaList = null;
		String jobinstanceid = null;
		String tempHdfsBasePath = null;
		String queueName = null;
		String processId = null;
		String jobId = null;
		String hiveServerAddress = null;
		String departmentId = null;

		List<Map> list = new ArrayList<Map>();
		Map<String, String> map = new HashMap<String, String>();
		Document document = DocumentHelper.parseText(xmlParams); // parse the string into an XML document
		Element node1 = document.getRootElement(); // get the root element
		Iterator iter1 = node1.elementIterator(); // iterate over the root's child elements
		while (iter1.hasNext()) {
			Element node2 = (Element) iter1.next();
			if ("jobinstanceid".equals(node2.getName())) {
				jobinstanceid = node2.getText();
				map.put("jobinstanceid", jobinstanceid);
				System.out.println("====jobinstanceid=====" + jobinstanceid);
			}
			// Read the common context parameters
			if ("context".equals(node2.getName())) {
				Iterator iter2 = node2.elementIterator();
				while (iter2.hasNext()) {
					Element node3 = (Element) iter2.next();
					if ("property".equals(node3.getName())) {
						if ("userName".equals(node3.attributeValue("name"))) {
							userName = node3.attributeValue("value");
							map.put("userName", userName);
						} else if ("queueName".equals(node3
								.attributeValue("name"))) {
							queueName = node3.attributeValue("value");
							map.put("queueName", queueName);
						} else if ("processId".equals(node3
								.attributeValue("name"))) {
							processId = node3.attributeValue("value");
							map.put("processId", processId);
						} else if ("jobId".equals(node3.attributeValue("name"))) {
							jobId = node3.attributeValue("value");
							map.put("jobId", jobId);
						} else if ("hiveServerAddress".equals(node3
								.attributeValue("name"))) {
							hiveServerAddress = node3.attributeValue("value");
							map.put("hiveServerAddress", hiveServerAddress);
						} else if ("outputDBName".equals(node3
								.attributeValue("name"))) {
							outputDBName = node3.attributeValue("value");
							map.put("outputDBName", outputDBName);
						} else if ("tempHdfsBasePath".equals(node3
								.attributeValue("name"))) {
							tempHdfsBasePath = node3.attributeValue("value");
							map.put("tempHdfsBasePath", tempHdfsBasePath);
						} else if ("departmentId".equals(node3
								.attributeValue("name"))) {
							departmentId = node3.attributeValue("value");
							map.put("departmentId", departmentId);
						}
					}
				}
			}
			// Read the operator parameters
			if ("operator".equals(node2.getName())) {
				operatorName = node2.attributeValue("name");
				map.put("operatorName", operatorName);
				Iterator iter2 = node2.elementIterator();
				while (iter2.hasNext()) {
					Element node3 = (Element) iter2.next();
					if ("parameter".equals(node3.getName())) {
						if ("field1".equals(node3.attributeValue("name"))) {
							fieldName = node3.getText();
							map.put("fieldName", fieldName);
						}
						if ("outputTableName".equals(node3.attributeValue("name"))) {
							String tempStr = node3.getText(); // a "database.table" string
							if (!"".equals(tempStr.trim())) {
								String[] arr = tempStr.split("\\.");
								outputDBName = arr[0];
								outputTabName = arr[1];
							}
							map.put("outputDBName", outputDBName);
							map.put("outputTabName", outputTabName);
						}
						if ("inputFilePath"
								.equals(node3.attributeValue("name"))) {
							inputFilePath = node3.getText();
							map.put("inputFilePath", inputFilePath);
						}
						if ("schemaList".equals(node3.attributeValue("name"))) {
							schemaList = node3.getText();
							map.put("schemaList", schemaList);
						}
					}
				}
			}

			// Read the input dataset (database.table)
			if ("datasets".equals(node2.getName())) {
				Iterator iter2 = node2.elementIterator();
				while (iter2.hasNext()) {
					Element node3 = (Element) iter2.next();
					if ("inport1".equals(node3.attributeValue("name"))) {
						Iterator iter3 = node3.elementIterator();
						while (iter3.hasNext()) {
							Element node4 = (Element) iter3.next();
							strs = node4.getText();
						}
						if (!"".equals(strs.trim())) {
							String[] arr = strs.split("\\.");
							if(arr.length == 2){
								inputDBName = arr[0];
								inputTabName = arr[1];
							}							
						}
						map.put("inputDBName", inputDBName);
						map.put("inputTabName", inputTabName);
					}
				}
			}
		}
		list.add(map);
		return list;
	}

	@SuppressWarnings("rawtypes")
	public List<Map> parseStdoutXml(String stdinXml) throws Exception {

		// Read the XML file from HDFS
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		FSDataInputStream dis = fs.open(new Path(stdinXml));
		InputStreamReader isr = new InputStreamReader(dis, "utf-8");
		BufferedReader read = new BufferedReader(isr);
		String tempString = "";
		String xmlParams = "";
		while ((tempString = read.readLine()) != null) {
			xmlParams += "\n" + tempString;
		}
		read.close();
		xmlParams = xmlParams.substring(1);

		String outputDBName = null;
		String outputTabName = null;
		String jobinstanceid = null;

		List<Map> list = new ArrayList<Map>();
		Map<String, String> map = new HashMap<String, String>();
		Document document = DocumentHelper.parseText(xmlParams); // parse the string into an XML document
		Element node1 = document.getRootElement(); // get the root element
		Iterator iter1 = node1.elementIterator(); // iterate over the root's child elements
		while (iter1.hasNext()) {
			Element node2 = (Element) iter1.next();
			if ("jobinstanceid".equals(node2.getName())) {
				jobinstanceid = node2.getText();
				map.put("jobinstanceid", jobinstanceid);
				System.out.println("====jobinstanceid=====" + jobinstanceid);
			}
			// Read the output dataset (database.table)
			if ("datasets".equals(node2.getName())) {
				Iterator iter2 = node2.elementIterator();
				while (iter2.hasNext()) {
					Element node3 = (Element) iter2.next();
					if ("outport1".equals(node3.attributeValue("name"))) {
						Iterator iter3 = node3.elementIterator();
						String strs = null;
						while (iter3.hasNext()) {
							Element node4 = (Element) iter3.next();
							strs = node4.getText();
						}
						if (!"".equals(strs.trim())) {
							String[] arr = strs.split("\\.");
							outputDBName = arr[0];
							outputTabName = arr[1];
						}
						map.put("outputDBName", outputDBName);
						map.put("outputTabName", outputTabName);
					}
				}
			}
		}
		list.add(map);
		return list;
	}

	/* Generate the stdout.xml file */
	@SuppressWarnings("rawtypes")
	public void genStdoutXml(String fileName, Map listOut) {

		String jobinstance = null;
		String outputDBName = null;
		String outputTabName = null;

		jobinstance = listOut.get("jobinstanceid").toString();
		outputDBName = listOut.get("outputDBName").toString();
		outputTabName = listOut.get("outputTabName").toString();

		Document document = DocumentHelper.createDocument();
		Element response = document.addElement("response");
		Element jobinstanceid = response.addElement("jobinstanceid");
		jobinstanceid.setText(jobinstance);
		Element datasets = response.addElement("datasets");
		Element dataset = datasets.addElement("dataset");
		dataset.addAttribute("name", "outport1");
		Element row = dataset.addElement("row");
		row.setText(outputDBName + "." + outputTabName);

		try {
			Configuration conf = new Configuration();
			FileSystem fs = FileSystem.get(URI.create(fileName), conf);
			OutputStream out = fs.create(new Path(fileName),
					new Progressable() {
						public void progress() {
						}
					});
			OutputFormat format = OutputFormat.createPrettyPrint();
			format.setEncoding("UTF-8");
			XMLWriter xmlWriter = new XMLWriter(out, format);
			xmlWriter.write(document);
			xmlWriter.close();
		} catch (IOException e) {
			System.out.println(e.getMessage());
		}

	}

	/* Generate the stderr.xml file */
	@SuppressWarnings("rawtypes")
	public void genStderrXml(String fileName, Map listOut) {

		String jobinstance = null;
		String errorMessage = null;
		String errotCode = null;
		jobinstance = listOut.get("jobinstanceid").toString();
		errorMessage = listOut.get("errorMessage").toString();
		errotCode = listOut.get("errotCode").toString();

		Document document = DocumentHelper.createDocument();
		Element response = document.addElement("error");
		Element jobinstanceid = response.addElement("jobinstanceid");
		jobinstanceid.setText(jobinstance);
		Element code = response.addElement("code");
		code.setText(errotCode);
		Element message = response.addElement("message");
		message.setText(errorMessage);

		try {
			Configuration conf = new Configuration();
			FileSystem fs = FileSystem.get(URI.create(fileName), conf);
			OutputStream out = fs.create(new Path(fileName),
					new Progressable() {
						public void progress() {
						}
					});
			OutputFormat format = OutputFormat.createPrettyPrint();
			format.setEncoding("UTF-8");
			XMLWriter xmlWriter = new XMLWriter(out, format);
			xmlWriter.write(document);
			xmlWriter.close();
		} catch (IOException e) {
			System.out.println(e.getMessage());
		}
	}

	/**
	 * Get the table schema.
	 * 
	 * @param dbName
	 * @param tblName
	 * @return
	 */
	public HCatSchema getHCatSchema(String dbName, String tblName) {
		Job outputJob = null;
		HCatSchema schema = null;
		try {
			outputJob = Job.getInstance();
			outputJob.setJobName("getHCatSchema");
			outputJob.setOutputFormatClass(SerHCatOutputFormat.class);
			outputJob.setOutputKeyClass(WritableComparable.class);
			outputJob.setOutputValueClass(SerializableWritable.class);
			SerHCatOutputFormat.setOutput(outputJob,
					OutputJobInfo.create(dbName, tblName, null));
			schema = SerHCatOutputFormat.getTableSchema(outputJob
					.getConfiguration());
		} catch (IOException e) {
			e.printStackTrace();
		}
		return schema;
	}

	/**
	 * Create a table with the given database name, table name and schema.
	 * 
	 * @param dbName
	 * @param tblName
	 * @param schema
	 */
	public void createTable(String dbName, String tblName, HCatSchema schema) {
		HiveMetaStoreClient client = null;
		try {
			HiveConf hiveConf = HCatUtil.getHiveConf(new Configuration());
			client = HCatUtil.getHiveClient(hiveConf);
		} catch (MetaException | IOException e) {
			e.printStackTrace();
		}
		try {
			if (client.tableExists(dbName, tblName)) {
				client.dropTable(dbName, tblName);
			}
		} catch (TException e) {
			e.printStackTrace();
		}
		// Get the field schemas from the table schema
		List<FieldSchema> fields = HCatUtil.getFieldSchemaList(schema
				.getFields());
		// Build the table object
		Table table = new Table();
		table.setDbName(dbName);
		table.setTableName(tblName);

		StorageDescriptor sd = new StorageDescriptor();
		sd.setCols(fields);
		table.setSd(sd);
		sd.setInputFormat(RCFileInputFormat.class.getName());
		sd.setOutputFormat(RCFileOutputFormat.class.getName());
		sd.setParameters(new HashMap<String, String>());
		sd.setSerdeInfo(new SerDeInfo());
		sd.getSerdeInfo().setName(table.getTableName());
		sd.getSerdeInfo().setParameters(new HashMap<String, String>());
		sd.getSerdeInfo().getParameters()
				.put(serdeConstants.SERIALIZATION_FORMAT, "1");
		sd.getSerdeInfo().setSerializationLib(
				org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.class
						.getName());
		Map<String, String> tableParams = new HashMap<String, String>();
		table.setParameters(tableParams);
		try {
			client.createTable(table);
			System.out.println("Create table successfully!");
		} catch (TException e) {
			e.printStackTrace();
			return;
		} finally {
			client.close();
		}
	}
	
	/**
	 * Reads a file line by line; commonly used for line-oriented formatted files.
	 */
    public static void readFileByLines(String fileName) {
        File file = new File(fileName);
        BufferedReader reader = null;
        try {
            System.out.println("以行为单位读取文件内容,一次读一整行:");
            reader = new BufferedReader(new FileReader(file));
            String tempString = null;
            int line = 1;
            // Read one line at a time; null means end of file
            while ((tempString = reader.readLine()) != null) {
                // print the line number
                System.out.println("line " + line + ": " + tempString);
                line++;
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e1) {
                }
            }
        }
    }

}
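
For illustration, a hypothetical standalone snippet showing how SparkTest drives this class (same package assumed; the stdin.xml path is the one used in the spark-submit example above):

package iie.udps.example.spark;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OperatorParamXmlUsage {
	@SuppressWarnings("rawtypes")
	public static void main(String[] args) throws Exception {
		OperatorParamXml operXML = new OperatorParamXml();

		// Parse the operator parameters from the stdin.xml on HDFS.
		List<Map> stdinList = operXML.parseStdinXml("/user/hdfs/TestStdin2.xml");
		Map params = stdinList.get(0);
		System.out.println(params.get("inputDBName") + "." + params.get("inputTabName"));

		// Write a stdout.xml of the kind SparkTest produces on success.
		Map<String, String> stdoutMap = new HashMap<String, String>();
		stdoutMap.put("jobinstanceid", params.get("jobinstanceid").toString());
		stdoutMap.put("outputDBName", params.get("outputDBName").toString());
		stdoutMap.put("outputTabName", params.get("outputTabName").toString());
		operXML.genStdoutXml(params.get("tempHdfsBasePath") + "/stdout.xml", stdoutMap);
	}
}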



Input table data:

hive> select * from test_in; 
OK
1 20
2 20
3 21
4 20
5 21
6 20
7 21
8 19
9 19
10 21

Output table data (number of rows per age, from the GroupByAge count):

hive> select * from test_out;
OK
19 2
21 4
20 4
