(1) Copy Hadoop's hdfs-site.xml and core-site.xml files into the spark/conf directory.
(2) Append the following line to the spark-defaults.conf file:
spark.files file:///home/hadoop/spark/conf/hdfs-site.xml,file:///home/hadoop/spark/conf/core-site.xml
Without this setting, the following error occurs:
java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:231)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
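The same files can also be shipped programmatically when the SparkContext is built, instead of editing spark-defaults.conf. A minimal sketch for a standalone application (the app name "lzo-reader" is arbitrary; in spark-shell the context already exists):
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to the spark-defaults.conf entry above: ship the Hadoop
// client configs to the executors so they can resolve the HA
// nameservice "mycluster".
val conf = new SparkConf()
  .setAppName("lzo-reader") // hypothetical app name
  .set("spark.files",
    "file:///home/hadoop/spark/conf/hdfs-site.xml," +
    "file:///home/hadoop/spark/conf/core-site.xml")
val sc = new SparkContext(conf)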
(3) Read an LZO file from HDFS, splitting it for parallel execution:
import org.apache.hadoop.io._
import com.hadoop.mapreduce._

// LzoTextInputFormat (from hadoop-lzo) decompresses the .lzo file and
// yields (byte offset, line) records, like TextInputFormat does.
val data = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
  "hdfs://mycluster/user/hive/warehouse/logs_app_nginx/logdate=20160322/loghost=70/var.log.nginx.access_20160322.log.70.lzo")
data.count()
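To verify that the file actually decompresses, it can help to inspect a few records. The values are Hadoop Text objects, so convert them to String first; a small sketch:
// Print the first 5 decompressed log lines (keys are byte offsets,
// values are org.apache.hadoop.io.Text).
data.map(_._2.toString).take(5).foreach(println)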
(4) Read the files under directories (and subdirectories) matched by a wildcard HDFS path, splitting them for parallel execution:
import org.apache.hadoop.io._
import com.hadoop.mapreduce._

// The loghost=* wildcard expands to every matching partition directory;
// all .lzo files underneath are read into a single RDD.
val dirdata = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
  "hdfs://mycluster/user/hive/warehouse/logs_app_nginx/logdate=20160322/loghost=*/")
dirdata.count()
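One way to check how the input was split is to look at the partition count. Note that .lzo files are typically splittable only when an LZO index file has been created alongside them; otherwise each file is read as a single split. A minimal sketch:
// Each input split becomes one partition; with LZO indexes present,
// large files can be split into several partitions each.
println(dirdata.partitions.length)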