spark-sql Test Summary

I've been tinkering with spark-sql lately. My earlier tests all used very small datasets, but since my cluster is only 6 virtual machines with limited resources the data can't be too large either, so I went looking for a reference and found this post:

http://colobu.com/2014/12/11/spark-sql-quick-start/
Spark SQL First Look: Analyzing 20 Million Records with Big Data


############## Don't ask me where to download the data; search for it yourself. I deleted it when I was done.
1. File check: use the shell commands wc and awk to verify the row and column counts.

############ head the file first: every file has a column header and is comma-separated. Because the records contain names and other private information, only the header is printed; actual records start at the second line.

[hue@snn 2000w]$ head -1 1-200W.csv
Name,CardNo,Descriot,CtfTp,CtfId,Gender,Birthday,Address,Zip,Dirty,District1,District2,District3,District4,District5,District6,FirstNm,LastNm,Duty,Mobile,Tel,Fax,EMail,Nation,Taste,Education,Company,CTel,CAddress,CZip,Family,Version,id
[hue@snn 2000w]$
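
A quick way to count the header's fields on the spot (a convenience sketch; the whole-file check with awk follows below):

[hue@snn 2000w]$ head -1 1-200W.csv | awk -F, '{print NF}'    # prints the field count of the header line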

############ wc to check the line counts

[hadoop@snn 2000w]$ cat 1000W-1200W.csv | wc -l
2000050
[hadoop@snn 2000w]$ cat 1200W-1400W.csv | wc -l
2000205
[hadoop@snn 2000w]$ cat 1-200W.csv | wc -l
2000094
[hadoop@snn 2000w]$
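
Rather than running wc once per file, a single call can count every CSV and print a grand total (a convenience sketch, not from the original run):

[hadoop@snn 2000w]$ wc -l *.csv    # per-file line counts plus a final "total" line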

############ awk to check the column count: 33 columns

[hadoop@snn 2000w]$ awk 'BEGIN {FS=","}END{print "Filename:" FILENAME ",Linenumber:" NR ",Columns:" NF}' 1000W-1200W.csv

Filename:1000W-1200W.csv,Linenumber:2000050,Columns:33
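
Note that NF in the END block only reflects the last line read, so a ragged row in the middle would slip through. A stricter check (a sketch, assuming no field contains an embedded comma inside quotes) flags any line whose field count differs from 33:

[hadoop@snn 2000w]$ awk -F, 'NF != 33 {print FILENAME ": line " FNR " has " NF " fields"}' *.csv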


####################################

2. Create a directory in HDFS and put the files there.

[hue@snn ~]$ hadoop fs -mkdir /user/hue/external/2000w
[hue@snn ~]$ hadoop fs -put /opt/2000w/* /user/hue/external/2000w/
[hue@snn ~]$ hadoop fs -ls -R /user/hue/external/2000w/
-rw-r--r--   3 hue hue  348173735 2015-12-17 14:36 /user/hue/external/2000w/1-200W.csv
-rw-r--r--   3 hue hue  317365192 2015-12-17 14:36 /user/hue/external/2000w/1000W-1200W.csv
-rw-r--r--   3 hue hue  307266272 2015-12-17 14:36 /user/hue/external/2000w/1200W-1400W.csv
-rw-r--r--   3 hue hue  319828719 2015-12-17 14:36 /user/hue/external/2000w/1400W-1600W.csv
-rw-r--r--   3 hue hue  310125772 2015-12-17 14:37 /user/hue/external/2000w/1600w-1800w.csv
-rw-r--r--   3 hue hue  298454235 2015-12-17 14:37 /user/hue/external/2000w/1800w-2000w.csv
-rw-r--r--   3 hue hue  311349431 2015-12-17 14:38 /user/hue/external/2000w/200W-400W.csv
-rw-r--r--   3 hue hue  311013782 2015-12-17 14:38 /user/hue/external/2000w/400W-600W.csv
-rw-r--r--   3 hue hue  308703632 2015-12-17 14:38 /user/hue/external/2000w/600W-800W.csv
-rw-r--r--   3 hue hue  310797175 2015-12-17 14:38 /user/hue/external/2000w/800W-1000W.csv
-rw-r--r--   3 hue hue    7487744 2015-12-17 14:38 /user/hue/external/2000w/last_5000.csv
[hue@snn ~]$
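
As a final sanity check on the upload, hadoop fs -du can report the directory's total size (not part of the original run):

[hue@snn ~]$ hadoop fs -du -s -h /user/hue/external/2000w/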

####################################

3. Create an external table; the files can then be queried in place, with no need to move them.

CREATE EXTERNAL TABLE IF NOT EXISTS external_2000w
(
Name String,
CardNo String,
Descriot String,
CtfTp String,
CtfId String,
Gender String,
Birthday String,
Address String,
Zip String,
Dirty String,
District1 String,
District2 String,
District3 String,
District4 String,
District5 String,
District6 String,
FirstNm String,
LastNm String,
Duty String,
Mobile String,
Tel String,
Fax String,
EMail String,
Nation String,
Taste String,
Education String,
Company String,
CTel String,
CAddress String,
CZip String,
Family String,
Version String,
id int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION '/user/hue/external/2000w/';
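
Each CSV still carries its header line as its first record, so raw counts will be inflated by one row per file; Hive's TBLPROPERTIES ("skip.header.line.count"="1") can skip headers (Spark SQL support for this property varies by version), or the header rows can simply be filtered out in queries. With the table defined, a first smoke test from the spark-sql CLI could look like this (a sketch; -e runs a single statement and exits):

[hue@snn ~]$ spark-sql -e "SELECT COUNT(*) FROM external_2000w"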