Method 1
Via the Thrift interface. This is the simplest approach, but access is slow, and the Thrift socket has a timeout.
See: Using Python to operate HBase via HBase-Thrift
Iterating over an RDD this way breaks down when the RDD is very large.
Enhancing the Thrift interface with happybase
Installing happybase
The installation failed; attempted fix (CentOS 7):
yum install python-devel
Installing happybase still failed, so the only option left was the raw Thrift interface.
Method 2
Via the newAPIHadoopRDD interface. Several attempts failed at first, but after some research this approach finally worked (see the reference link).
Error handling
In a distributed setup, the jar must be copied to every node and the configuration updated.
Copying with scp is the simplest way:
scp /var/lib/spark/jars/hbase/spark-examples_2.11-1.6.0-typesafe-001.jar root@192.168.100.13:/var/lib/spark/jars/hbase/spark-examples_2.11-1.6.0-typesafe-001.jar
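With the spark-examples jar distributed to every node, the newAPIHadoopRDD call can be sketched as below. This is an untested sketch modeled on Spark's bundled hbase_inputformat.py example: the converter classes live in that jar, while the quorum and table names are placeholders for your cluster.

```python
def hbase_read_conf(quorum, table):
    # Standard Hadoop/HBase job configuration keys for TableInputFormat.
    return {
        "hbase.zookeeper.quorum": quorum,
        "hbase.mapreduce.inputtable": table,
    }

def read_hbase_table(sc, quorum, table):
    # Requires a live HBase cluster and the spark-examples jar on every node.
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=hbase_read_conf(quorum, table),
    )

# e.g. in the pyspark shell:
# read_hbase_table(sc, "cdh-192-168-100-11", "test").collect()
```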
Method 3
After a lot of trial and error, this approach also finally worked.
Modify the startup parameters:
pyspark2 --conf spark.kryoserializer.buffer.max=1024m --conf spark.driver.maxResultSize=20G --conf spark.driver.memory=20G --total-executor-cores=100 --executor-memory=10G --executor-cores=2 --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
Starting up takes a long time; run the following only on the node where the spark shell was launched.
import datetime
from pyspark import SparkConf, SparkContext
# The driver is the machine that connects to the Spark cluster, so the host configuration must match the machine where this code runs
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import udf
from pyspark.sql.types import *
conf = SparkConf()
conf.set("dfs.socket.timeout", 300000)  # note: this conf object only takes effect if passed to the context/session
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
root="/user/XieHongjun/"
file=root + "all.parquet"
df = spark.read.parquet(file)
df.head(10)
catalog = ''.join("""{
"table":{"namespace":"default", "name":"test"},
"rowkey":"key",
"columns":{
"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
"code":{"cf":"result", "col":"code", "type":"string"},
"date":{"cf":"result", "col":"date", "type":"string"},
"time":{"cf":"result", "col":"time", "type":"string"},
"price":{"cf":"result", "col":"price", "type":"float"},
"ratio":{"cf":"result", "col":"ratio", "type":"float"},
"bigratio":{"cf":"result", "col":"bigratio", "type":"float"},
"timestamp":{"cf":"result", "col":"timestamp", "type":"string"}
}
}""".split())
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
df1.write.options(catalog=catalog) \
.mode('overwrite') \
.format(data_source_format) \
.option("zookeeper.znode.parent", "/hbase-unsecure") \
.option("hbase.zookeeper.quorum", "cdh-192-168-100-11,cdh-192-168-100-12,cdh-192-168-100-13") \
.option("hbase.zookeeper.property.clientPort", "2181") \
.option("newTable", "5") \
.option("hbase.cluster.distributed",True) \
.save()
You must first create the HBase table with the hbase shell.
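For the catalog above, the table needs the result column family; assuming the default namespace and a table named test, a minimal creation from the command line could look like this (a sketch; adjust names to your setup):

```shell
# Create table 'test' with the 'result' column family referenced by the catalog
echo "create 'test', 'result'" | hbase shell
```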
Writing small amounts of data now works, but with huge datasets the write to HBase fails:
WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
19/06/10 18:50:21 WARN zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
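Note the quorum=localhost:2181 in the log: the client fell back to the default quorum, which typically means the real quorum setting (hbase-site.xml or the connector options) never reached that process. Independently of that, each quorum member can be probed with ZooKeeper's ruok four-letter command; the helper below is a generic sketch (use your own quorum host names):

```python
import socket

def zk_ok(host, port=2181, timeout=2.0):
    """Send ZooKeeper's 'ruok' four-letter command; True if it answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False

# e.g. for host in ["cdh-192-168-100-11", "cdh-192-168-100-12"]:
#          print(host, zk_ok(host))
```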
Following references on tuning HBase performance did not solve the problem of RegionServer processes dying.
19/06/12 15:05:18 INFO client.AsyncProcess: #4, table=test, attempt=25/35 failed=3833ops, last exception: java.net.ConnectException: Connection refused on cdh-192-168-100-16,60020,1560260036428, tracking started null, retrying after=10032ms, replay=3833ops
19/06/12 15:05:18 INFO client.AsyncProcess: #3, table=test, attempt=25/35 failed=3833ops, last exception: java.net.ConnectException: Connection refused on cdh-192-168-100-16,60020,1560260036428, tracking
After countless fruitless web searches, I finally looked at the RegionServer logs and found OOM entries.
- export HBASE_REGIONSERVER_OPTS='-Xms52428800 -Xmx52428800 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:ReservedCodeCacheSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase_hbase-REGIONSERVER-00ccac38cdc5566a0bbb251eb51faae5_pid20975.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh'
Note that these options size the heap in bytes: -Xms52428800/-Xmx52428800 is only 50 MB, which explains why the RegionServers hit OOM under heavy writes.
Raising the heap in the JVM settings finally solved the problem.
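A corrected setting might look like the following. The 8 GB heap here is illustrative, not the value from the original fix; pick a size that fits your nodes' RAM. The key point is that the old value was a tiny byte count, so use a unit suffix or a realistically large number:

```shell
# Illustrative only: give the RegionServer a multi-GB heap instead of 50 MB
export HBASE_REGIONSERVER_OPTS='-Xms8g -Xmx8g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled \
  -XX:ReservedCodeCacheSize=256m \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase_regionserver.hprof'
```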
Reading from HBase back into a DataFrame
df = spark.read.options(catalog=catalog) \
.format("org.apache.spark.sql.execution.datasources.hbase") \
.option("zookeeper.znode.parent", "/hbase-unsecure") \
.option("hbase.zookeeper.quorum", "cdh-192-168-100-11,cdh-192-168-100-12,cdh-192-168-100-13") \
.option("hbase.zookeeper.property.clientPort", "2181") \
.option("hbase.cluster.distributed",True) \
.load()
df.show()
This fails with the following error:
Py4JJavaError: An error occurred while calling o225.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, cdh-192-168-100-14, executor 3): java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCaching(I)Lorg/apache/hadoop/hbase/client/Scan;
This NoSuchMethodError typically indicates an HBase client-jar version mismatch on the executors: an older hbase-client (where Scan.setCaching returns void) is shadowing the version the connector was compiled against.
A minimal end-to-end test of the connector:
catalog = ''.join("""{
"table":{"namespace":"test", "name":"test_table"},
"rowkey":"key",
"columns":{
"col0":{"cf":"rowkey", "col":"key", "type":"string"},
"col1":{"cf":"result", "col":"class", "type":"string"}
}
}""".split())
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])
df.show()
df.write.options(catalog=catalog,newTable="5").format(data_source_format).save()
df_read = spark.read.options(catalog=catalog).format(data_source_format).load()
df_read.show()