Big Data Series: Reading and Writing External Databases with PySpark

Using MySQL and HBASE as examples, this article gives a brief introduction to how Spark reads from and writes to external databases via PyMySQL and the Hadoop API operators.


1. Reading and Writing MySQL with PySpark

For the MySQL environment setup, refer to "数据库系列之MySQL主从复制集群部署" (Database Series: Deploying a MySQL Master-Slave Replication Cluster).

1.1 The PyMySQL and MySQLdb Modules

PyMySQL is a library used to connect to a MySQL server from Python 3.x; Python 2 traditionally used MySQLdb, and PyMySQL also supports Python 2. Install the PyMySQL module with the following command:

pip install PyMySQL


Connect to the MySQL database:

import pymysql
# Open a database connection
db = pymysql.connect(host="localhost", user="testuser",
                     password="test123", database="TESTDB")
# Create a cursor object with the cursor() method
cursor = db.cursor()
# Execute an SQL query with the execute() method
cursor.execute("SELECT VERSION()")
# Fetch a single row with the fetchone() method
data = cursor.fetchone()
print("Database version : %s " % data)
# Close the database connection
db.close()
1.2 Writing Spark Data to MySQL

1) Start the MySQL service and check it

[root@tango-01 bin]# ./mysqld_safe &
[root@tango-01 bin]# 180814 15:50:02 mysqld_safe Logging to '/usr/local/mysql/data/error.log'.
180814 15:50:02 mysqld_safe Starting mysqld daemon with databases from /usr/local/mysql/data
[root@tango-01 bin]# ps -ef|grep mysql

2) Create the MySQL table

[root@tango-01 bin]# ./mysql -u root -proot
mysql> use test;
mysql> create table test_spark(id int(4),info char(8),name char(20),sex char(2));
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| test_spark     |
+----------------+
1 row in set (0.00 sec)

3) Write data into MySQL

  • Start the Jupyter notebook

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"  HADOOP_CONF_DIR=/usr/local/spark/hadoop-2.9.0/etc/hadoop  pyspark
  • Establish the MySQL connection and write the data

from pyspark import SparkContext
from pyspark import SparkConf
import pymysql

rawData = ['1,info1,tango,F', '2,info2,zhangsan,M']
conn = pymysql.connect(user="root", passwd="xxxxxx", host="192.168.112.10",
                       db="test", charset="utf8")
cursor = conn.cursor()
for i in range(len(rawData)):
    retData = rawData[i].split(',')
    id = retData[0]
    info = retData[1]
    name = retData[2]
    sex = retData[3]
    sql = "insert into test_spark(id,info,name,sex) values('%s','%s','%s','%s')" % (id, info, name, sex)
    cursor.execute(sql)
conn.commit()
conn.close()
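
The loop above builds each INSERT statement by hand with string formatting. A slightly more idiomatic sketch (an alternative, not part of the original walkthrough) lets PyMySQL escape the values and batches the insert with cursor.executemany(), assuming the same test_spark table and connection parameters:

import pymysql

rawData = ['1,info1,tango,F', '2,info2,zhangsan,M']
conn = pymysql.connect(user="root", passwd="xxxxxx", host="192.168.112.10",
                       db="test", charset="utf8")
cursor = conn.cursor()
# Let the driver escape the values instead of formatting the SQL string manually
rows = [tuple(line.split(',')) for line in rawData]
cursor.executemany(
    "insert into test_spark(id,info,name,sex) values(%s,%s,%s,%s)", rows)
conn.commit()
conn.close()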


  • Query the MySQL table data to confirm the inserted rows (a verification sketch follows)

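A minimal verification sketch (not part of the original screenshots), assuming the same connection parameters as above, reads the rows back with PyMySQL:

import pymysql

conn = pymysql.connect(user="root", passwd="xxxxxx", host="192.168.112.10",
                       db="test", charset="utf8")
cursor = conn.cursor()
cursor.execute("select id, info, name, sex from test_spark")
for row in cursor.fetchall():
    print(row)
conn.close()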

1.3 Reading MySQL Data with Spark

1) Download the mysql-connector-java driver and place it under the jars directory of the Spark installation


2) Run pyspark and execute the following statements

[root@tango-spark01 jars]# pyspark
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> dataframe_mysql = sqlContext.read.format("jdbc").\
... options(url="jdbc:mysql://192.168.112.10:3306/test", driver="com.mysql.jdbc.Driver",
... dbtable="test_spark", user="root", password="xxxxxx").load()
>>> dataframe_mysql.show()

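In Spark 2.x the same read can also go through the SparkSession DataFrameReader; the following is a sketch under the same connection assumptions (URL, table, and credentials as above), not a required alternative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_mysql").getOrCreate()
# jdbc() takes the connection URL, the table name, and a properties dict
df = spark.read.jdbc(
    url="jdbc:mysql://192.168.112.10:3306/test",
    table="test_spark",
    properties={"user": "root", "password": "xxxxxx",
                "driver": "com.mysql.jdbc.Driver"})
df.show()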

2. Reading and Writing HBASE with PySpark

For the HBASE environment setup, refer to "大数据系列之HBASE集群环境部署" (Big Data Series: Deploying an HBASE Cluster Environment). The versions used here are HBASE 1.2.6, Hadoop 2.9.0, and Spark 2.3.0. Note: a higher HBASE version such as 2.1.0 runs into method-not-found (NoSuchMethodError) interface problems.

2.1 Spark Operators for Reading and Writing HBASE

1) The saveAsNewAPIHadoopDataset operator

The Spark operator saveAsNewAPIHadoopDataset writes an RDD out to any Hadoop-supported storage system using the new Hadoop API, driven by a Hadoop Configuration object for that storage system. Its parameters are listed below, followed by a minimal usage sketch:

saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
- conf: the HBASE output configuration
- keyConverter: converter class for the output key
- valueConverter: converter class for the output value
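
A minimal sketch of how the operator is called with the converters used in this article: with StringToImmutableBytesWritableConverter and StringListToPutConverter, each RDD element must be a (row key, [row, column family, qualifier, value]) pair. The configuration values anticipate section 2.2 below; the record itself is only an illustrative example.

keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf = {"hbase.zookeeper.quorum": "192.168.112.101",
        "hbase.mapred.outputtable": "spark_hbase",
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
# Element shape expected by the converters: (row key, [row, column family, qualifier, value])
record = ('2018001', ['2018001', 'userinfo', 'name', 'zhangsan'])
sc.parallelize([record]).saveAsNewAPIHadoopDataset(
    conf=conf, keyConverter=keyConv, valueConverter=valueConv)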

2) The newAPIHadoopRDD operator

newAPIHadoopRDD reads data through the new Hadoop API; its parameters are as follows:

newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
- inputFormatClass: name of the Hadoop InputFormat class
- keyClass: name of the key Writable class
- valueClass: name of the value Writable class
- keyConverter: converter class for the input key
- valueConverter: converter class for the input value
- conf: the HBASE input configuration
- batchSize: number of Python objects represented as a single Java object (default 0, chosen automatically)
2.2 Writing Spark Data to HBASE

1) Start the HBASE service

[root@tango-spark01 hbase-2.1.0]# ./bin/start-hbase.sh

Use jps on the Master and Slave servers to check for the HMaster and HRegionServer processes:

[root@tango-spark01 logs]# jps
1859 ResourceManager
1493 NameNode
4249 HMaster
5578 Jps
1695 SecondaryNameNode
[root@tango-spark02 conf]# jps
1767 NodeManager
3880 HRegionServer
1627 DataNode
4814 Jps

Note: the ZooKeeper cluster and the Hadoop cluster must be running before HBASE is started.

2) Create the HBASE table

hbase(main):027:0> create 'spark_hbase','userinfo'
Created table spark_hbase
Took 2.6556 seconds
=> Hbase::Table - spark_hbase
hbase(main):028:0> put 'spark_hbase','2018001','userinfo:name','zhangsan'
Took 0.0426 seconds
hbase(main):029:0> put 'spark_hbase','2018001','userinfo:age','16'
Took 0.0079 seconds
hbase(main):030:0> put 'spark_hbase','2018001','userinfo:sex','M'

3) Configure Spark

Spark 2.0 and later no longer bundle the jar that converts HBASE data into a form readable from Python, so it must be downloaded separately from https://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.11/1.6.0-typesafe-001


  • Upload the jar package to the Spark jars directory

[root@tango-spark01 jars]# pwd
/usr/local/spark/spark-2.3.0/jars
[root@tango-spark01 jars]# mkdir hbase
[root@tango-spark01 jars]# cd hbase
[root@tango-spark01 hbase]# ls
spark-examples_2.11-1.6.0-typesafe-001.jar
  • Edit spark-env.sh and add the following:

export SPARK_DIST_CLASSPATH=$(/usr/local/spark/hadoop-2.9.0/bin/hadoop classpath):$(/usr/local/spark/hbase-2.1.0/bin/hbase classpath):/usr/local/spark/spark-2.3.0/jars/hbase/*
  • Copy the required jars from the HBASE lib directory to Spark

[root@tango-spark01 lib]# pwd
/usr/local/spark/hbase-2.1.0/lib
[root@tango-spark01 lib]# cp -f hbase-* /usr/local/spark/spark-2.3.0/jars/hbase/
[root@tango-spark01 lib]# cp -f guava-11.0.2.jar /usr/local/spark/spark-2.3.0/jars/hbase/
[root@tango-spark01 lib]# cp -f htrace-core-3.1.0-incubating.jar /usr/local/spark/spark-2.3.0/jars/hbase/
[root@tango-spark01 lib]# cp -f protobuf-java-2.5.0.jar /usr/local/spark/spark-2.3.0/jars/hbase/
  • Restart HBASE

[root@tango-spark01 hbase-2.1.0]# ./bin/stop-hbase.sh
[root@tango-spark01 hbase-2.1.0]# ./bin/start-hbase.sh

4) Write data into HBASE

  • Start the Jupyter notebook

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"  HADOOP_CONF_DIR=/usr/local/spark/hadoop-2.8.3/etc/hadoop  pyspark
  • Initialize the configuration

zk_host="192.168.112.101"
table = "spark_hbase"
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf = {"hbase.zookeeper.quorum": zk_host,"hbase.mapred.outputtable": table,
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
  • Initialize the data and convert it to an RDD

rawData = ['2018003,userinfo,name,Lily','2018004,userinfo,name,Tango','2018003,userinfo,age,22','2018004,userinfo,age,28']
print(rawData)
rddRow = sc.parallelize(rawData).map(lambda x: (x[0:7],x.split(',')))
rddRow.take(5)


  • Call saveAsNewAPIHadoopDataset to write the data into HBASE

rddRow.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
  • Scan the HBASE table to confirm that the data has been inserted


2.3 Reading HBASE Data with Spark

Spark reads HBASE data with the newAPIHadoopRDD operator.

1) Initialize the configuration

host = '192.168.112.101'
table = 'spark_hbase'
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

2) Call newAPIHadoopRDD to read the HBASE data

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv, valueConverter=valueConv, conf=conf)
count = hbase_rdd.count()
hbase_rdd.cache()
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)

The output lists each row key together with the string-converted cells of that row.

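With the spark-examples 1.6 converter used here, each value string typically contains one JSON document per cell, separated by newlines; the following is a hedged sketch (field names assumed from that converter) for turning the result into per-cell Python dictionaries:

import json

# Assumption: HBaseResultToStringConverter emits one JSON string per cell,
# joined by newlines, with fields such as "row", "qualifier" and "value".
cells = hbase_rdd.flatMap(lambda kv: kv[1].split("\n")).map(json.loads)
for cell in cells.collect():
    print(cell.get("row"), cell.get("qualifier"), cell.get("value"))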


References

  1. http://spark.apache.org/docs/latest/api/python/pyspark.html

  2. 数据库系列之MySQL主从复制集群部署 (Database Series: Deploying a MySQL Master-Slave Replication Cluster)

  3. 大数据系列之HBASE集群环境部署 (Big Data Series: Deploying an HBASE Cluster Environment)
