PySpark Study Notes
Writing data to MySQL from PySpark
- Copy the MySQL JDBC driver jar into Spark's jars directory.
- My MySQL driver jar: mysql-connector-java-&lt;version&gt;.jar; download the version that matches your MySQL server.
- My Spark jars directory: /Users/shylin/Downloads/spark-2.4.5-bin-hadoop2.7/jars
- Code for writing to MySQL:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark_sql") \
    .getOrCreate()

# Read a local JSON file into a DataFrame
df = spark.read.json("file:///Users/shylin/Desktop/work/spark_study/people.json")

# Write the DataFrame to MySQL over JDBC
url = "jdbc:mysql://127.0.0.1:3306/zxl_test"
table = 'person'
mode = 'overwrite'
properties = {'user': 'root', 'password': 'root'}
df.write.jdbc(url, table, mode, properties)
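If Spark cannot resolve the driver class on its own, it helps to name it explicitly in the connection properties; reading the table back is also a quick way to verify the write. A minimal sketch, assuming Connector/J 5.x and the same local database as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pyspark_sql").getOrCreate()

url = "jdbc:mysql://127.0.0.1:3306/zxl_test"
properties = {
    'user': 'root',
    'password': 'root',
    # Driver class for Connector/J 5.x (an assumption; Connector/J 8.x uses com.mysql.cj.jdbc.Driver)
    'driver': 'com.mysql.jdbc.Driver',
}

# Read the person table written above back into a DataFrame to verify the write
df = spark.read.jdbc(url, 'person', properties=properties)
df.show()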
Reading Hive data with PySpark
- Copy the Hive and HDFS config files into Spark's conf directory.
- Hive/HDFS config files: core-site.xml, hdfs-site.xml, hive-site.xml
- My Spark conf directory: /Users/shylin/Downloads/spark-2.4.5-bin-hadoop2.7/conf
- Set the environment variable in the PyCharm run configuration: HADOOP_USER_NAME=zhang.xl (or set it from the script itself, as shown after the code below)
- Code for reading Hive:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark_hive") \
    .enableHiveSupport() \
    .getOrCreate()

# Query a Hive table through the Spark SQL engine
spark.sql("select count(*) from dc_ods.bi_mdm_org_organ_brand_ods").show()
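If you would rather not touch the PyCharm run configuration, the same variable can be set from the script itself, as long as it happens before the SparkSession starts. A minimal sketch assuming the same user and table as above; spark.table is an equivalent way to load a Hive table as a DataFrame:

import os

# Must be set before the SparkSession (and its JVM) is created
os.environ['HADOOP_USER_NAME'] = 'zhang.xl'

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark_hive") \
    .enableHiveSupport() \
    .getOrCreate()

# Load the Hive table as a DataFrame instead of issuing raw SQL
df = spark.table("dc_ods.bi_mdm_org_organ_brand_ods")
print(df.count())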
Integrating PySpark Streaming with Kafka
- Download spark-streaming-kafka-0-8_2.11-2.4.5.jar from the Maven repository and place it in Spark's jars directory.
- Running the streaming code at this point still fails with:
  java.lang.ClassNotFoundException: kafka.common.TopicAndPartition
- To fix it, download Kafka 0.8 and copy four jars from its libs directory into Spark's jars directory: kafka_2.11-0.8.2.2.jar, kafka-clients-0.8.2.2.jar, metrics-core-2.2.0.jar, zkclient-0.3.jar. A minimal consumer sketch follows below.
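With those jars in place, a basic direct-stream consumer can be written with KafkaUtils from the 0-8 integration. A minimal sketch; the broker address, topic name, and batch interval below are placeholders, not values from this setup:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "pyspark_streaming_kafka")
# 5-second micro-batches (interval chosen arbitrarily for the example)
ssc = StreamingContext(sc, 5)

# Direct stream: connects straight to the brokers, no ZooKeeper receiver
# (broker address and topic name are placeholders)
kafka_params = {"metadata.broker.list": "127.0.0.1:9092"}
stream = KafkaUtils.createDirectStream(ssc, ["test_topic"], kafka_params)

# Each record arrives as a (key, value) pair; print each batch's values
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()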