Importing data into MySQL with PySpark
# df_spark is a Spark DataFrame prepared earlier
url = 'jdbc:mysql://127.0.0.1:3306/test?autoReconnect=true'
table = "000001"  # target table name (here a stock code)
mode = "overwrite"
properties = {"user": "root",
              "password": "123456",
              "driver": "com.mysql.jdbc.Driver"}
df_spark.write.jdbc(url, table, mode, properties)
print("success")
Then run in the terminal:
# The mysql-connector-java-5.1.44-bin.jar package must be included
spark-submit --master yarn --driver-class-path /home/hadoop/apps/spark-2.3.1/mysql-connector-java-5.1.44-bin.jar --jars /home/hadoop/apps/spark-2.3.1/mysql-connector-java-5.1.44-bin.jar 导入文件.py
The following errors came up:
1. CommunicationsException: Communications link failure
The database connection failed. Add automatic reconnection to the url: url = 'jdbc:mysql://127.0.0.1:3306/stock2013?autoReconnect=true'
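The fix above amounts to appending query options to the JDBC url. A minimal sketch of how such a url is assembled (the helper name jdbc_url is hypothetical, not part of any library):

```python
# Hypothetical helper: build a MySQL JDBC url with extra query options
def jdbc_url(host, port, db, **options):
    base = "jdbc:mysql://%s:%d/%s" % (host, port, db)
    if options:
        base += "?" + "&".join("%s=%s" % (k, v) for k, v in sorted(options.items()))
    return base

print(jdbc_url("127.0.0.1", 3306, "stock2013", autoReconnect="true"))
# → jdbc:mysql://127.0.0.1:3306/stock2013?autoReconnect=true
```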
2. org.apache.spark.SparkException: Job aborted due to stage failure:
The main cause here was that the source data was too large for the machine's memory: long-running tasks timed out and were disconnected, and the data could not be exchanged for computation effectively, so more memory was needed. It may also have been a single node running out of memory, so instead of running the job on YARN it was run locally.
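Both remedies can be applied through spark-submit flags: switch --master to local and raise the memory settings. The memory values below are placeholders to tune for your machine, not settings from the original run:

```shell
# Run locally with more memory (4g values are examples, adjust as needed)
spark-submit --master local[*] \
  --driver-memory 4g \
  --executor-memory 4g \
  --driver-class-path /home/hadoop/apps/spark-2.3.1/mysql-connector-java-5.1.44-bin.jar \
  --jars /home/hadoop/apps/spark-2.3.1/mysql-connector-java-5.1.44-bin.jar \
  导入文件.py
```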
Accessing HDFS data and writing it to a CSV file
from hdfs import Client  # only Client is used; avoid the wildcard import
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
# Connect to the HDFS NameNode web interface
client = Client("http://127.0.0.1:50070", root="/", timeout=100, session=False)
file_list = client.list('/user/test')
csv = os.path.join('/user/test', file_list[0])
name = file_list[0].split('.')[0]  # was filename.split('.')[0], but filename was undefined
df_spark = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(csv)
df = df_spark.toPandas()
print(df.info())
# Do not add a directory path here: writing the csv file directly works,
# but with a path prefix it reports that the path cannot be found
df.to_csv("%s.csv" % name, header=True)
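The path and base-name handling above can be checked without a live HDFS, assuming a file listing like the one client.list('/user/test') would return (the file names below are made-up stand-ins):

```python
import os

# Stand-in for client.list('/user/test'); these names are assumptions
file_list = ["000001.csv", "000002.csv"]

csv = os.path.join('/user/test', file_list[0])  # full HDFS path to load
name = file_list[0].split('.')[0]               # base name for the output file

print(csv)   # → /user/test/000001.csv
print(name)  # → 000001
```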