While doing data processing with PySpark I keep running into the same pitfalls, so I am recording them here:
(1) Configuration: when a row has too many fields, you need to configure the field-count limit (spark.debug.maxToStringFields). Note that the number must be passed as a string, otherwise it raises an error like:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:5825
The erroneous code:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .master('local[*]') \
    .config("spark.some.config.option", "some-value") \
    .config('spark.debug.maxToStringFields', 50) \
    .getOrCreate()
The correct code:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .master('local[*]') \
    .config("spark.some.config.option", "some-value") \
    .config('spark.debug.maxToStringFields', '50') \
    .getOrCreate()
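To avoid this class of mistake entirely, one option is to coerce every config value to a string before handing it to the builder. This is a minimal sketch, and stringify_configs is a hypothetical helper, not part of PySpark:

```python
# Hypothetical helper (not part of PySpark): coerce all config values to
# strings before passing them to SparkSession.builder.config, so numeric
# values such as spark.debug.maxToStringFields cannot slip through as ints.
def stringify_configs(configs):
    return {key: str(value) for key, value in configs.items()}

configs = stringify_configs({
    "spark.some.config.option": "some-value",
    "spark.debug.maxToStringFields": 50,  # int here, string after coercion
})
```

The resulting dict can then be applied in a loop of .config(key, value) calls on the builder.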
(2) When processing data, dirty records that do not match the expected per-line format will also trigger this kind of error. Be sure to filter the data strictly with filter before mapping:
dataUrlDecode = dataInit.filter(
    lambda x: ('json_data=' in x) and ('client_id' in x) and ('url' in x) and ('id' in x)).map(
    lambda x: data_urldecode(x))
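The same filter-then-map pattern can be sketched in plain Python, without Spark. Both the sample log-line format and the data_urldecode implementation below are assumptions for illustration; the original helper is not shown in this post:

```python
from urllib.parse import unquote_plus

# Hypothetical stand-in for the data_urldecode helper used above: pull the
# URL-encoded JSON payload out of a raw log line.
def data_urldecode(line):
    payload = line.split("json_data=", 1)[1]
    return unquote_plus(payload)

# Assumed sample input: one well-formed line, one dirty record.
raw_lines = [
    "ts=1&json_data=%7B%22client_id%22%3A%22a1%22%2C%22url%22%3A%22%2Fhome%22%2C%22id%22%3A1%7D",
    "garbage line without the expected fields",
]

# Filter strictly for every required marker before decoding, mirroring the
# RDD filter(...).map(...) chain above.
required = ("json_data=", "client_id", "url", "id")
decoded = [data_urldecode(x) for x in raw_lines
           if all(k in x for k in required)]
```

Running data_urldecode on the dirty record directly would raise an IndexError; the strict filter is what keeps the map step from blowing up on malformed lines.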