1. Submitting a Spark job with a virtual environment (installed packages)
Create an Anaconda virtual environment (e.g. py37) and install the packages you need:
source activate
conda activate py37
pip install package
- In the Spark installation directory there is a spark-env.sh file, e.g.:
/opt/spark/spark-2.1.1-bin-hadoop2.7/conf/spark-env.sh
- In it, add PYSPARK_PYTHON to the environment, pointing at the env's interpreter, e.g.:
export PYSPARK_PYTHON=/home/q/anaconda/envs/py37/bin/python
- Submit the job:
/home/q/spark-2.3.3-bin-hadoop2.7/bin/spark-submit --files /home/q/spark-2.3.3-bin-hadoop2.7/conf/hive-site.xml --executor-memory 18G --num-executors 30 --executor-cores 4 spark_test.py
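If editing spark-env.sh is inconvenient, the interpreter can also be chosen per job through Spark's `spark.pyspark.python` property (available since Spark 2.1). A sketch, reusing the paths from above:

```shell
# Point this one submission at the conda env's interpreter,
# instead of exporting PYSPARK_PYTHON globally in spark-env.sh.
/home/q/spark-2.3.3-bin-hadoop2.7/bin/spark-submit \
  --conf spark.pyspark.python=/home/q/anaconda/envs/py37/bin/python \
  --files /home/q/spark-2.3.3-bin-hadoop2.7/conf/hive-site.xml \
  --executor-memory 18G --num-executors 30 --executor-cores 4 \
  spark_test.py
```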
2. Spark initialization
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window
from pyspark.sql.types import StructField, StringType, StructType, DoubleType, LongType, DateType, ArrayType
spark = SparkSession \
.builder \
.enableHiveSupport() \
.appName('test') \
.getOrCreate()
3. pyspark.sql.utils.AnalysisException: 'Table or view not found
When submitting, add the parameter
--files /home/q/spark-2.3.3-bin-hadoop2.7/conf/hive-site.xml
so Spark can find the Hive metastore configuration.
4. Multi-line Spark SQL
Write the SQL inside triple double quotes ("""), which lets the statement span multiple lines:
cities = spark.sql("""select *
from db.table
where dt = 'xxx'
""")
# cache in memory (count() forces evaluation)
cities.cache().count()
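Triple-quoted strings are plain Python: the embedded newlines are preserved, so the SQL text reaches Spark exactly as written. A minimal standalone illustration (no Spark needed):

```python
# A triple-quoted string keeps its line breaks, which is why
# multi-line SQL written this way parses correctly.
query = """select *
from db.table
where dt = 'xxx'
"""
print(query.count("\n"))  # → 3
```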
5. Changing Spark's temporary file directory
In $SPARK_HOME/conf/spark-env.sh, add:
export SPARK_LOCAL_DIRS=xxx
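The scratch directory can also be set per job via the `spark.local.dir` property instead of editing spark-env.sh (note that SPARK_LOCAL_DIRS, if set, takes precedence). A sketch with a hypothetical path:

```shell
# Put Spark's scratch/shuffle files on a larger disk for this job only
# (/data/spark_tmp is a placeholder path)
spark-submit --conf spark.local.dir=/data/spark_tmp spark_test.py
```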
6. Total size of serialized results of 16 tasks is bigger than spark.driver.maxResultSize
Set it at submit time:
spark-submit --conf spark.driver.maxResultSize=4g
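The same property can also be set when building the SparkSession, before any action runs; a sketch mirroring the session setup from section 2 (session-startup config, so it is shown here without output):

```python
from pyspark.sql import SparkSession

# Raise the driver-side result-size cap at session creation time
# instead of on the spark-submit command line.
spark = SparkSession \
    .builder \
    .config('spark.driver.maxResultSize', '4g') \
    .enableHiveSupport() \
    .appName('test') \
    .getOrCreate()
```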