- Pandas ValueError: setting an array element with a sequence
Originally I tried to process a DataFrame row by row with np.vectorize() and return several new fields, which raised the error ValueError: setting an array element with a sequence.
import numpy as np
import pandas as pd

def test():
    arr = np.random.randn(4, 4)
    cols = ['a', 'b', 'c']
    df = pd.DataFrame(data=arr, columns=['e', 'f', 'g', 'h'])
    def func(a, b, c):
        output1 = a + 1
        output2 = b * 2
        output3 = c - 4
        return pd.Series([output1, output2, output3])
    vfunc = np.vectorize(func)
    df[cols] = vfunc(df['e'], df['f'], df['g'])  # ValueError raised here
    print(df)

test()
The error occurs because the shape of the assignment target df[cols] does not match the shape of what vfunc returns: there is a shape mismatch between the returned result and the DataFrame columns being assigned. The fix is to use apply with result_type="expand", which expands the returned values into columns, so each value in the returned tuple becomes a value in one column of the result DataFrame. In apply(func), the number of values func returns must equal the number of columns in df[cols].
import numpy as np
import pandas as pd

def test():
    arr = np.random.randn(4, 4)
    cols = ['a', 'b', 'c']
    df = pd.DataFrame(data=arr, columns=['e', 'f', 'g', 'h'])
    def func(row):
        a, b, c = row['e'], row['f'], row['g']
        output1 = a + 1
        output2 = b * 2
        output3 = c - 4
        return output1, output2, output3
    df[cols] = df.apply(func, axis=1, result_type="expand")
    print(df)

test()
Output:
e f g h a b c
0 0.493280 -0.092513 -3.014135 -0.361842 1.493280 -0.185027 -7.014135
1 0.300695 -0.745392 0.591653 -1.752471 1.300695 -1.490785 -3.408347
2 -0.033944 -1.556307 -0.359979 1.808213 0.966056 -3.112615 -4.359979
3 0.701741 -0.272337 0.041114 0.150049 1.701741 -0.544674 -3.958886
Note that for a single column,
df['id']
and
ID = ['id']
df[ID]
give different results: the former is a 1-D Series, e.g. [1, 2, 3, 4], while the latter (indexing with a list of labels) is a one-column DataFrame, e.g. [[1], [2], [3], [4]].
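The Series-vs-DataFrame distinction can be checked directly; a minimal sketch with a made-up 'id' column:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4]})

s = df['id']      # single label  -> 1-D Series
sub = df[['id']]  # list of labels -> 2-D DataFrame

print(type(s).__name__, s.shape)      # Series (4,)
print(type(sub).__name__, sub.shape)  # DataFrame (4, 1)
```

This is why `df[cols] = ...` with a list of columns expects a 2-D result on the right-hand side.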
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
- Spark errors
Pitfalls hit while configuring jupyter + spark.
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
This happens because the local Python environment differs from the Python version used by the Spark cluster; they must be changed to the same Python version.
Solution:
(1) Check the local jupyter kernel versions
jupyter kernelspec list
Also check the Python version from inside the jupyter UI; note that a kernel labeled python 2 in the UI may actually be python 3 (a real trap):
import sys
print(sys.version)
(2) Configure the environment variables
# Python environment used by pyspark
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip 1.1.1.1 --port 8088 --log-level 10 py_spark"
(3) Restart jupyter
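The same variables can also be set from inside Python, as long as it happens before any pyspark import starts the JVM; the interpreter path below is an assumption for illustration, not necessarily your cluster's path:

```python
import os

# Must run before any pyspark import so driver and workers agree.
# The path below is a placeholder; point it at the interpreter your
# cluster workers actually use.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python'

print(os.environ['PYSPARK_PYTHON'])
```

This is handy in a notebook where editing shell profiles and restarting jupyter is inconvenient.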
Update 2022-05-25
1. Installing pyspark on Linux
java8+python3.6+spark-3.2.0-bin-hadoop3.2
Reference links:
https://blog.csdn.net/js010111/article/details/122755433
https://blog.csdn.net/qq_42363032/article/details/115098416
Configure the environment variables (vi ~/.bashrc, then source ~/.bashrc):
export JAVA_HOME='/root/tools/jdk1.8.0_321'
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib/
export SPARK_PYTHON=/usr/bin/python3
export SPARK_HOME=/root/tools/spark-3.2.0-bin-hadoop3.2
export PYTHON_HOME=/root/xx/Python-3.6.8
export PATH=$PYTHON_HOME/bin:$PATH
export SPARK_PYTHON=$PYTHON3_PATH
jupyter configuration:
c.NotebookApp.allow_remote_access = True
c.NotebookApp.ip = "*"
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
c.NotebookApp.notebook_dir = "/root/workspace"
c.NotebookApp.allow_root = True
c.NotebookApp.token = 'DEEPlearning+688'
2. The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it.
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
import findspark
findspark.init('C:/spark')
py4j-0.10.9.2-src.zip
https://stackoverflow.com/questions/31841509/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-po
3. Java gateway process exited before sending its port number
import os
os.environ['JAVA_HOME'] = '//root/tools/jdk1.8.0_321'
Solution: https://blog.csdn.net/hejp_123/article/details/106784906
Test in jupyter:
import os
os.environ['JAVA_HOME'] = '//root/tools/jdk1.8.0_321'
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("my_app_spark") \
    .getOrCreate()
spark.sql("select 1").show()
5. Installing MySQL
Uninstalling:
https://blog.csdn.net/weixin_45525272/article/details/107774348
Delete MySQL's data files:
sudo rm /var/lib/mysql/ -r
Delete MySQL's config files:
sudo rm /etc/mysql/ -r
https://blog.csdn.net/leacock1991/article/details/110406708
https://www.yisu.com/ask/4053.html