1. Version conflicts when running Wang Zhe's https://github.com/wzhe06/SparrowRecSys code with my own setup
Kept hitting Exception in thread "main" java.lang.IllegalArgumentException: Unsupported class file major version 55. Searching online said it is a version mismatch, and it turned out the Java versions in the IDEA settings were inconsistent: the project was set to 11 (File -> Project Structure -> Project Settings -> Project) while IntelliJ IDEA itself used 8. Setting both to 1.8 solved it completely.
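For reference, the "major version" in this error comes straight out of the .class file header (bytes 6-7), and each number maps to a JDK release: 52 is Java 8, 55 is Java 11. A minimal stdlib sketch to check which JDK a class file targets:

```python
import struct

# Class-file major versions and the JDK releases they correspond to
# (from the JVM specification): 52 -> Java 8, ..., 55 -> Java 11.
JDK_BY_MAJOR = {52: "Java 8", 53: "Java 9", 54: "Java 10", 55: "Java 11"}

def class_file_jdk(header: bytes) -> str:
    """Return the JDK release a .class file header was compiled for."""
    # Header layout: u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version.
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return JDK_BY_MAJOR.get(major, f"major {major}")
```

So "major version 55" means the class was compiled for Java 11, which an 8-level runtime cannot load.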
2. Error setting up a Spark environment with Maven
Error:(3, 12) object apache is not a member of package org
import org.apache.spark.SparkConf
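The notes stop at the error here; in general this compile error means the Spark dependency never made it onto the Maven classpath. A hedged sketch of the pom.xml fragment to add (the version numbers are assumptions, mirroring the spark 2.3.0 / Scala 2.11 build from the next item):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
```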
3. Errors compiling Spark
(1) Spark Project Parent POM ........................... FAILURE
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-clean-plugin:3.0.0:clean (default-clean) on project spark-parent_2.11: Execution default-clean of goal org.apache.maven.plugins:maven-clean-plugin:3.0.0:clean failed: Plugin org.apache.maven.plugins:maven-clean-plugin:3.0.0 or one of its dependencies could not be resolved: Could not transfer artifact org.codehaus.plexus:plexus-component-annotations:jar:1.5.5 from/to central (https://repo.maven.apache.org/maven2): transfer failed for https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-component-annotations/1.5.5/plexus-component-annotations-1.5.5.jar: Operation timed out (Read failed) ->
Solution: before compiling, first run on the command line: mvn clean -Dmaven.clean.failOnError=false
(2)[INFO] Spark Project Launcher ............................. FAILURE
[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.3.0: Could not find artifact org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.0 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
Solution: add the Cloudera repository to pom.xml (inside the <repositories> section):
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
4. Error submitting in Spark standalone mode
Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Solution: this is a Python version issue; configure PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in spark-env.sh under the conf directory. My machine runs Python 3.7; locate the install path with which python3.7 and copy the printed path. Mine is /Users/hh/anaconda3/bin/python3.7, so add the following lines to spark-env.sh:
PYSPARK_PYTHON=/Users/hh/anaconda3/bin/python3.7
PYSPARK_DRIVER_PYTHON=/Users/hh/anaconda3/bin/python3.7
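The check Spark is complaining about can be sketched roughly as follows (a simplified illustration, not Spark's actual code; the function names are hypothetical):

```python
import os
import sys

def effective_worker_python() -> str:
    # PySpark launches workers with PYSPARK_PYTHON when it is set;
    # otherwise it falls back to the interpreter that started the driver.
    return os.environ.get("PYSPARK_PYTHON", sys.executable)

def minor_versions_match(driver: str, worker: str) -> bool:
    # Driver and worker must agree on major.minor:
    # "2.7" vs "3.7" fails, while "3.7.4" vs "3.7.9" is fine.
    return driver.split(".")[:2] == worker.split(".")[:2]
```

Pointing both environment variables at the same interpreter guarantees the two sides agree.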
Another error in standalone mode:
21/06/05 16:07:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Solution: the warning says resources are insufficient. The cluster UI showed Memory in use: 7.0 GB Total, 1024.0 MB Used; a PySpark shell was also running on the same machine, and stopping it fixed the problem.
5. Error submitting in Spark YARN mode
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Solution: the official docs say to ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster, so configure it in spark-env.sh under the conf directory:
HADOOP_CONF_DIR=/Users/hh/app/hadoop-2.6.0-cdh5.15.1/etc/hadoop
6. NLP: PyTorch data.Field
torch version: 1.10.1, spacy version: 3.2.1, torchtext version: 0.11.1
import torch
from torchtext.legacy import data
from spacy.lang import en
TEXT = data.Field(tokenize='spacy', tokenizer_language='en')
Error: Can't find model 'en'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0.
Solution: pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz