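I'm trying to create a udf that checks whether a name string is all uppercase or all lowercase. Why doesn't it produce what I expect? For example: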
def check_case(name):
    if name.isupper(): check = "yes"
    else: check = "no"
    return check

my_udf = udf(lambda x: check_case(name), StringType())
df.withColumn("casecheck", my_udf(col("firstName"))).select("firstName", "casecheck").show()
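The output below is clearly wrong. I also tried islower() and istitle(), and those gave wrong results too (every record comes back as either all yes or all no). Any idea why this doesn't work inside a udf?
Thanks!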
+---------+---------+
|firstName|casecheck|
+---------+---------+
| GRETCHEN|       no|
|   IFswkG|       no|
|    April|       no|
I also tried this:
def check_case(name):
    if name.isupper(): check = "yes"
    else: check = "no"
    return check

my_udf = udf(check_case, StringType())
df.withColumn("casecheck", my_udf("firstName")).select("firstName", "casecheck").show()
Now I get this error:
Py4JJavaError: An error occurred while calling o1046.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 385.0 failed 4 times, most recent failure: Lost task 0.3 in stage 385.0 (TID 9580, ip-10-22-10-102.ec2.internal, executor 32): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/zeppelin/appcache/application_1598626762284_0001/container_1598626762284_0001_01_000061/pyspark.zip/pyspark/worker.py", line 377, in main
process()
File "/mnt/yarn/usercache/zeppelin/appcache/application_1598626762284_0001/container_1598626762284_0001_01_000061/pyspark.zip/pyspark/worker.py", line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/zeppelin/appcache/application_1598626762284_0001/container_1598626762284_0001_01_000061/pyspark.zip/pyspark/worker.py", line 248, in
func = lambda _, it: map(mapper, it)
File "", line 1, in
File "/mnt/yarn/usercache/zeppelin/appcache/application_1598626762284_0001/container_1598626762284_0001_01_000061/pyspark.zip/pyspark/worker.py", line 85, in
return lambda *a: f(*a)
File "/mnt/yarn/usercache/zeppelin/appcache/application_1598626762284_0001/container_1598626762284_0001_01_000061/pyspark.zip/pyspark/util.py", line 113, in wrapper
return f(*args, **kwargs)
File "", line 5, in check_case
AttributeError: 'NoneType' object has no attribute 'isupper'
Further edit:
def check_case(name):
    if name != None and name.isupper(): check = "yes"
    elif name != None and name.islower(): check = "no"
    else: check = None
    return check

my_udf = udf(check_case, StringType())
df.withColumn("casecheck", my_udf("firstName")).select("firstName", "casecheck").show()
The output is:
+---------+---------+
|firstName|casecheck|
+---------+---------+
| GRETCHEN|      yes|
| GRETCHEN|      yes|
| GRETCHEN|      yes|
| Christos|     null|
|   IFswkG|     null|
|    April|     null|
|  MATTHEW|      yes|
|     riUj|     null|
|    HARRY|      yes|
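Solution
First, your lambda takes x but passes name to check_case, so the column value is never used. You can simply give the function itself to udf; no lambda is needed.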
my_udf = udf(check_case, StringType())
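To see why the lambda version returned the same value for every row, here is a minimal pure-Python sketch (no Spark required). It assumes that a variable called `name` already existed in the notebook session, which is only a guess; without one, the lambda would have raised a NameError instead. The value "leftover" and the names `broken` and `fixed` are made up for illustration.

```python
def check_case(name):
    # Same logic as the first attempt in the question.
    return "yes" if name.isupper() else "no"

# Assumption: an unrelated global called `name` was left over in the session.
name = "leftover"

# The lambda receives x but never uses it; it looks up the global `name` instead,
# so every row is checked against the same string.
broken = lambda x: check_case(name)

# Passing the argument through (or passing check_case directly) uses the row value.
fixed = lambda x: check_case(x)

rows = ["GRETCHEN", "IFswkG", "April"]
broken_results = [broken(r) for r in rows]
fixed_results = [fixed(r) for r in rows]

print(broken_results)  # the same answer for every row
print(fixed_results)   # the answer depends on each row
```

Because `udf(check_case, StringType())` hands each column value straight to the function, the stray global never comes into play.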
Inside your function, you need to handle None as well as the isupper/islower conditions, for example:
def check_case(name):
    # Null column values reach the udf as None, so check for that first.
    if name is not None and (name.isupper() or name.islower()):
        check = "yes"
    else:
        check = "no"
    return check
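The function can be checked in plain Python before it is wrapped in a udf. This is a small sketch using names from the question's output, plus a made-up all-lowercase value ("harry") and a None standing in for a null row:

```python
def check_case(name):
    # The None check guards against null rows; without it, None.isupper()
    # raises the AttributeError shown in the question's traceback.
    if name is not None and (name.isupper() or name.islower()):
        return "yes"
    return "no"

# "harry" and None are illustrative additions, not taken from the question's data.
samples = ["GRETCHEN", "IFswkG", "April", "riUj", "harry", None]
results = [check_case(s) for s in samples]

for s, r in zip(samples, results):
    print(repr(s), "->", r)
```

Note that this version answers "yes" for both all-uppercase and all-lowercase names and "no" for everything else, including nulls.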
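Alternatively, you can get a simpler and more efficient solution by building the column from built-in functions like this (a udf can be more expensive):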
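from pyspark.sql.functions import col, when, upper, lower

first = col("firstName")
# Column has no isupper()/islower(); compare against upper()/lower() instead,
# and use isNotNull() because `!= None` does not test for null in Spark.
(df.withColumn("casecheck",
               when(first.isNotNull() & ((first == upper(first)) | (first == lower(first))), "yes")
               .otherwise("no"))
   .select("firstName", "casecheck").show())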
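One caveat if the built-in-function approach is written by comparing the column with its `upper()` or `lower()` form (Column objects have no `isupper()` method of their own): that comparison is not quite the same test as Python's `str.isupper()`, which also requires at least one cased character. The plain-Python sketch below imitates both checks; the helper names and sample values are hypothetical, and it does not run any Spark code.

```python
def single_case_like_udf(s):
    # Mirrors the udf: str.isupper()/islower() need at least one cased character.
    return s is not None and (s.isupper() or s.islower())

def single_case_by_comparison(s):
    # Mirrors comparing a value with its uppercased/lowercased form.
    return s is not None and (s == s.upper() or s == s.lower())

# Illustrative values only; "1234" and "" contain no letters at all.
samples = ["GRETCHEN", "April", "1234", "", None]
udf_style = [single_case_like_udf(s) for s in samples]
comparison_style = [single_case_by_comparison(s) for s in samples]

print(udf_style)
print(comparison_style)
```

The two checks agree on ordinary names and differ only on strings without letters, so whether this matters depends on what the firstName column can contain.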