Problem:
After packaging jieba as a zip and uploading it to Spark, running TF-IDF from the jieba.analyse package fails with:
IOError: [Errno 20] Not a directory: 'XXXX/jieba.zip/jieba/analyse/idf.txt'
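The failure is not specific to Spark and can be reproduced locally. The following is a minimal sketch, assuming a hypothetical package named demo_pkg with a made-up idf.txt. It shows that a path pointing inside a zip archive cannot be opened with open(), while a package-aware loader can read the same file. pkgutil.get_data is used here as a standard-library stand-in and is not necessarily the mechanism the jieba patch relies on.

```python
import os
import pkgutil
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create demo.zip holding a hypothetical package with one data file."""
    zip_path = os.path.join(directory, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("demo_pkg/__init__.py", "")
        zf.writestr("demo_pkg/idf.txt", "word 1.5\n")  # made-up contents
    return zip_path


def read_with_open(zip_path):
    """Mimic the failing approach: treat the zip archive as a directory."""
    path = os.path.join(zip_path, "demo_pkg", "idf.txt")
    with open(path, "rb") as f:
        return f.read()


def read_with_pkgutil(zip_path):
    """Read the same file through the import system's loader instead."""
    sys.path.insert(0, zip_path)
    try:
        return pkgutil.get_data("demo_pkg", "idf.txt")
    finally:
        sys.path.remove(zip_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = build_demo_zip(tmp)
        try:
            read_with_open(archive)
        except OSError as exc:
            # The exact error message depends on the platform.
            print("open() failed:", exc)
        print(read_with_pkgutil(archive))
```

The point of the sketch is only that the data file has to be requested from the package loader rather than located by joining paths onto the module's directory.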
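Solution:
idf.txt sits inside the zip archive, so it cannot be opened through an ordinary filesystem path. Modify tfidf.py in the analyse package so that it loads the file via get_module_res, as follows (code adapted from https://github.com/fxsjy/jieba/pull/539/files):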
# encoding=utf-8
from __future__ import absolute_import
import os
import jieba
import jieba.posseg
from operator import itemgetter
from .._compat import get_module_res
_get_abs_path = jieba._get_abs_path
DEFAULT_IDF = "analyse/idf.txt"
class KeywordExtractor(object):
    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
        "by", "be", "as", "on", "with", "can", "if", "from"