jieba Source Code Study Notes (2): Python 2/3 Compatibility

keineahnung2345

Article link: https://blog.csdn.net/keineahnung2345/article/details/86604151


Preface
The _compat.py file
References

Preface

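The main program of jieba is __init__.py, which defines cut, cut_for_search and the other word segmentation functions.
Before turning to those functions, these notes look at _compat.py, the file that handles compatibility between Python 2 and Python 3.
It defines get_module_res, strdecode, resolve_filename and other helpers used when reading files, and __init__.py calls them frequently.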

The _compat.py file

_compat.py defines the functions used when reading the dictionary and takes care of the differences between Python 2 and Python 3.

The get_module_res function

try:
    import pkg_resources
    get_module_res = lambda *res: pkg_resources.resource_stream(__name__,
                                                                os.path.join(*res))
except ImportError:
    get_module_res = lambda *res: open(os.path.normpath(os.path.join(
                            os.getcwd(), os.path.dirname(__file__), *res)), 'rb')

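The code above touches on several points, covered one by one below:

function_name(*res)
See What does ** (double star/asterisk) and * (star/asterisk) do for parameters?
If a function has a parameter prefixed with *, it can be called with any number of positional arguments, which are collected into a tuple.

Here is an example of a function that multiplies all of its arguments: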
```
def multiply(*nums):
    # print(type(nums)) # <class 'tuple'>
    # print(nums) # (1, 5, 2, 7)
    product = 1
    for num in nums:
        product*=num
    return product

print(multiply(1,5,2)) #10
print(multiply(1,5,2,7)) #70
```
In the example above, multiply returns the correct result no matter how many arguments are passed in.
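The same syntax is what lets get_module_res accept one path component or several. The sketch below is not jieba code: join_parts is a hypothetical helper that only mirrors how the lambda collects *res and then unpacks it again into os.path.join.

```python
import os


def join_parts(*res):
    # Inside the function, res is a tuple of every positional argument.
    # Writing *res in the call spreads that tuple back into separate arguments.
    return os.path.join(*res)


single = join_parts("dict.txt")                 # one component
nested = join_parts("finalseg", "prob_start.p")  # two components

print(single)
print(nested)
```

This is why jieba/__init__.py can call get_module_res with a single file name while finalseg and posseg pass a subdirectory plus a file name.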
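__name__, __file__
According to Two double underscore variables and What does if __name__ == "__main__": do?, the __file__ variable holds the path of the current .py file, while __name__ holds the name of the module as Python imported it. The value of __name__ depends on how the module is loaded.

As a small experiment, add the following two lines to jieba/_compat.py: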
```
print(__file__)
print(__name__)
```
Then delete the __pycache__ folder.
- If _compat.py is run directly, it prints:
  
  D:/D_Document/Github/jieba/jieba/_compat.py
  __main__
- If jieba/__init__.py imports jieba/_compat.py with the following statement:
```
from ._compat import *
```
  it prints:
  
  D:\xxx\xxx\jieba\jieba\_compat.py
  jieba._compat
The experiment shows that when _compat.py is run as the main program, __name__ is __main__; when it is imported as a module by another program, __name__ is jieba._compat.
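The import case can be reproduced without editing jieba. The sketch below builds a throwaway package in a temporary directory; the package name demo_pkg and the NAME and FILE variables are made up for illustration and are not part of jieba.

```python
import importlib
import os
import sys
import tempfile

# Create demo_pkg/__init__.py and demo_pkg/_compat.py in a temporary directory.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg_dir, "_compat.py"), "w") as f:
    # The module records its own __name__ and __file__ at import time.
    f.write("NAME = __name__\nFILE = __file__\n")

# Make the temporary directory importable, then import the submodule.
sys.path.insert(0, root)
importlib.invalidate_caches()
mod = importlib.import_module("demo_pkg._compat")

print(mod.NAME)                    # dotted name: package, then module
print(os.path.basename(mod.FILE))  # file the module was loaded from
```

Because the module is imported rather than run, its __name__ is the dotted path that starts with the package name, which is the form pkg_resources.resource_stream expects as its first argument.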
pkg_resources.resource_stream
Here is the description of pkg_resources.resource_stream, taken from Package Discovery and Resource Access using pkg_resources:

resource_stream(package_or_requirement, resource_name)：
Return a readable file-like object for the specified resource;
it may be an actual file, a StringIO, or some similar object.
The stream is in “binary mode”,
in the sense that whatever bytes are in the resource will be read as-is.

Also from Package Discovery and Resource Access using pkg_resources:

In the following methods, the package_or_requirement argument may be either a
Python package/module name (e.g. foo.bar) or a Requirement instance.
If it is a package or module name, the named module or package must be importable
(i.e., be in a distribution or directory on sys.path),
and the resource_name argument is interpreted relative to the named package.
Note that if a module name is used, then the
resource name is relative to the package immediately containing the named module.

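pkg_resources.resource_stream takes two parameters, package_or_requirement and resource_name. If package_or_requirement is a module name, the function resolves resource_name relative to the package that contains that module, then opens the resource and returns it as a file object. In get_module_res the first argument is __name__, which is jieba._compat when the module is imported, so resources are looked up relative to the jieba package directory.
os.path.normpath
From the Python documentation for os.path (Common pathname manipulations):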

os.path.normpath(path)
Normalize a pathname by collapsing redundant separators
and up-level references so that A//B, A/B/, A/./B and A/foo/../B all become A/B.
This string manipulation may change the meaning of a path that contains symbolic links.
On Windows, it converts forward slashes to backward slashes.
To normalize case, use normcase().

It returns path in normalized form: redundant / or \ separators are removed, and . and .. references are collapsed.
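As a quick illustration, the sketch below normalizes a deliberately messy relative path. The path components are made up, and the result is compared against os.path.join so the check holds on both POSIX and Windows.

```python
import os

# A path with a doubled separator, a "." and a ".." component.
messy = "jieba/finalseg//./../dict.txt"
clean = os.path.normpath(messy)

# The separator in the result depends on the platform,
# so the expected value is built with os.path.join.
expected = os.path.join("jieba", "dict.txt")

print(clean)
print(clean == expected)
```

In the fallback branch of get_module_res, normpath plays the same role: it tidies the path built from os.getcwd(), os.path.dirname(__file__) and *res before the file is opened.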

Next, look at how get_module_res is called. Three files call it: jieba/__init__.py, plus jieba/finalseg/__init__.py and jieba/posseg/__init__.py, which later notes cover.

In jieba/__init__.py:

DEFAULT_DICT_NAME = "dict.txt"

def get_dict_file(self):
    if self.dictionary == DEFAULT_DICT:
        return get_module_res(DEFAULT_DICT_NAME)
    else:
        return open(self.dictionary, 'rb')

As the code shows, get_module_res(DEFAULT_DICT_NAME) serves the same purpose as open(self.dictionary, 'rb'): both open the dictionary file in binary mode and return a file object.

In jieba/finalseg/__init__.py:

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"

def load_model():
    start_p = pickle.load(get_module_res("finalseg", PROB_START_P))
    trans_p = pickle.load(get_module_res("finalseg", PROB_TRANS_P))
    emit_p = pickle.load(get_module_res("finalseg", PROB_EMIT_P))
    return start_p, trans_p, emit_p

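load_model opens the three .p files with get_module_res and returns the three objects that pickle.load deserializes from them.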
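pickle.load accepts any binary file-like object, which is why the stream returned by get_module_res can be passed to it directly. In the minimal sketch below, an io.BytesIO stands in for that stream, and the dictionary contents are made-up placeholder values, not the real contents of prob_start.p.

```python
import io
import pickle

# Placeholder data standing in for a start-probability table (values invented).
start_p_original = {"B": -0.5, "E": -2.0, "M": -3.0, "S": -1.5}

# A binary in-memory stream, playing the role of get_module_res(...).
stream = io.BytesIO(pickle.dumps(start_p_original))

# pickle.load reads bytes from the stream and rebuilds the Python object.
start_p = pickle.load(stream)

print(type(start_p))
print(start_p == start_p_original)
```

The value that comes back is the deserialized object itself (here a dict), not a file handle.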

In jieba/posseg/__init__.py:

PROB_START_P = "prob_start.p"
PROB_TRANS_P = "prob_trans.p"
PROB_EMIT_P = "prob_emit.p"
CHAR_STATE_TAB_P = "char_state_tab.p"

def load_model():
    # For Jython
    start_p = pickle.load(get_module_res("posseg", PROB_START_P))
    trans_p = pickle.load(get_module_res("posseg", PROB_TRANS_P))
    emit_p = pickle.load(get_module_res("posseg", PROB_EMIT_P))
    state = pickle.load(get_module_res("posseg", CHAR_STATE_TAB_P))
    return state, start_p, trans_p, emit_p

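The code closely resembles the one in finalseg/__init__.py. It returns the four objects deserialized from the four .p files.

Unifying function names across Python 2/3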

PY2 = sys.version_info[0] == 2

default_encoding = sys.getfilesystemencoding()

if PY2:
    text_type = unicode
    string_types = (str, unicode)

    iterkeys = lambda d: d.iterkeys()
    itervalues = lambda d: d.itervalues()
    iteritems = lambda d: d.iteritems()

else:
    text_type = str
    string_types = (str,)
    xrange = range

    iterkeys = lambda d: iter(d.keys())
    itervalues = lambda d: iter(d.values())
    iteritems = lambda d: iter(d.items())

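The code first checks whether the interpreter is Python 2 and stores the result in the variable PY2.
It then defines text_type, string_types and the other names according to the Python version.

In Python 3, xrange was replaced by range, and dictionaries are iterated differently than in Python 2.
The code therefore defines several names (xrange, iterkeys, itervalues, iteritems) that point to the corresponding Python 3 functionality, so the rest of jieba can use the Python 2 style on both versions.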
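A minimal sketch of how calling code benefits from these aliases. It repeats only the PY2, xrange and iteritems definitions from the excerpt above; the freq dictionary is made-up sample data.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    iteritems = lambda d: d.iteritems()
else:
    xrange = range  # give Python 3 the Python 2 name
    iteritems = lambda d: iter(d.items())

# Sample data, invented for this example.
freq = {"jieba": 3, "nlp": 5}

# The loop below is written once and runs unchanged on either version.
total = 0
for word, count in iteritems(freq):
    total += count

squares = [i * i for i in xrange(4)]

print(total)
print(squares)
```

The calling code never has to test the Python version itself; only _compat.py does.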

The strdecode function

if PY2:
    text_type = unicode
    #...
else:
    text_type = str
    #...
    
def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
            sentence = sentence.decode('gbk', 'ignore')
    return sentence

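Under Python 3, if sentence is not a str but a bytes object, it is decoded as UTF-8. If that fails, it is decoded as GBK (a Simplified Chinese encoding) instead, and bytes that cannot be decoded are ignored.
In short, the function makes sure sentence is a text string before returning it.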
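Both decoding paths can be exercised with a short example. The sketch below assumes Python 3, so text_type is set to str directly; the sample strings are chosen only for illustration.

```python
text_type = str  # the Python 3 branch of _compat.py


def strdecode(sentence):
    if not isinstance(sentence, text_type):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
            sentence = sentence.decode('gbk', 'ignore')
    return sentence


utf8_bytes = "結巴分詞".encode("utf-8")  # decodes on the first attempt
gbk_bytes = "中文".encode("gbk")         # not valid UTF-8, so the GBK fallback runs

from_utf8 = strdecode(utf8_bytes)
from_gbk = strdecode(gbk_bytes)
unchanged = strdecode("already text")   # str input is returned untouched

print(from_utf8, from_gbk, unchanged)
```

All three calls return a str, so the caller can treat the input uniformly regardless of how it was encoded.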

The resolve_filename function

Calling f = open('xxx.txt', 'r') returns an object f of type _io.TextIOWrapper.
resolve_filename takes such a file object as its argument and returns its file name.

def resolve_filename(f):
    try:
        return f.name
    except AttributeError:
        return repr(f)

Example:
Open a file:

f = open('a.txt', 'r')

Get the file name:

f.name # 'a.txt'

If the name attribute does not exist, fall back to:

repr(f) # "<_io.TextIOWrapper name='a.txt' mode='r' encoding='UTF-8'>"
# str(f) gives the same result
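The fallback branch is easiest to see with an object that has no name attribute. The sketch below uses io.BytesIO for that case and a temporary file for the normal case; both are illustrative choices, not taken from jieba.

```python
import io
import tempfile


def resolve_filename(f):
    try:
        return f.name
    except AttributeError:
        return repr(f)


# Normal case: a real file on disk exposes its path through .name.
with tempfile.NamedTemporaryFile(suffix=".txt") as real_file:
    from_file = resolve_filename(real_file)
    real_path = real_file.name

# Fallback case: an in-memory stream has no .name, so repr() is used instead.
memory_stream = io.BytesIO(b"data")
from_stream = resolve_filename(memory_stream)

print(from_file)
print(from_stream)
```

Either way the caller gets a string that describes where the data came from.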

References

What does ** (double star/asterisk) and * (star/asterisk) do for parameters?
Two double underscore variables
What does if __name__ == "__main__": do?
Package Discovery and Resource Access using pkg_resources
os.path: Common pathname manipulations
