Chinese Word Segmentation in Python with jieba (结巴分词)

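This article shows how to segment Chinese text with the jieba library in Python, using full mode. It then builds an inverted index over the resulting words and demonstrates multi-word queries and phrase queries.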

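Use jieba to segment Chinese text in full mode, build an inverted index of the words, and implement general multi-word queries and phrase queries.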

# -*- coding: utf-8 -*-
import jieba
'''
Created on 2015-11-23
'''

def word_split(text):
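    """
    Split a text into words using jieba's full mode. Returns a list of
    (index, word) tuples, where index is the ordinal position of the word
    in the token sequence (not a byte offset).
    """
    word_list = []
    windex = 0
    # cut_all=True selects full mode, which yields every candidate word
    word_primitive = jieba.cut(text, cut_all=True)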
    for word in word_primitive:
        if len(word) > 0:
            word_list.append((windex, word))
            windex += 1
    return word_list

def inverted_index(text):
    """
    Create an Inverted-Index of the specified text document.
        {word:[locations]}
    """
    inverted = {}
    for index, word in word_split(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)
    return inverted
    

def inverted_index_add(inverted, doc_id, doc_index):
    """
    Add the Inverted-Index doc_index of the document doc_id to the
    Multi-Document Inverted-Index (inverted),
    using doc_id as document identifier.
        {word:{doc_id:[locations]}}
    """
    # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    for word, locations in doc_index.items():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted

def search_a_word(inverted, word):
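    """
    Search for a single word. Returns {doc_id: [locations]}, or None
    if the word is not in the index.
    """
    # Decode byte strings only; text (unicode) strings are used as-is
    if isinstance(word, bytes):
        word = word.decode('utf-8')
    return inverted.get(word)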
    
def search_words(inverted, wordList):
    """
    search more than one word
    """
    wordDic = []
    docRight = []
    for word in wordList:
        # Decode byte strings only; text (unicode) strings are used as-is
        if isinstance(word, bytes):
            word = word.decode('utf-8')
        if word not in inverted:
            return None
        else:
            element = inverted[wor
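
The listing above breaks off inside search_words, so the author's multi-word and phrase query code is not shown. What follows is a minimal, self-contained sketch of how such queries could be written against the same {word: {doc_id: [locations]}} layout; it is not the original implementation. The names build_index, search_all_words and search_phrase, and the pre-tokenized sample documents (standing in for jieba output), are assumptions made for illustration. It is written for Python 3.

```python
def build_index(docs):
    """Build {word: {doc_id: [positions]}} from {doc_id: [token, ...]}."""
    inverted = {}
    for doc_id, tokens in docs.items():
        for position, word in enumerate(tokens):
            inverted.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return inverted


def search_all_words(inverted, words):
    """Multi-word (AND) query: ids of documents that contain every word."""
    doc_sets = [set(inverted.get(word, {})) for word in words]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)


def search_phrase(inverted, words):
    """Phrase query: ids of documents where the words appear consecutively."""
    matches = set()
    # Only documents containing all words can contain the phrase.
    for doc_id in search_all_words(inverted, words):
        for start in inverted[words[0]][doc_id]:
            # The i-th word of the phrase must sit at position start + i.
            if all(start + offset in inverted[word][doc_id]
                   for offset, word in enumerate(words)):
                matches.add(doc_id)
                break
    return matches


# Hypothetical, already-tokenized documents (stand-ins for jieba output).
docs = {
    "d1": ["我", "爱", "自然", "语言", "处理"],
    "d2": ["语言", "处理", "爱", "我"],
}
index = build_index(docs)
```

With this sample data, search_all_words(index, ["我", "爱"]) matches both documents, while search_phrase(index, ["我", "爱"]) matches only d1, where the two tokens are adjacent in that order. Because jieba's full mode emits overlapping candidate words, positions in the article's index are ordinals in the token stream rather than character offsets, so a phrase query would likely need the query text to be segmented in the same mode.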