English Word Tokenization (Without Tools Like re)

梦幻精灵_cq

Last modified: 2024-01-16 18:47:59


Category column: Practice; article tags: python

First published: 2024-01-15 23:53:09

Article link: https://blog.csdn.net/m0_57158496/article/details/135613713


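Split input English text into individual meaningful words, without using tools such as re.

(Note template created by a Python script on 2024-01-15 23:34:05. This note is suited to coders who know basic programming and are familiar with Python strings.)

[The details of learning are a joyful journey]

Python official site: https://www.python.org/
Free: an expert-written, free "bible" tutorial, 《python 完全自学教程》 (a complete Python self-study tutorial), which covers far more than the basics.
Address: https://lqpybook.readthedocs.io/

Self-study is nothing mysterious. Over a lifetime, a person always spends more time studying alone than studying in school, and there are always more moments without a teacher than with one.
(Hua Luogeng)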


English word tokenization (splitting input English text into meaningful words)


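Contents

◆ English Word Tokenization
- 1. The Seed of an Idea
- 2. Algorithm Walkthrough
  - 2.1 Removing non-letter characters
  - 2.2 Counting word frequencies
  - 2.3 Tokenization
- 3. Complete Source Code (Python)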

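◆ English Word Tokenization

1. The Seed of an Idea

Today I came across jieba on CSDN, and a thought popped into my head: "Could I write a piece of code that segments words the way jieba does?" So I started experimenting...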


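2. Algorithm Walkthrough

English word tokenization is easier to implement than Chinese word segmentation, because the words in an English sentence are already separated by non-letter characters. Once every non-letter character has been replaced with one and the same character, the string method str.split() can split the sentence into words. Dropping any empty strings '' from the list that str.split() returns then completes the tokenization. (With the default separator, str.split() already omits empty strings, so this filter only matters when an explicit separator such as ' ' is passed.) If words that carry little meaning on their own, such as pronouns and prepositions, are also removed, the result is ready to use as text data for a word-cloud image. 😜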
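The two steps described above can be sketched in a few lines. This is only a minimal illustration under my own naming: the function name `tokenize` is hypothetical and does not appear in the article's class, and stop-word removal is left out.

```python
def tokenize(text):
    """Replace every non-letter with a space, then split on whitespace."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    letters += letters.upper()
    # Keep letters as they are; turn everything else into one common separator.
    cleaned = ''.join(ch if ch in letters else ' ' for ch in text)
    # With no argument, split() treats runs of whitespace as one separator.
    return cleaned.split()


print(tokenize("I love Python, and Python loves me!"))
# ['I', 'love', 'Python', 'and', 'Python', 'loves', 'me']
```

Only the 52 ASCII letters count as letters here, so digits, punctuation, and non-English characters all act as word boundaries.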

Test text: 英文美文.txt (6k+ characters).

[Screenshots of the results: token list; word-frequency statistics (middle portion omitted)]

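2.1 Removing non-letter characters

A list comprehension iterates over the input text and uses a conditional (ternary) expression to replace each non-letter character with a single space ' '. The characters are then joined with an empty separator, giving a string of English words separated only by spaces. This completes the "preprocessing" of the input text.

Python code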


    def _isletter(self):
        ''' Replace non-letter characters with spaces. '''
        lowers = ''.join(chr(i) for i in range(ord('a'), ord('z')+1)) # Build the string of 26 lowercase letters.
        letters = tuple(lowers+lowers.upper())
        #input(letters) # Debug: inspect the tuple of letters.
        words = [i if i in letters else ' ' for i in self.words] # Replace each non-letter with a space.
        
        return ''.join(words)
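One side effect of this preprocessing is worth seeing on a concrete input: an apostrophe is a non-letter, so a contraction such as "I'm" falls apart into fragments. The sketch below is a standalone rewrite of the same idea. The function name `strip_non_letters` is my own, and using a set instead of a tuple for the membership test is a small variation, not the article's code.

```python
def strip_non_letters(text):
    """Return text with every non-ASCII-letter character replaced by a space."""
    lowers = ''.join(chr(i) for i in range(ord('a'), ord('z') + 1))
    letters = set(lowers + lowers.upper())  # set: constant-time membership test
    return ''.join(ch if ch in letters else ' ' for ch in text)


sample = "I'm a old man. 我爱Python。"
cleaned = strip_non_letters(sample)
print(len(cleaned) == len(sample))  # True: one character in, one character out
print(cleaned.split())
# ['I', 'm', 'a', 'old', 'man', 'Python']
```

The lone 'm' in the output is the kind of fragment that a stop-word list containing entries such as 'm', 's', 'd', and 're' is meant to clean up later.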


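2.2 Counting word frequencies

A dictionary is usually the convenient choice for counting word frequencies. A single pass over the token list is enough: for each word encountered, add 1 to the value stored under that word's key. When the pass over the list ends, the counting is done.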
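The dictionary approach just described might look like the following. It is a minimal sketch, and `count_words` is a hypothetical helper name rather than part of the article's class.

```python
def count_words(words):
    """Count occurrences of each word in a single pass over the list."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words(['love', 'Python', 'love']))
# {'love': 2, 'Python': 1}
```

Each token is visited exactly once, so the work grows linearly with the length of the token list.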

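In today's code I used a different "algorithm". Python's set, whose elements are unique, deduplicates the token list, and the resulting set of distinct words is the sequence to iterate over. For each distinct word, list.count(word) gives its frequency. Note that list.count() rescans the whole list for every distinct word, so on long texts it does more work than the single-pass dictionary approach. The frequency data can be stored as a list, a tuple, or a dict, whichever suits the need.

Python code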


    def _count(self, words):
        ''' Count word frequencies. '''
        words = [(i, words.count(i)) for i in set(words)] # List comprehension: one (word, count) pair per distinct word.
        words.sort(key=lambda x: x[0]) # Sort by word.
        words.sort(key=lambda x: x[-1], reverse=True) # Sort by frequency, descending.
        
        return words
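The two consecutive sorts work because Python's sort is stable: after sorting by word, the second sort by frequency keeps words with equal counts in alphabetical order. As a hedged aside, the same ordering can be produced with one sort on a compound key. The helper name `rank` and the sample data below are made up for illustration.

```python
def rank(pairs):
    """Order (word, count) pairs by count descending, then by word ascending."""
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


pairs = [('old', 1), ('Python', 2), ('love', 2), ('man', 1)]

# The two-pass version used in _count, for comparison.
two_pass = sorted(pairs, key=lambda p: p[0])
two_pass.sort(key=lambda p: p[1], reverse=True)

print(rank(pairs) == two_pass)  # True
print(rank(pairs))
# [('Python', 2), ('love', 2), ('man', 1), ('old', 1)]
```

'Python' comes before 'love' because uppercase letters compare as smaller than lowercase ones in Python's default string ordering.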


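2.3 Tokenization

Once the input text has been "preprocessed" (non-letter characters replaced with spaces), calling str.split() with its default arguments splits it into words. Removing any empty strings '' from the result then completes the "English tokenization". With the default arguments split() does not actually produce empty strings, so this extra filter is only a safeguard.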
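The difference between the default call and an explicit separator is easy to check. The snippet below is a small demonstration on a made-up string, and the variable names are arbitrary.

```python
cleaned = "a  b "  # two spaces in the middle, one trailing space

explicit = cleaned.split(' ')  # explicit separator: empty strings appear
default = cleaned.split()      # default: runs of whitespace act as one separator

print(explicit)  # ['a', '', 'b', '']
print(default)   # ['a', 'b']

# Filtering the explicit-separator result yields the same tokens.
print([w for w in explicit if w])  # ['a', 'b']
```

So the empty-string filter is needed only if the text is split with split(' ').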

The returned result can be customized as needed:
a. output the tokens directly;
b. count word frequencies;
c. remove words with no real meaning;
d. c and b together: remove the meaningless words and also count frequencies.

The code here uses the fourth form. The meaningless words I could think of are listed below:

I me my mine you your hers she
her hers he his him we our ours
they their them its it a an m s d
did do doing does done can would
am is was are were be have has
often always to too very many any
in on with at of up down go goes
went for about now if but re from
the there this that than when what
where who why so as yes no not
join or and by but

Python code


    def split(self):
        ''' Tokenize the text. '''
        nowords = ('I', 'me', 'my', 'mine', 'you', 'your', 'hers', 'she', 'her', 'hers', 'he', 'his', 'him', 'we', 'our', 'ours', 'they', 'their', 'them', 'its', 'it', 'a', 'an', 'm', 's', 'd', 'did', 'do', 'doing', 'does', 'done', 'can', 'would', 'am', 'is', 'was', 'are', 'were', 'be', 'have', 'has', 'often', 'always', 'to', 'too', 'very', 'many', 'any', 'in', 'on', 'with', 'at', 'of', 'up', 'down', 'go', 'goes', 'went', 'for', 'about', 'now', 'if', 'but', 're','from', 'the', 'there', 'this', 'that', 'than', 'when', 'what', 'where', 'who', 'why', 'so', 'as', 'yes', 'no', 'not', 'join', 'or', 'and', 'by', 'but')
        nowords = list(nowords) + [i.title() for i in nowords] # Also match the capitalized form of each word.
        #input(nowords) # Debug: inspect the stop-word list.
        words = [i for i in self._isletter().split() if i and i not in nowords] # Drop empty strings and stop words.
        #print(words) # Print the token list.
        
        return self._count(words)
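The method above matches each stop word in its listed form and its title-cased form, so an all-caps token such as "THE" would pass through. One possible variation, which is my own suggestion and not the article's method, is to compare the lowercased token against a lowercase set. The name `remove_stopwords` is hypothetical, and `STOPWORDS` is a deliberately tiny illustrative subset rather than the full list.

```python
# Illustrative subset only; the article's full list is much longer.
STOPWORDS = frozenset({'i', 'a', 'an', 'the', 'is', 'm'})


def remove_stopwords(words, stopwords=STOPWORDS):
    """Keep the words whose lowercase form is not in the stop-word set."""
    return [w for w in words if w.lower() not in stopwords]


print(remove_stopwords(['I', 'm', 'a', 'old', 'man', 'THE', 'Python']))
# ['old', 'man', 'Python']
```

A frozenset also makes each membership test constant-time, which helps when the stop-word list grows.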


3. Complete Source Code (Python)



#!/usr/bin/env python
# coding: utf-8


'''
English word tokenization
'''

class EnSplit:
    
    def __init__(self, text):
        self.words = text
        
    def _isletter(self):
        ''' Replace non-letter characters with spaces. '''
        lowers = ''.join(chr(i) for i in range(ord('a'), ord('z')+1)) # Build the string of 26 lowercase letters.
        letters = tuple(lowers+lowers.upper())
        #input(letters) # Debug: inspect the tuple of letters.
        words = [i if i in letters else ' ' for i in self.words] # Replace each non-letter with a space.
        
        return ''.join(words)
        
    def _count(self, words):
        ''' Count word frequencies. '''
        words = [(i, words.count(i)) for i in set(words)] # List comprehension: one (word, count) pair per distinct word.
        words.sort(key=lambda x: x[0]) # Sort by word.
        words.sort(key=lambda x: x[-1], reverse=True) # Sort by frequency, descending.
        
        return words

    def split(self):
        ''' Tokenize the text. '''
        nowords = ('I', 'me', 'my', 'mine', 'you', 'your', 'hers', 'she', 'her', 'hers', 'he', 'his', 'him', 'we', 'our', 'ours', 'they', 'their', 'them', 'its', 'it', 'a', 'an', 'm', 's', 'd', 'did', 'do', 'doing', 'does', 'done', 'can', 'would', 'am', 'is', 'was', 'are', 'were', 'be', 'have', 'has', 'often', 'always', 'to', 'too', 'very', 'many', 'any', 'in', 'on', 'with', 'at', 'of', 'up', 'down', 'go', 'goes', 'went', 'for', 'about', 'now', 'if', 'but', 're','from', 'the', 'there', 'this', 'that', 'than', 'when', 'what', 'where', 'who', 'why', 'so', 'as', 'yes', 'no', 'not', 'join', 'or', 'and', 'by', 'but')
        nowords = list(nowords) + [i.title() for i in nowords] # Also match the capitalized form of each word.
        #input(nowords) # Debug: inspect the stop-word list.
        words = [i for i in self._isletter().split() if i and i not in nowords] # Drop empty strings and stop words.
        print(words) # Print the token list.

        return self._count(words)


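if __name__ == '__main__':
    # Small built-in sample; the Chinese line shows that non-English characters are stripped.
    text = '''
    I'm a old man. I love Python.
    我是一个老男人，我爱Python。
    '''
    # The test file replaces the sample above (the file is assumed to be UTF-8 encoded).
    with open('/sdcard/Documents/英文美文.txt', encoding='utf-8') as f:
        text = f.read()
    en = EnSplit(text)
    print('\n'.join([f"{i[0]}: {i[-1]}" for i in en.split()]))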