NLP Learning Notes 1: Text Preprocessing

1. Approach

The text dataset used for testing is as follows:

Look, if you had one shot, or one opportunity
To seize everything you ever wanted in one moment
Would you capture it or just let it slip?
Yo
His palms are sweaty, knees weak, arms are heavy
There's vomit on his sweater already, mom's spaghetti
He's nervous, but on the surface he looks calm and ready
To drop bombs, but he keeps on forgettin
What he wrote down, the whole crowd goes so loud
He opens his mouth, but the words won't come out
He's chokin, how everybody's jokin now
The clock's run out, time's up over, bloah!
Snap back to reality, Oh there goes gravity
Oh, there goes Rabbit, he choked
He's so mad, but he won't give up that
Is he? No
He won't have it , he knows his whole back city's ropes
It don't matter, he's dope
He knows that, but he's broke
He's so stacked that he knows
When he goes back to his mobile home, that's when it's
Back to the lab again yo
This whole rhapsody
He better go capture this moment and hope it don't pass him
You better lose yourself in the music, the moment
You own it, you better never let it go
You only get one shot, do not miss your chance to blow
This opportunity comes once in a lifetime yo
You better lose yourself in the music, the moment
You own it, you better never let it go
You only get one shot, do not miss your chance to blow
This opportunity comes once in a lifetime yo
The soul's escaping, through this hole that it's gaping
This world is mine for the taking
Make me king, as we move toward a, new world order
A normal life is borin, but superstardom's close to post mortar
It only grows harder, only grows hotter
He blows us all over these hoes is all on him
Coast to coast shows, he's know as the globetrotter
Lonely roads, God only knows
He's grown farther from home, he's no father
He goes home and barely knows his own daughter
But hold your nose cuz here goes the cold water
These hoes don' t want him no mo', he' s cold product
They moved on to the next schmoe who flows
He nose dove and sold nada
So the soap opera is told and unfolds
I suppose it's old potna, but the beat goes on
Da da dum da dum da da
You better lose yourself in the music, the moment
You own it, you better never let it go
You only get one shot, do not miss your chance to blow
This opportunity comes once in a lifetime yo
You better lose yourself in the music, the moment
You own it, you better never let it go
You only get one shot, do not miss your chance to blow
This opportunity comes once in a lifetime yo
No more games, I'ma change what you call rage
Tear this mothafuckin roof off like 2 dogs caged
I was playin in the beginnin, the mood all changed
I been chewed up and spit out and booed off stage
But I kept rhymin and stepwritin the next cypher
Best believe somebody's payin the pied piper
All the pain inside amplified by the fact
That I can't get by with my 9 to 5
And I can't provide the right type of life for my family
Cuz man, these goddam food stamps don't buy diapers
And it's no movie, there's no Mekhi Phifer, this is my life
And these times are so hard and it's getting even harder
Tryin to feed and water my seed, plus
See dishonor caught up bein a father and a prima donna
Baby mama drama's screamin on and
Too much for me to wanna
Stay in one spot, another jam or not
Has gotten me to the point, I'm like a snail
I've got to formulate a plot fore I end up in jail or shot
Success is my only mothafuckin option, failure's not
Mom, I love you, but this trail has got to go
I cannot grow old in Salem's lot
So here I go is my shot.
Feet fail me not cuz maybe the only opportunity that I got
You better lose yourself in the music, the moment
You own it, you better never let it go
You only get one shot, do not miss your chance to blow
This opportunity comes once in a lifetime yo
You better lose yourself in the music, the moment
You own it, you better never let it go
You only get one shot, do not miss your chance to blow
This opportunity comes once in a lifetime yo
You can do anything you set your mind to, man

In the program, we first read the given text line by line, which yields a data structure like this:

['look if you had one shot or one opportunity', 'to seize everything you ever wanted in one moment',...]
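The sample output above has no punctuation or capital letters because each line is normalized as it is read. A minimal sketch of that normalization on single lines follows; the helper name `normalize_line` is illustrative and is not part of preprocessing.py.

```python
import re

def normalize_line(line):
    # Illustrative helper (assumed name): replace each run of non-letter
    # characters with one space, then trim and lowercase the result.
    return re.sub('[^A-Za-z]+', ' ', line).strip().lower()

print(normalize_line("Look, if you had one shot, or one opportunity"))
# look if you had one shot or one opportunity
print(normalize_line("He's nervous, but on the surface he looks calm and ready"))
# he s nervous but on the surface he looks calm and ready
```

Note that apostrophes are also turned into spaces, so a contraction such as "He's" is split into 'he' and 's'. This is why 's' shows up as a token of its own in the frequency counts.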

Step 2: split each element of the structure above into words to obtain all the tokens in the given text:

[['look', 'if', 'you', 'had', 'one', 'shot', 'or', 'one', 'opportunity'], ['to', 'seize', 'everything', 'you', 'ever', 'wanted', 'in', 'one', 'moment'],...]
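Splitting into words is only one possible granularity; the `tokenize` function in preprocessing.py also accepts a character mode. As a rough sketch of the difference between the two, using two sample lines chosen here for illustration:

```python
# Two already-normalized lines (sample data for illustration)
lines = ['look if you had one shot', 'yo']

# Word-level tokens: split each line on whitespace
word_tokens = [line.split() for line in lines]

# Character-level tokens: every character, including spaces, is a token
char_tokens = [list(line) for line in lines]

print(word_tokens[0])  # ['look', 'if', 'you', 'had', 'one', 'shot']
print(char_tokens[1])  # ['y', 'o']
```

Character-level tokenization gives a much smaller vocabulary but much longer sequences, so which mode fits better depends on the downstream model.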

Step 3: count how often each token occurs and sort the tokens by frequency in descending order:

[('the', 33), ('you', 31), ('s', 29), ('he', 26), ('it', 23),...]

Counting token frequencies lets us filter out tokens that occur only rarely.

Step 4: build a data structure that returns the token for a given index:
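To make the filtering idea concrete, here is a small sketch on toy data. The token lists and the threshold value are assumptions chosen for illustration, not taken from the lyrics.

```python
import collections

# Toy token lists (illustrative data, not the full lyrics)
tokens = [['you', 'better', 'lose', 'yourself'],
          ['you', 'own', 'it', 'you', 'better']]

# Flatten the nested lists and count every token
counter = collections.Counter(t for line in tokens for t in line)

# Sort by frequency, most frequent first
token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)

min_freq = 2  # assumed threshold: drop tokens seen fewer than 2 times
kept = [token for token, freq in token_freqs if freq >= min_freq]
print(kept)  # ['you', 'better']
```

Tokens that are dropped this way are not lost from the text itself; they would simply be mapped to the unknown token when the text is converted to indices.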

['<unk>', 'the', 'you', 's', 'he', 'it', 'to', 'in', 'and', 'i', 'this', 'better', 'a', 'only', 'one',...]

Here we manually add an unknown token, '<unk>', at index 0.

Step 5: build a data structure that returns the index for a given token:

{'<unk>': 0, 'the': 1, 'you': 2, 's': 3,...}
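The two structures from steps 4 and 5 are inverses of each other. The sketch below shows a round trip on a tiny hand-written vocabulary; the function names `encode` and `decode` are illustrative and do not appear in preprocessing.py.

```python
# Tiny hand-written vocabulary (illustrative); index 0 is the unknown token
idx_to_token = ['<unk>', 'the', 'you', 's']
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

def encode(words):
    # Tokens missing from the vocabulary fall back to index 0 ('<unk>')
    return [token_to_idx.get(w, 0) for w in words]

def decode(indices):
    # A plain list lookup maps each index back to its token
    return [idx_to_token[i] for i in indices]

ids = encode(['you', 'the', 'spaghetti'])
print(ids)          # [2, 1, 0]
print(decode(ids))  # ['you', 'the', '<unk>']
```

The round trip is lossy for out-of-vocabulary words: 'spaghetti' comes back as '<unk>', which is the intended behavior of the unknown token.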

2. Code implementation

preprocessing.py:

import collections
import re

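# Read the given text line by line
def read_txt(path):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Replace every run of non-letter characters with a single space,
    # then strip surrounding whitespace and lowercase the line
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]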

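# Split each line of text into tokens
def tokenize(lines, type='word'):
    if type == 'word':
        return [line.split() for line in lines]
    elif type == 'char':
        return [list(line) for line in lines]
    else:
        # Fail loudly instead of printing a message and silently returning None
        raise ValueError('Unknown token type: ' + type)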

# Count how many times each token occurs in tokens
def count_corpus(tokens):
    if len(tokens) == 0 or isinstance(tokens[0], list):
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)

class Vocab:
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []

        # Count the tokens and sort them by frequency, most frequent first
        counter = count_corpus(tokens)

        # [('the', 33), ('you', 31), ('s', 29), ('he', 26), ('it', 23),...]
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)

        # The unknown token has index 0
        # idx_to_token: given an index, return the token; e.g. index 0 is '<unk>'
        self.idx_to_token = ['<unk>'] + reserved_tokens

        # token_to_idx: a dict that returns the index of a given token
        # e.g. {'<unk>': 0, 'the': 1, 'you': 2, 's': 3,...}
        self.token_to_idx = {token: idx for idx, token in enumerate(self.idx_to_token)}

        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    # Given a token or a list of tokens, return the corresponding index or indices
    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    # Given an index or a list of indices, return the corresponding token or tokens
    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    @property
    def unk(self):
        return 0

    @property
    def token_freqs(self):
        return self._token_freqs

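# Other modules can call this function to use everything implemented in this module
def preprocessing(path, max_tokens=-1, token_mode='word'):
    lines = read_txt(path)
    tokens = tokenize(lines, token_mode)

    # The Vocab object to be returned
    vocab = Vocab(tokens)

    # corpus is the flat list of indices for every token in the given text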
    corpus = [vocab[token] for line in tokens for token in line]

    if max_tokens > 0:
        corpus = corpus[:max_tokens]

    return corpus, vocab

Reference:

Dive into Deep Learning (《动手学深度学习》), 2.0.0 documentation: https://zh-v2.d2l.ai/
