Keras Tokenizer: Text and Sequence Preprocessing

The Tokenizer class

Attributes

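  • document_count: the number of documents processed
  • word_index: a dict mapping each word to its integer id; ids start at 1, and more frequent words get lower ids
  • word_counts: a dict holding the total number of times each word occurs across all documents
  • word_docs: a dict holding the number of documents in which each word appears
  • index_docs: a dict holding the number of documents in which each word id appears (the same counts as word_docs, keyed by id instead of by word)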
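
To make the relationship between these attributes concrete, here is a minimal pure-Python sketch that computes the same five values for a two-document toy corpus. It is not the Keras implementation: the function name fit_on_texts_sketch and the toy texts are made up for illustration, and tokenization is simplified to lowercasing plus splitting on whitespace (the real Tokenizer also strips punctuation by default).

```python
from collections import Counter, defaultdict


def fit_on_texts_sketch(texts):
    """Hypothetical helper: mimics the bookkeeping done while fitting."""
    word_counts = Counter()        # word -> total occurrences
    word_docs = defaultdict(int)   # word -> number of documents containing it
    document_count = 0

    for doc in texts:
        document_count += 1
        words = doc.lower().split()  # simplified tokenization (assumption)
        word_counts.update(words)
        for word in set(words):      # count each word once per document
            word_docs[word] += 1

    # Rank words by frequency, most frequent first; ids start at 1.
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

    # Same document counts as word_docs, but keyed by word id.
    index_docs = {word_index[word]: n for word, n in word_docs.items()}

    return {
        "document_count": document_count,
        "word_counts": dict(word_counts),
        "word_docs": dict(word_docs),
        "word_index": word_index,
        "index_docs": index_docs,
    }


texts = ["the cat sat", "the dog sat on the mat"]
stats = fit_on_texts_sketch(texts)

print(stats["document_count"])     # 2
print(stats["word_index"]["the"])  # 1, because "the" is the most frequent word
print(stats["word_counts"]["the"])  # 3 occurrences in total
print(stats["word_docs"]["the"])    # appears in 2 documents
```

Note that word_counts and word_docs differ for "the": it occurs three times overall but in only two documents.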

Example:

In this example, text is a list of 16104 items; each item is one IMDB movie review, for example:

Passport to Pimlico is a real treat for all fans of British cinema. Not only is it an enjoyable and thoroughly entertaining comedy, but it is a cinematic flashback to a bygone age, with attitudes and scenarios sadly now only a memory in British life.<br /><br />Stanley Holloway plays Pimlico resident Arthur Pemberton, who after the accidental detonation of an unexploded bomb, discovers a wealth of medieval treasure belonging to the 14th Century Duke of Burgundy that has been buried deep underneath their little suburban street these last 600 years.<br /><br />Accompanying the treasure is an ancient legal decree signed by King Edward IV of England (which has never been officially rescinded) to state that that particular London street had been declared Burgandian soil, which means that in the eyes of international law, Pemberton and the other local residents are no longer British subjects but natives of Burgundy and their tiny street an independent country in it's own right and a law unto itself.<br /><br />This sets the war-battered and impoverished residents up in good stead as they believe themselves to be outside of English law and jurisdiction, so in an act of drunken defiance they burn their ration books, destroy and ignore their clothing coupons, flagrantly disregard British licencing laws etc, declaring themselves fully independent from Britain.<br /><br />However, what then happens is ever spiv, black marketeer and dishonest crook follows suit and crosses the 'border' into Burgundy as a refuge from the law and post-war restrictions to sell their dodgy goods, and half of London's consumers follow them in order to dodge the ration, making their quiet happy little haven, a den of thieves and a rather crowded one at that.<br /><br />Appealing to Whitehall for assistance, they are told that due to developments this is "now a matter of foreign policy, which His Majesty's Government is reluctant to become involved" which leaves the residents high and dry. 
They do however declare the area a legal frontier and as such set up a fully equipped customs office at the end of the road, mainly to monitor smuggling than to ensure any safety for the residents of Pimlico.<br /><br />Eventually the border is closed altogether starting a major siege, with the Bugundian residents slowly running out of water and food, but never the less fighting on in true British style. As one Bugundian resident quotes, "we're English and we always were English, and it's just because we are English, we are fighting so hard to be Bugundians" <br /><br />A sentiment that is soon echoed throughout the capital as when the rest of London learn of the poor Bugundians plight they all feel compelled to chip in and help them, by throwing food and supplies over the barbed wire blockades.<br /><br />Will Whitehall, who has fought off so may invaders throughout the centuries finally be brought to it's knees by this new batch of foreigners, especially as these ones are English!!!! <br /><br />Great tale, and great fun throughout. Not to be missed.
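The Tokenizer is then built from this list:

from keras import preprocessing

# num_words: the maximum number of words to keep, ranked by word frequency.
# Only the num_words - 1 most frequent words are kept when texts are converted.
tokenizer = preprocessing.text.Tokenizer(num_words=10000)

# Build the vocabulary from the list text
tokenizer.fit_on_texts(texts=text)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
# Prints: Found 71016 unique tokens.

# fit_on_texts indexes every word, so word_index holds the full vocabulary.
# texts_to_sequences, however, keeps only the words allowed by the num_words
# argument passed when the Tokenizer was created; rarer words are dropped.
sequences = tokenizer.texts_to_sequences(text)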
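
The effect of num_words at conversion time can be illustrated with another small pure-Python sketch. Again, this is not the Keras source: texts_to_sequences_sketch is a hypothetical helper, the word_index is a hand-written toy vocabulary, and tokenization is reduced to lowercasing and whitespace splitting. It assumes no out-of-vocabulary token is configured, so filtered words are simply skipped.

```python
def texts_to_sequences_sketch(texts, word_index, num_words=None):
    """Hypothetical helper: map words to ids, dropping rare or unknown words."""
    sequences = []
    for doc in texts:
        seq = []
        for word in doc.lower().split():  # simplified tokenization (assumption)
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen during fitting
            if num_words is not None and idx >= num_words:
                continue  # word is outside the num_words - 1 most frequent
            seq.append(idx)
        sequences.append(seq)
    return sequences


# Toy vocabulary, ordered by frequency (id 1 is the most frequent word).
toy_word_index = {"the": 1, "sat": 2, "cat": 3, "dog": 4, "on": 5, "mat": 6}
toy_texts = ["the cat sat", "the dog sat on the mat"]

full = texts_to_sequences_sketch(toy_texts, toy_word_index)
limited = texts_to_sequences_sketch(toy_texts, toy_word_index, num_words=4)

print(full)     # [[1, 3, 2], [1, 4, 2, 5, 1, 6]]
print(limited)  # [[1, 3, 2], [1, 2, 1]]
```

With num_words=4 only ids 1 to 3 survive, which is why the full word_index in the example above can contain 71016 entries while the resulting sequences only ever use ids below 10000.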

 

 
