Tricks for feature-extracting text: extend the vectorizer with NLTK's stemmer


This is a reading note from 'Building Machine Learning Systems with Python' (p. 59).

Train_data = [
    'This is a toy post about machine learning. Actually, it contains not much interesting stuff.',
    'Imaging databases provide storage capabilities.',
    'Most imaging databases save images permanently.',
    'Imaging databases store data.',
    'Imaging databases store data. Imaging databases store data. Imaging databases store data.',
    'Does imaging databases store data?']

There are some tricks worth knowing when extracting text features with sklearn.feature_extraction.text:

1. Normalization: if we would like to consider term frequencies instead of raw counts, we just define a normalize function that scales each count vector to unit length.

import scipy as sp

def normalize(a):
    return a / sp.linalg.norm(a)  # scale the count vector to unit (Euclidean) length

As a result, the normalized count vectors of Train_data[3] and Train_data[4] become equal, since post 4 is simply post 3 repeated three times.
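
As a quick check (a minimal sketch; it assumes Train_data and normalize from above, and uses NumPy only for the comparison):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(Train_data).toarray().astype(float)
print(np.allclose(normalize(X[3]), normalize(X[4])))  # True: same direction, same unit vector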


2. Removing less important words: words such as "most" appear very often in all sorts of different contexts; words like this are called "stop words". The best option is to remove all words that are so frequent that they do not help to distinguish between different texts.

e.g. vectorizer = CountVectorizer(min_df=1, stop_words='english')
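
To see the effect (a small sketch, assuming Train_data from above), fit the vectorizer and inspect the vocabulary; frequent function words such as 'most', 'about', and 'not' no longer show up:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
vectorizer.fit(Train_data)
print(vectorizer.get_feature_names())  # no 'most', 'about', 'not' in the list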


3. [Most important!] Stemming: we count similar words in different variants as different words, for instance 'imaging' and 'images'. It would make sense to count them together; after all, they refer to the same concept. That's why we need NLTK.

import nltk.stem

s = nltk.stem.SnowballStemmer('english')

s.stem('imaging')      # u'imag'
s.stem('image')        # u'imag'
s.stem('imagination')  # u'imagin'

Then, we extend the vectorizer with NLTK's stemmer.

We need to stem the posts before we feed them into CountVectorizer. The class provides several hooks with which we can customize the preprocessing and tokenization stages: the preprocessor and the tokenizer can be set as constructor parameters. We do not want to place the stemmer into either of them, because we would then have to do the tokenization and normalization ourselves. Instead, we override the method build_analyzer as follows.


import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # reuse CountVectorizer's preprocessing and tokenization, then stem each token
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')


new_post = ["imaging databases provides storage capabilities and store image."]

vectorizer.fit_transform(new_post).toarray()   # We get [[1 1 2 1 1 1]]

print(vectorizer.get_feature_names())   # We get [u'capabl', u'databas', u'imag', u'provid', u'storag', u'store']

# We now have one feature fewer, because 'image' and 'imaging' collapse into the same stem 'imag'.

Super in Class Inheritance

A typical use for calling a cooperative superclass method is:

   class C(B):
       def meth(self, arg):
           super(C, self).meth(arg)

So the analyzer we obtain via super() is the one built by CountVectorizer's own build_analyzer(), which we then wrap with the stemmer.


TfidfVectorizer inherits from CountVectorizer and additionally applies the TF-IDF weighting. Similarly, we can inherit from TfidfVectorizer, as sketched below.
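
Following the same build_analyzer pattern as above, a minimal sketch (the class name StemmedTfidfVectorizer is our own choice):

import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # stem the tokens produced by TfidfVectorizer's own analyzer
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')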


4. Drawbacks

Here is our current text preprocessing pipeline (all five steps can be expressed in a single vectorizer; see the sketch after this list):

1. tokenizing the text

2. throwing away words that occur way too often to be of any help in detecting relevant posts

3. throwing away words that occur so seldom that there is only a small chance that they occur in future posts

4. counting the remaining words

5. calculating TF-IDF values from the counts, considering the whole text corpus
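
A sketch of that pipeline in one call (the min_df/max_df thresholds here are illustrative choices, not from the book):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2,             # step 3: drop words seen in fewer than 2 documents
                             max_df=0.9,           # step 2: drop words seen in over 90% of documents
                             stop_words='english') # step 2 as well: drop common English stop words
X = vectorizer.fit_transform(Train_data)           # steps 1, 4, 5: tokenize, count, weight by TF-IDF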


But the drawbacks are also obvious:

1. It does not cover word relations; for example, 'Car hits wall' and 'Wall hits car' will have the same feature vector.

2. It does not capture negations correctly. For example, 'I will eat ice cream' and 'I will not eat ice cream' will look very similar.

3. It totally fails with misspelled words, although it is clear to a reader that 'database' and 'databas' convey the same meaning.


The first two drawbacks can be mitigated by n-grams, that is, counting sequences of adjacent words instead of single words, as sketched below.
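
For example (a minimal sketch; the parameter values are illustrative), with ngram_range=(1, 2) the vectorizer counts word pairs as well as single words, so 'car hits' and 'wall hits' become distinct features and the two sentences get different vectors:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, ngram_range=(1, 2))  # unigrams and bigrams
vectorizer.fit(['Car hits wall.', 'Wall hits car.'])
print(vectorizer.get_feature_names())
# [u'car', u'car hits', u'hits', u'hits car', u'hits wall', u'wall', u'wall hits']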
