This week, we focus on text classification.
Tip: text classification can be used, for example, for sentiment analysis.
Text preprocessing
Tokenization
How we process text depends on what we consider a text to be: a sequence of
- characters
- words
- phrases and named entities
- sentences
- paragraphs
Here, we think of text as a sequence of words, because a word is a meaningful sequence of characters.
Therefore, we should extract all the words from a sentence. This process is called tokenization. So what is the boundary of a word?
Here, we mainly talk about English.
In English, we can split a sentence by spaces or punctuation.
Three tokenizers are built into Python's nltk library:
- whitespace tokenizer
- punctuation tokenizer
- Treebank word tokenizer
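To make the difference concrete, here is a minimal sketch of the first two strategies using only Python's re module (the example sentence is made up; nltk's WhitespaceTokenizer and WordPunctTokenizer behave like this, while the Treebank tokenizer adds hand-written rules, e.g. splitting contractions):

```python
import re

text = "Don't hesitate, it's worth $5.99!"

# Whitespace tokenization: split on runs of whitespace.
# Punctuation stays attached to words ("hesitate,").
whitespace_tokens = text.split()

# Punctuation tokenization: alternate runs of word characters
# and runs of punctuation, so "," and "!" become their own tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)  # ["Don't", 'hesitate,', "it's", 'worth', '$5.99!']
print(punct_tokens)       # ['Don', "'", 't', 'hesitate', ',', ...]
```

Note how the punctuation tokenizer breaks "Don't" into three tokens; the Treebank tokenizer would instead produce the more meaningful pair "Do" and "n't".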
Normalization
- stemming
- lemmatization
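The two approaches can be sketched in a few lines of pure Python. This toy stemmer and dictionary lemmatizer are hypothetical stand-ins for real tools such as nltk's PorterStemmer and WordNetLemmatizer:

```python
# Stemming: chop off common suffixes with crude rules
# (a much simplified stand-in for a Porter-style stemmer).
def crude_stem(word):
    for suffix in ("sses", "ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization: map a word to its dictionary form. Real lemmatizers
# use a full vocabulary; this lookup table is just for illustration.
LEMMAS = {"feet": "foot", "wolves": "wolf", "was": "be"}

def crude_lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("walking"))    # walk
print(crude_lemmatize("feet"))  # foot
```

Stemming is fast but can produce non-words; lemmatization returns real dictionary forms but needs a vocabulary.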
Transforming tokens into features (text to vector)
Bag of words
- count the occurrences of each token in our text
Problems:
- we lose word order
- the counts are not normalized
Solutions:
- to preserve some word order, we also count token pairs, triplets, etc. (n-grams)
- but then we have too many features
- so we remove n-grams whose document frequency (df) in our corpus is too high or too low
- the remaining features appear moderately often across the documents of our corpus; next, we set the value of each feature column to the term frequency (tf)
- more precisely, instead of merely filtering on df, we can use df in detail and weight each feature by its inverse document frequency (idf)
Then we multiply tf and idf together to get the value of each feature column; this is the tf-idf weighting.
Python code:
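A minimal sketch of the whole pipeline in pure Python, on a made-up three-document corpus (in practice scikit-learn's TfidfVectorizer does this, plus n-gram extraction and df filtering):

```python
import math
from collections import Counter

docs = ["good movie", "not a good movie", "did not like"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# df: in how many documents each token appears
df = Counter(tok for doc in tokenized for tok in set(doc))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    vec = []
    for term in vocab:
        tf = counts[term] / len(tokens)        # normalized term frequency
        idf = math.log(len(docs) / df[term])   # inverse document frequency
        vec.append(tf * idf)                   # tf-idf weight
    return vec

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print(vectors[0])
```

A token that occurs in every document gets idf = log(1) = 0, so uninformative words are automatically down-weighted.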
For now, you have vectorized your text into a vector of numbers, but you still haven't done any classification. The simplest classifier to start with is logistic regression.
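Once each text is a fixed-length feature vector (such as the tf-idf vectors above), logistic regression can be trained with plain gradient descent. The toy 2-feature data below is hypothetical; in practice you would use scikit-learn's LogisticRegression on the full tf-idf matrix:

```python
import math

# Toy data: 2-dimensional feature vectors with sentiment labels (1 = positive).
X = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the logistic loss.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(1000):
    for xi, yi in zip(X, y):
        pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = pred - yi
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

def predict(x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

print(predict([0.9, 0.0]))  # > 0.5, predicted positive
print(predict([0.0, 0.9]))  # < 0.5, predicted negative
```

The learned weights are interpretable: each weight tells you how strongly a token (feature) pushes the prediction toward one class.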