extraction: extracting features from raw data
transformation: scaling, converting or modifying features
selection: selecting a subset from a larger set of features
locality sensitive hashing (LSH): a class of algorithms that combines aspects of feature transformation with other algorithms
feature extractors:
tf-idf
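1 tf-idf: term frequency-inverse document frequency
terms used: term (word), document, corpus
tf is the number of times a term appears in a document; df is the number of documents in the corpus that contain the term
tf-idf = tf * idf, where idf (inverse document frequency) shrinks as df grows, so terms that appear in many documents get less weight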
2 HashingTF and CountVectorizer can both be used to generate tf vectors
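
The tf, df and tf-idf definitions above can be sketched in plain Python. This is only an illustration, not Spark code: the function name tf_idf and the sample corpus are made up here, and the smoothed form idf = log((N + 1) / (df + 1)), with N the number of documents, is an assumption about which idf variant is wanted.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf per document. `corpus` is a list of token lists.

    Hypothetical helper for illustration; assumes the smoothed idf
    log((N + 1) / (df + 1)), where N is the number of documents.
    """
    n_docs = len(corpus)

    # df: number of documents in the corpus that contain each term
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    idf = {term: math.log((n_docs + 1) / (d + 1)) for term, d in df.items()}

    scores = []
    for doc in corpus:
        tf = Counter(doc)  # tf: number of times each term appears in this document
        scores.append({term: count * idf[term] for term, count in tf.items()})
    return scores

# Made-up sample corpus: three tokenized documents
corpus = [
    ["spark", "ml", "spark"],
    ["ml", "features"],
    ["spark", "features", "hashing"],
]
scores = tf_idf(corpus)
```

A term that occurs in only one document ("hashing") gets a larger idf than one that occurs in two ("spark"), which is the down-weighting effect tf-idf is meant to provide.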
word2vec
1 takes the sequences of words that represent documents and trains a Word2VecModel; Word2Vec is an Estimator
2 the model maps each word to a unique fixed-size vector
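
What the trained model provides can be pictured as a lookup table from words to fixed-size vectors. The sketch below does not train anything: the 3-dimensional vectors are hand-written placeholders, and the names toy_vectors and document_vector are hypothetical. Averaging the word vectors to get one vector per document is shown as one common way to use such a model.

```python
VECTOR_SIZE = 3

# Placeholder vectors standing in for a trained model's output (not real embeddings)
toy_vectors = {
    "spark": [0.2, -0.1, 0.4],
    "ml": [0.0, 0.3, -0.2],
    "features": [-0.4, 0.1, 0.1],
}

def document_vector(words):
    """Average the vectors of the known words in a document."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:
        # No known words: fall back to the zero vector
        return [0.0] * VECTOR_SIZE
    # zip(*known) groups the values of each dimension together
    return [sum(dim) / len(known) for dim in zip(*known)]

doc_vec = document_vector(["spark", "ml"])
```

Every word maps to a vector of the same length, so documents of any length end up with a vector of size VECTOR_SIZE.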
countvectorizer
CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts
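
The two-step idea (fit a vocabulary, then turn each document into a count vector) can be sketched in plain Python. The names fit_vocabulary and transform are hypothetical stand-ins for the estimator and model steps, and ordering the vocabulary by corpus-wide term count with alphabetical tie-breaking is an assumption made to keep the output deterministic.

```python
from collections import Counter

def fit_vocabulary(corpus, vocab_size=None):
    """Estimator step: build a term -> index map from the corpus.

    Terms are ordered by total count across the corpus (ties broken
    alphabetically); only the top `vocab_size` terms are kept, or all
    terms when vocab_size is None.
    """
    counts = Counter(token for doc in corpus for token in doc)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    terms = [term for term, _ in ordered[:vocab_size]]
    return {term: index for index, term in enumerate(terms)}

def transform(doc, vocabulary):
    """Model step: convert one tokenized document into a token-count vector."""
    vector = [0] * len(vocabulary)
    for token in doc:
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] += 1
    return vector

corpus = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocabulary = fit_vocabulary(corpus)
vectors = [transform(doc, vocabulary) for doc in corpus]
# vectors == [[1, 1, 1], [2, 2, 1]]
```

Unlike hashing, this approach keeps an explicit vocabulary, so each vector position can be traced back to a term.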
featurehasher
projects a set of features into a feature vector of a specified dimension
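
The hashing trick behind this can be sketched as follows. This is an illustration under stated assumptions rather than Spark's implementation: hash_features and the sample row are made up, zlib.crc32 is swapped in as a simple deterministic hash, numeric features are hashed by name and contribute their value, and all other features are hashed as "name=value" and contribute 1.0.

```python
import zlib

def hash_features(features, num_features=16):
    """Project a dict of named features into a vector of length num_features."""
    vector = [0.0] * num_features
    for name, value in features.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric:
            # Numeric feature: the name picks the slot, the value is added to it
            key, weight = name, float(value)
        else:
            # Categorical feature: "name=value" picks the slot, indicator 1.0 is added
            key, weight = f"{name}={value}", 1.0
        # crc32 is an assumption here, chosen only because it is deterministic
        index = zlib.crc32(key.encode("utf-8")) % num_features
        vector[index] += weight  # colliding features share a slot and add up
    return vector

# Made-up sample row with one numeric, one boolean and one string feature
row = {"real": 2.2, "bool": True, "string": "foo"}
hashed = hash_features(row, num_features=16)
```

The output length depends only on num_features, not on how many distinct feature names or values exist, at the cost of possible collisions between features.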