from sklearn.model_selecting import train_test_spilt()
参数stratify: 依据标签y,按原数据y中各类比例,分配给train和test,使得train和test中各类数据的比例与原数据集一样。
例如:A:B:C=1:2:3
split后,train和test中,都是A:B:C=1:2:3
将stratify=X就是按照X中的比例分配
将stratify=y就是按照y中的比例分配
一般都是=y
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
TF-IDF (Term Frequency - Inverse Document Frequency)
TfidfVectorizer 参数意义:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.build_tokenizer
详细解释:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction