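I'm on MATLAB R2019b, tested and working.
During the 2020 MCM/ICM I was on R2019a, which didn't have this toolbox (Text Analytics Toolbox), so I had to pick up Python on the fly and paid for it.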
1. Text preprocessing may need to deal with the following
- Variations in case, for example "new" and "New"
- Variations in word forms, for example "walk" and "walking"
- Words which add noise, for example "stop words" such as "the" and "of"
- Punctuation and special characters
- HTML and XML tags
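
The official example in the next section doesn't call anything for the case and tag items, so here is a minimal sketch for those two. It assumes the toolbox's eraseTags function and the built-in lower; the raw string is made up for illustration.

```matlab
% Hypothetical raw input with markup and mixed case (not from the original post).
raw = "<p>A NEW tree fell near the <b>New</b> bridge.</p>";

% eraseTags (Text Analytics Toolbox) strips HTML and XML tags from the text.
clean = eraseTags(raw);

% lower converts everything to lowercase, so "NEW" and "New" both become "new".
clean = lower(clean);

disp(clean)
```

Doing this on the raw strings before tokenizedDocument keeps the tags from ever becoming tokens.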
2. Official example code
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = preprocessTextData(textData)

function documents = preprocessTextData(textData)
    % Tokenize the text.
    documents = tokenizedDocument(textData);
    % Lemmatize the words. To improve lemmatization, first use
    % addPartOfSpeechDetails.
    documents = addPartOfSpeechDetails(documents);
    documents = normalizeWords(documents,'Style','lemma');
    % Erase punctuation.
    documents = erasePunctuation(documents);
    % Remove a list of stop words.
    documents = removeStopWords(documents);
    % Remove words with 2 or fewer characters, and words with 15 or more
    % characters.
    documents = removeShortWords(documents,2);
    documents = removeLongWords(documents,15);
end
3. documents = tokenizedDocument(textData);
Tokenizes the text, splitting each string into individual word and punctuation tokens.
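
To see what tokenization produces, a small sketch along these lines can help. It reuses the two sentences from the official example and assumes doclength and the string conversion for a scalar tokenizedDocument, both from the same toolbox.

```matlab
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = tokenizedDocument(textData);

% Convert the first document back to a string array, one element per token.
tokens = string(documents(1))

% Number of tokens in each document (punctuation counts as a token too).
counts = doclength(documents)
```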
4. documents = addPartOfSpeechDetails(documents);
Adds part-of-speech details to the documents.
To print the token details:
tdetails = tokenDetails(documents);
head(tdetails)
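Before the call, the token details table has no PartOfSpeech column; after it, each token is tagged with its part of speech.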
5. Reduce words to their base form (lemmatization)
documents = normalizeWords(documents,'Style','lemma');
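
For a feel of what the 'lemma' style does, here is a rough sketch that runs it next to the function's other style, 'stem', on a made-up sentence built around the "walk" / "walking" example from section 1. The sentence and variable names are mine, not from the official example.

```matlab
% Hypothetical sentence containing several forms of the same word.
documents = tokenizedDocument("She was walking while the others walked.");

% Part-of-speech details help the lemmatizer choose the right base form.
documents = addPartOfSpeechDetails(documents);

% Lemmatization maps words to dictionary forms; stemming just trims suffixes.
lemmas = normalizeWords(documents,'Style','lemma');
stems = normalizeWords(documents,'Style','stem');

string(lemmas)
string(stems)
```

Comparing the two printed arrays side by side is usually enough to decide which style fits the task.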
6. documents = erasePunctuation(documents);
Removes punctuation.
7. documents = removeStopWords(documents);
Removes stop words such as "to" and "the".
To list the built-in stop words:
words = stopWords;
reshape(words,[25 9])
Custom stop words:
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];
documents = removeWords(documents,customStopWords);
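% Displays the first five documents. The two-sentence example above has only
% two documents, so use documents(1:2) there.
documents(1:5)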
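
As a quick check of how much removeStopWords takes out, the sketch below counts tokens before and after on one sentence from the official example. The variable names nBefore and nAfter are just for illustration.

```matlab
documents = tokenizedDocument( ...
    "There is lots of damage to many car windshields in the parking lot.");

% Token count before stop-word removal.
nBefore = doclength(documents);

documents = removeStopWords(documents);

% Token count after stop-word removal.
nAfter = doclength(documents);

fprintf("%d tokens before, %d after\n", nBefore, nAfter);
string(documents)
```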
8. Remove words with 2 or fewer characters and words with 15 or more characters
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
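
Both thresholds are inclusive, which is easy to get wrong. A small sketch on a made-up sentence, where "an" and "ox" have 2 characters and "extraordinarily" has exactly 15:

```matlab
% Hypothetical sentence chosen to hit both length limits.
documents = tokenizedDocument("an ox ate the extraordinarily big apple");

% Drops words with 2 or fewer characters.
documents = removeShortWords(documents,2);

% Drops words with 15 or more characters.
documents = removeLongWords(documents,15);

remaining = string(documents)
```

With those inclusive limits, the tokens left over should be "ate", "the", "big" and "apple".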