27. MATLAB NL: Text Preprocessing with Text Analytics Toolbox

waiting不是违停

Article link: https://blog.csdn.net/weixin_44737922/article/details/105177597

Text Analytics Toolbox official documentation

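Tested on MATLAB R2019b; everything below works.

During the 2020 MCM/ICM I was on R2019a without this toolbox and had to learn Python on the spot, which cost me a lot.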

1. Text preprocessing may need to handle the following

Variations in case, for example "new" and "New"
Variations in word forms, for example "walk" and "walking"
Words which add noise, for example "stop words" such as "the" and "of"
Punctuation and special characters
HTML and XML tags
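
Each item in the list above maps onto a toolbox function. The following is a minimal sketch, assuming a made-up input string (not part of the official example) and a processing order chosen only for illustration:

```matlab
% Hypothetical input that exhibits the issues listed above.
str = "The <b>New</b> trees are blocking traffic!";

str = eraseTags(str);                        % HTML and XML tags
doc = tokenizedDocument(str);                % split the text into tokens
doc = normalizeWords(doc,'Style','lemma');   % variations in word forms
doc = erasePunctuation(doc);                 % punctuation and special characters
doc = removeStopWords(doc);                  % noise words such as "the" and "of"
doc = lower(doc);                            % variations in case

% Convert the single document to a string array of its remaining words.
words = string(doc)
```

The official example in the next section follows a similar order and additionally calls addPartOfSpeechDetails before lemmatizing.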

2. Official example code

textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = preprocessTextData(textData)
function documents = preprocessTextData(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use 
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

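3. documents = tokenizedDocument(textData);

Tokenizes the text: each string becomes a tokenizedDocument whose tokens are the individual words and punctuation marks.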
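
To inspect the tokens directly, a small sketch using the first sentence of the official example (the variable names tokens and numTokens are only illustrative):

```matlab
textData = "A large tree is downed and blocking traffic outside Apple Hill.";
documents = tokenizedDocument(textData);

% For a single document, string() returns its tokens as a row of strings.
tokens = string(documents)

% doclength returns the number of tokens in each document.
numTokens = doclength(documents)
```

Note that the final period is kept as its own token at this stage; it is only removed later by erasePunctuation.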

4. documents = addPartOfSpeechDetails(documents);

Adds part-of-speech details to the documents.

To print the token details:

tdetails = tokenDetails(documents);
head(tdetails)

Before the call, the table returned by tokenDetails has no PartOfSpeech column; after the call, it does.
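
A minimal sketch for comparing the token details before and after addPartOfSpeechDetails, using the first sentence of the official example. The variable names before, after, and hasAfter are assumptions made for this illustration:

```matlab
documents = tokenizedDocument( ...
    "A large tree is downed and blocking traffic outside Apple Hill.");

% Token details before tagging.
before = tokenDetails(documents);
head(before)

% Token details after tagging.
documents = addPartOfSpeechDetails(documents);
after = tokenDetails(documents);

% Check that the tagged table carries a PartOfSpeech column.
hasAfter = ismember("PartOfSpeech", after.Properties.VariableNames)

% Show only the token and its part-of-speech tag.
after(:, ["Token" "PartOfSpeech"])
```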

5. Reduce words to their base form (lemmatization)

documents = normalizeWords(documents,'Style','lemma');
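
To see what lemmatization changes, the tokens can be printed side by side. This is a sketch on the first sentence of the official example; the table layout and the variable names original and lemmas are only illustrative:

```matlab
documents = tokenizedDocument( ...
    "A large tree is downed and blocking traffic outside Apple Hill.");

% Part-of-speech details help the lemmatizer pick the right base form.
documents = addPartOfSpeechDetails(documents);
lemmatized = normalizeWords(documents,'Style','lemma');

% Lemmatization maps each token to one token, so the lengths match.
original = string(documents)';
lemmas = string(lemmatized)';
table(original, lemmas)
```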

6. documents = erasePunctuation(documents);

Removes punctuation.
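
A short sketch of the effect on the first sentence of the official example (the variable name cleaned is an assumption for this illustration):

```matlab
documents = tokenizedDocument( ...
    "A large tree is downed and blocking traffic outside Apple Hill.");

% The trailing "." is its own token before this step.
cleaned = erasePunctuation(documents);

% List the tokens that remain.
remaining = string(cleaned)
```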

7. documents = removeStopWords(documents);

Removes stop words such as "to" and "the".

To list the default stop words:

words = stopWords;
reshape(words,[25 9])

Custom stop words:

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];
documents = removeWords(documents,customStopWords);
% Show at most the first 5 documents; the example text above has only 2.
documents(1:min(5,numel(documents)))
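
The custom list above targets archaic English. For the two-sentence example in this post, a comparable sketch might look like the following; the extra words "lots" and "many" are chosen purely for illustration. The text is lowercased first so that capitalized words such as "There" also match the lowercase stop word list:

```matlab
documents = tokenizedDocument( ...
    "There is lots of damage to many car windshields in the parking lot.");

% Lowercase first so capitalized tokens match the lowercase word list.
documents = lower(documents);

% Extend the default list with illustrative extra words.
customStopWords = [stopWords "lots" "many"];
documents = removeWords(documents, customStopWords);

remaining = string(documents)
```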

8. Remove words with 2 or fewer characters and words with 15 or more characters

documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
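
Both length limits are inclusive, which is easy to get wrong. A sketch with a made-up sentence (not from the official example) that probes the boundaries: "an" and "ox" have 2 characters, "characteristic" has 14, and "characteristics" has 15.

```matlab
% Hypothetical tokens chosen to sit on the length boundaries.
documents = tokenizedDocument("an ox saw characteristic characteristics");

documents = removeShortWords(documents,2);   % drops words of length <= 2
documents = removeLongWords(documents,15);   % drops words of length >= 15

% Words of 3 to 14 characters should remain.
kept = string(documents)
```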
