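27. MATLAB NLP: Text Preprocessing with Text Analytics Toolbox

Official Text Analytics Toolbox documentation

Tested on R2019b; it works.

During the 2020 MCM/ICM contest I was on R2019a without this toolbox, so I had to pick up Python on the spot and it cost me a lot.

1. Text preprocessing may need to handle the following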

  • Variations in case, for example "new" and "New"

  • Variations in word forms, for example "walk" and "walking"

  • Words which add noise, for example "stop words" such as "the" and "of"

  • Punctuation and special characters

  • HTML and XML tags
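The official example in the next section deals with word forms, stop words and punctuation, but not with case or markup. A minimal sketch of those two items, assuming the raw text is a string with a few inline HTML tags (the sample sentence is made up for illustration):

```matlab
% Hypothetical raw input with inline HTML tags and mixed case.
raw = "<b>New</b> roads and new <i>Walking</i> paths";

% eraseTags strips HTML and XML tags from the string.
clean = eraseTags(raw);

% lower converts everything to lower case, so "New" and "new" become the same word.
clean = lower(clean);

% Tokenize the cleaned string as usual.
documents = tokenizedDocument(clean)
```

Doing this on the raw string before tokenizing keeps the tags from being split into separate tokens.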

2. Official example code

textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
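documents = preprocessTextData(textData)

% Local function. In R2019b, local functions in a script must appear at the
% end of the file, after the code that calls them.
function documents = preprocessTextData(textData)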

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use 
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

3. documents = tokenizedDocument(textData);

Tokenizes the text. The result is a tokenizedDocument array with one document per input string.
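To look at what tokenization produced, a short sketch using the same two sentences as the official example. It converts the first document back to a string array and counts the tokens per document:

```matlab
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = tokenizedDocument(textData);

% Tokens of the first document as a string array.
tokens = string(documents(1))

% Number of tokens in each document (punctuation marks count as tokens).
n = doclength(documents)
```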

4. documents = addPartOfSpeechDetails(documents);

Adds part-of-speech details to the documents.

To print the details for each token:

tdetails = tokenDetails(documents);
head(tdetails)

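Before the call, the table returned by tokenDetails has no PartOfSpeech column. After the call, every token carries a part-of-speech tag in that column.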
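The before/after difference can also be checked in code. A minimal sketch on a single made-up sentence, comparing the column names of the tokenDetails table before and after the call:

```matlab
documents = tokenizedDocument("A large tree is downed and blocking traffic.");

% Token details before and after adding part-of-speech information.
before = tokenDetails(documents);
documents = addPartOfSpeechDetails(documents);
after = tokenDetails(documents);

% Columns that the call added to the details table.
added = setdiff(after.Properties.VariableNames, before.Properties.VariableNames)

% Token next to its tag.
after(:, {'Token', 'PartOfSpeech'})
```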

5. Reducing words to their base form (lemmatization)

documents = normalizeWords(documents,'Style','lemma');
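normalizeWords also supports 'Style','stem', which applies the Porter stemmer instead of lemmatizing. A small sketch for comparing the two on a made-up sentence; the exact outputs depend on the toolbox version, so none are listed here:

```matlab
documents = tokenizedDocument("The trees were blocking the roads");

% Part-of-speech details help the lemmatizer pick the right base form.
documents = addPartOfSpeechDetails(documents);

% Lemmatization maps each word to a dictionary form.
lemmas = normalizeWords(documents, 'Style', 'lemma')

% Stemming chops suffixes by rule, so the result may not be a real word.
stems = normalizeWords(documents, 'Style', 'stem')
```

Both calls change the words in place and leave the number of tokens unchanged.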

 

6. documents = erasePunctuation(documents);

Removes punctuation.
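A quick way to confirm what was removed is to compare token counts before and after. A minimal sketch on a made-up sentence:

```matlab
documents = tokenizedDocument("Trees down, roads blocked!");

% Punctuation marks are separate tokens after tokenization.
noPunct = erasePunctuation(documents);

% The token count drops by the number of punctuation tokens removed.
countBefore = doclength(documents)
countAfter = doclength(noPunct)

remaining = string(noPunct)
```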

7. documents = removeStopWords(documents);

Removes stop words such as "to" and "the".

To list the built-in stop words:

words = stopWords;
reshape(words,[25 9])

Custom stop words:

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];
documents = removeWords(documents,customStopWords);
documents

8. Removing words with 2 or fewer characters and words with 15 or more characters

documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
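Both length limits are inclusive, which is easy to get wrong. A minimal sketch with made-up words chosen to sit on the boundaries: "an" has 2 characters, "ant" has 3, "recommendation" has 14 and "congratulations" has 15.

```matlab
documents = tokenizedDocument("an ant recommendation congratulations");

% Removes words with 2 or fewer characters, so "an" goes and "ant" stays.
documents = removeShortWords(documents, 2);

% Removes words with 15 or more characters, so "congratulations" goes
% and "recommendation" stays.
documents = removeLongWords(documents, 15);

kept = string(documents)
```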

 
