NLP Study Notes

戚兆禹

Category: python. Tags: NLP

Article link: https://blog.csdn.net/weixin_41018824/article/details/79257996

import urllib.request

from bs4 import BeautifulSoup
import nltk

# urllib.request is used to download the HTML content of the web page
response = urllib.request.urlopen('http://python.org/')
# read() returns the entire response body as bytes
html = response.read()
# nltk.clean_html() is no longer supported; use BeautifulSoup (bs4) to strip the markup
clean = BeautifulSoup(html, "lxml").get_text()
# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand
tokens = nltk.word_tokenize(clean)
print(tokens[:100])
# Count how often each token occurs
Freq_dist_nltk = nltk.FreqDist(tokens)
print(Freq_dist_nltk)
for k, v in Freq_dist_nltk.items():
    print(str(k) + ':' + str(v))
# Plot the frequency distribution of the 50 most common tokens
Freq_dist_nltk.plot(50, cumulative=False)
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
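# Option 1: load stop words from a file (you need to obtain english.stop.txt yourself)
# with open("PATH/english.stop.txt") as f:
#     stopwords = [word.strip().lower() for word in f]
# Option 2: a hand-written list, lowercased because tokens are lowercased before the comparison
stopwords = [w.lower() for w in ['Welcome', '@', 'http', 'and', 'of']]
# Drop one-character tokens and stop words
clean_tokens = [tok for tok in tokens if len(tok) > 1 and tok.lower() not in stopwords]
Freq_dist_nltk = nltk.FreqDist(clean_tokens)
# Plot the 5 most common remaining tokens
Freq_dist_nltk.plot(5, cumulative=False)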
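
The script above needs network access, NLTK data, and a plotting backend. As a rough offline sketch of the same tokenize, filter, and count steps, the snippet below uses only the standard library. The regex tokenizer is a crude stand-in for nltk.word_tokenize (an assumption, not NLTK's algorithm), and sample_text and the helper names tokenize and filter_tokens are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens and single punctuation marks.

    A crude stand-in for nltk.word_tokenize, not the same algorithm.
    """
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


def filter_tokens(tokens, stopwords):
    """Drop one-character tokens and stop words (case-insensitive)."""
    stop = {w.lower() for w in stopwords}
    return [t for t in tokens if len(t) > 1 and t.lower() not in stop]


# Hypothetical sample text, used instead of downloading a page
sample_text = "Welcome to Python.org: the home of Python and of the Python community."

tokens = tokenize(sample_text)
clean_tokens = filter_tokens(tokens, ['Welcome', '@', 'http', 'and', 'of'])

# Counter plays the role of nltk.FreqDist here
freq = Counter(clean_tokens)
for word, count in freq.most_common(3):
    print(f"{word}:{count}")
```

Lowercasing both the stop-word list and each token before comparing is what lets an entry such as 'Welcome' actually match; comparing a lowercased token against a capitalized stop word would never filter it out.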
