Python series (4): Scraping and sentiment analysis of Douban book reviews of 《平凡的世界》 (Ordinary World)

This post scrapes the short reviews of the Douban book 《平凡的世界》, runs some simple analysis on them, and then applies SnowNLP for sentiment analysis.

The key module is SnowNLP, a Python library for processing Chinese text that includes sentiment analysis (installable with pip install snownlp).

1. Scraping the data

We store the scraped data in SQLite. First, create the table:

CREATE TABLE comment(
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    commentator VARCHAR(50) NOT NULL,
    star INTEGER NOT NULL,
    time VARCHAR(50) NOT NULL,
    content TEXT NOT NULL
);
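If you prefer to do this from Python rather than the sqlite3 shell, an equivalent one-time setup (a minimal sketch, using the same schema and database file as the rest of the post) is:

import sqlite3

# one-time setup: create spider.db with the comment table above
conn = sqlite3.connect("spider.db")
conn.execute("""CREATE TABLE IF NOT EXISTS comment(
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    commentator VARCHAR(50) NOT NULL,
    star INTEGER NOT NULL,
    time VARCHAR(50) NOT NULL,
    content TEXT NOT NULL
)""")
conn.commit()
conn.close()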

Then write the Python scraping code:

import time
import urllib3
import numpy as np
import sqlite3
from bs4 import BeautifulSoup

# a few User-Agent strings to rotate between requests
headers = [{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},
           {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},
           {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'}]

def get_comment():
    page_num = 0
    while True:
        page_num += 1
        url = "https://book.douban.com/subject/1200840/comments/hot?p=" + str(page_num)
        print(url)
        http = urllib3.PoolManager()
        time.sleep(np.random.rand() * 5)  # random delay so we don't hammer the server
        try:
            r = http.request("GET", url, headers=headers[page_num % len(headers)])
            plain_text = r.data.decode()
        except Exception as e:
            print(e)
            continue
        soup = BeautifulSoup(plain_text, features="lxml")
        ligroup = soup.find_all("li", class_="comment-item")
        for item in ligroup:
            try:
                commentator = item.find("div", class_="avatar").a.get("title")
                spanlists = list(item.find("div", class_="comment").find("span", class_="comment-info"))
                while "\n" in spanlists:
                    spanlists.remove("\n")
                if len(spanlists) == 3:   # the reviewer gave a star rating
                    stars = switch_case(spanlists[1].get("title"))
                    commenttime = spanlists[2].string
                else:                     # no rating given: record 0 stars
                    stars = 0
                    commenttime = spanlists[1].string
                content = item.find("span", class_="short").get_text()
                add_comment(commentator, stars, commenttime, content)
            except Exception as e:
                print(e)
                continue
        if page_num > 999:
            break

def switch_case(value):
    # map Douban's Chinese rating labels to 5..1 stars
    switcher = {"力荐": 5, "推荐": 4, "还行": 3, "较差": 2, "很差": 1}
    return switcher.get(value, 0)

def add_comment(commentator, star, time, content):
    conn = sqlite3.connect("spider.db")
    cursor = conn.cursor()
    cursor.execute("insert into comment values (null, ?, ?, ?, ?)", (commentator, star, time, content))
    cursor.close()
    conn.commit()
    conn.close()

After scraping, you can check the data in the table:

sqlite> select count(1) from comment;
8302
sqlite> select star, count(1) from comment group by star;
0|1359
1|58
2|133
3|643
4|1875
5|4234
sqlite> select * from comment order by id desc limit 5;
8302|燊栎|4|2014-11-19|经典中的经典
8301|Jerryhere|4|2016-03-08|平凡中的不平凡
8300|麦田睡觉者|5|2012-08-12|这部小说是我上大学看的第一本小说,它带给我的震撼是无与伦比的。彻底将我从高中时看的那些yy小说里震醒。同时,它真的是一部非常好看的小说,平凡的世界里不平凡的人生,生命总是充满苦痛伤悲,这些苦难让生命愈发的沉重厚实
8299|朔望|0|2013-07-29|人生就是如此平凡
8298|mindinthesky|0|2012-09-17|不错,中国就是这样子

2. Simple analysis

With the data scraped, we start with a simple analysis: what share of the comments each star rating accounts for. A sketch of the plotting code follows.
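The original post shows the distribution as an image; a minimal sketch that reads the counts from spider.db and draws a pie chart (plot_star_distribution is a hypothetical helper, not part of the original code) might look like this:

import sqlite3
import matplotlib.pyplot as plt

def plot_star_distribution():
    # count reviews per star rating (0 means the reviewer gave no rating)
    conn = sqlite3.connect("spider.db")
    rows = conn.execute("select star, count(1) from comment group by star").fetchall()
    conn.close()
    labels = [f"{star} star" for star, _ in rows]
    counts = [count for _, count in rows]
    plt.pie(counts, labels=labels, autopct="%1.1f%%")
    plt.title("Star rating distribution")
    plt.show()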

Next, export all the comments to a file for word-cloud analysis.

Export:

>sqlite3 -header -csv spider.db "select content from comment;" > commentall.csv

This produces commentall.csv in the current directory; rename it to commentall.txt, which is the filename the code below reads.

Then generate the word cloud:

def make_cloud():
    text = open('commentall.txt', 'r', encoding='utf-8').read()
    cut_text = jieba.cut(text)           # segment the Chinese text into words
    result = " ".join(cut_text)
    wc = WordCloud(
        font_path='Deng.ttf',            # font path (a Chinese font is required)
        background_color='white',
        width=2000,
        height=1200,
        max_font_size=100,
        min_font_size=10,
        mask=plt.imread('timg.jpeg'),    # mask image that shapes the cloud
        max_words=1000)
    wc.generate(result)
    wc.to_file('jielun.png')             # save the image
    plt.figure('jielun')                 # window title
    plt.imshow(wc)
    plt.axis('off')                      # hide the axes
    plt.show()

The word cloud generated by this analysis is shown below:

3. Sentiment analysis

Five-star and one-star ratings make the sentiment obvious, but reviews with no rating (star = 0) are harder to judge, so we use SnowNLP on those. Before we can analyze, we have to train our own model: SnowNLP's bundled model was trained on e-commerce reviews and does not fit this domain. How do we train it?

First, export the 5-star comments to pos.txt as positive samples and the 1-star comments to neg.txt as negative samples;

Then, train on pos.txt and neg.txt;

Finally, use the trained model to analyze the 0-star comments.

OK, let's get started.

First, the export:

>sqlite3 -header -csv spider.db "select content from comment where star = 5 limit 100;" > pos.csv
>sqlite3 -header -csv spider.db "select content from comment where star = 1 limit 100;" > neg.csv

Then find where snownlp is installed:

kumufengchunMacBook-Pro:douban kumufengchun$ python
Python 3.6.4 (default, Jun  6 2019, 17:59:50)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import snownlp
>>> snownlp
<module 'snownlp' from '/Users/kumufengchun/.pyenv/versions/3.6.4/lib/python3.6/site-packages/snownlp/__init__.py'>

Once you have the path, copy pos.csv and neg.csv into /Users/kumufengchun/.pyenv/versions/3.6.4/lib/python3.6/site-packages/snownlp/sentiment/, replacing the pos.txt and neg.txt there; to keep the names consistent, change the .csv extension to .txt. A sketch of the copy step is shown below.
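For example, the copy can be done from Python (a sketch; the destination is the install path found above):

import shutil

# copy the exported samples into snownlp's sentiment data directory,
# renaming .csv to .txt as described above
dst = "/Users/kumufengchun/.pyenv/versions/3.6.4/lib/python3.6/site-packages/snownlp/sentiment/"
shutil.copy("pos.csv", dst + "pos.txt")
shutil.copy("neg.csv", dst + "neg.txt")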

Now train on the exported data; the trained model is saved as commentsentiment.marshal:

from snownlp import sentiment

sentiment.train('neg.txt', 'pos.txt')
sentiment.save('commentsentiment.marshal')

Then point snownlp at the new model in the sentiment package's __init__.py. Training actually writes commentsentiment.marshal.3 (the .3 suffix is just a version tag and can be ignored), but the reference must omit the .3 or loading will fail. Edit the file as follows:

/Users/kumufengchun/.pyenv/versions/3.6.4/lib/python3.6/site-packages/snownlp/sentiment/__init__.py

data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),'commentsentiment.marshal')

With training done, we can run a quick test:

from snownlp import SnowNLP

s = SnowNLP("好很好")
print(s.words)
print(s.tags)
print(s.sentiments)

Output:

['好', '很', '好']

0.6088772592136402

4. Sentiment analysis with the trained model

The code:

def get_comment_bypage(offset, limit):
    conn = sqlite3.connect("spider.db")
    cursor = conn.cursor()
    cursor.execute('select content from comment where star=0 limit ?,?', (offset, limit))
    values = cursor.fetchall()
    cursor.close()
    conn.close()
    return values

def analyse_sentiment():
    offset = 0
    limit = 50
    commentcounts = {}
    while offset < 1400:
        comments = get_comment_bypage(offset, limit)
        for comment in comments:
            s = SnowNLP(''.join(comment))
            print(s.sentiments)
            sentiment = round(s.sentiments, 2)   # bucket scores to two decimals
            if sentiment in commentcounts:
                commentcounts[sentiment] += 1
            else:
                commentcounts[sentiment] = 1
        offset += limit
    print(commentcounts)
    return commentcounts

Then we chart all of the scores; a sketch of the plotting code is given below.
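The original post shows the chart as an image; a minimal sketch that plots the commentcounts dict returned by analyse_sentiment (plot_sentiment_counts is a hypothetical helper, not in the original code):

import matplotlib.pyplot as plt

def plot_sentiment_counts(commentcounts):
    # sort the 0.00-1.00 score buckets so the x-axis is ordered
    scores = sorted(commentcounts)
    counts = [commentcounts[s] for s in scores]
    plt.bar(scores, counts, width=0.008)
    plt.xlabel('sentiment score')
    plt.ylabel('number of comments')
    plt.show()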

The chart shows how many comments fall into each score bucket. How do we decide positive versus negative? A common choice is a cutoff of 0.3: scores above 0.3 count as positive, the rest as negative. You could also run the already-labeled data through the model and calibrate the interval yourself.
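As a concrete illustration of that cutoff (0.3 is this post's chosen threshold, not anything built into SnowNLP):

from snownlp import SnowNLP

def classify(text, threshold=0.3):
    # positive if the model's sentiment score exceeds the cutoff
    return 'positive' if SnowNLP(text).sentiments > threshold else 'negative'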

The complete code:

import time
import urllib3
import numpy as np
import sqlite3
from bs4 import BeautifulSoup
from snownlp import SnowNLP
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud

# a few User-Agent strings to rotate between requests
headers = [{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},
           {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},
           {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'}]

def get_comment():
    page_num = 0
    while True:
        page_num += 1
        url = "https://book.douban.com/subject/1200840/comments/hot?p=" + str(page_num)
        print(url)
        http = urllib3.PoolManager()
        time.sleep(np.random.rand() * 5)  # random delay so we don't hammer the server
        try:
            r = http.request("GET", url, headers=headers[page_num % len(headers)])
            plain_text = r.data.decode()
        except Exception as e:
            print(e)
            continue
        soup = BeautifulSoup(plain_text, features="lxml")
        ligroup = soup.find_all("li", class_="comment-item")
        for item in ligroup:
            try:
                commentator = item.find("div", class_="avatar").a.get("title")
                spanlists = list(item.find("div", class_="comment").find("span", class_="comment-info"))
                while "\n" in spanlists:
                    spanlists.remove("\n")
                if len(spanlists) == 3:   # the reviewer gave a star rating
                    stars = switch_case(spanlists[1].get("title"))
                    commenttime = spanlists[2].string
                else:                     # no rating given: record 0 stars
                    stars = 0
                    commenttime = spanlists[1].string
                content = item.find("span", class_="short").get_text()
                add_comment(commentator, stars, commenttime, content)
            except Exception as e:
                print(e)
                continue
        if page_num > 999:
            break

def switch_case(value):
    # map Douban's Chinese rating labels to 5..1 stars
    switcher = {"力荐": 5, "推荐": 4, "还行": 3, "较差": 2, "很差": 1}
    return switcher.get(value, 0)

def add_comment(commentator, star, time, content):
    conn = sqlite3.connect("spider.db")
    cursor = conn.cursor()
    cursor.execute("insert into comment values (null, ?, ?, ?, ?)", (commentator, star, time, content))
    cursor.close()
    conn.commit()
    conn.close()

def get_comment_bypage(offset, limit):
    conn = sqlite3.connect("spider.db")
    cursor = conn.cursor()
    cursor.execute('select content from comment where star=0 limit ?,?', (offset, limit))
    values = cursor.fetchall()
    cursor.close()
    conn.close()
    return values

def analyse_sentiment():
    offset = 0
    limit = 50
    commentcounts = {}
    while offset < 1400:
        comments = get_comment_bypage(offset, limit)
        for comment in comments:
            s = SnowNLP(''.join(comment))
            print(s.sentiments)
            sentiment = round(s.sentiments, 2)   # bucket scores to two decimals
            if sentiment in commentcounts:
                commentcounts[sentiment] += 1
            else:
                commentcounts[sentiment] = 1
        offset += limit
    print(commentcounts)
    return commentcounts

def make_cloud():
    text = open('commentall.txt', 'r', encoding='utf-8').read()
    cut_text = jieba.cut(text)           # segment the Chinese text into words
    result = " ".join(cut_text)
    wc = WordCloud(
        font_path='Deng.ttf',            # font path (a Chinese font is required)
        background_color='white',
        width=2000,
        height=1200,
        max_font_size=100,
        min_font_size=10,
        mask=plt.imread('timg.jpeg'),    # mask image that shapes the cloud
        max_words=1000)
    wc.generate(result)
    wc.to_file('jielun.png')             # save the image
    plt.figure('jielun')                 # window title
    plt.imshow(wc)
    plt.axis('off')                      # hide the axes
    plt.show()

if __name__ == '__main__':
    get_comment()
    analyse_sentiment()
    make_cloud()
