Scraping Douban Reviews of 《白夜追凶》 (Day and Night) with Python and Segmenting Them into Words

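Recently, on the recommendation of many friends, I started binge-watching the web series 《白夜追凶》 (Day and Night). I had not watched a Chinese domestic drama since 《琅琊榜》 (Nirvana in Fire), and this one really is a quality production! I kept following it, and with nothing to do on the last two days of the National Day holiday, I scraped its Douban reviews to take a look.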

The code has been pushed to GitHub.

My Python projects on GitHub: https://github.com/bytename/learnPy

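# -*- coding: utf-8 -*-
from collections import Counter

import jieba
import requests
from lxml import etree

# Browser-like request headers copied from a Chrome session.
# "br" is left out of Accept-Encoding: requests can only decode Brotli
# responses when an extra Brotli package is installed.
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Connection": "keep-alive",
    "Host": "movie.douban.com",
    "Referer": "https://movie.douban.com/subject/26883064/reviews?start=20",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}


def getPageNum(url):
    """Return the number of review pages, read from the last paginator link."""
    if url:
        req = requests.get(url, headers=header)
        html = etree.HTML(req.text)
        pageNum = html.xpath(u"//div[@class='paginator']/a[last()]/text()")[0]
        # The XPath result is text; convert it once so callers get an int.
        return int(pageNum)


def getContent(url):
    """Return the short review texts found on one listing page."""
    if not url:
        return []
    req = requests.get(url, headers=header)
    html = etree.HTML(req.text)
    data = html.xpath(u"//div[@class='short-content']/text()")
    return data


def getUrl(pageNum):
    """Build the URL of every listing page; each page holds 20 reviews."""
    dataUrl = []
    # range() excludes its end point, so add 1 to include the last page.
    for i in range(1, int(pageNum) + 1):
        url = "https://movie.douban.com/subject/26883064/reviews?start=%d" % ((i - 1) * 20)
        dataUrl.append(url)
    return dataUrl


if __name__ == '__main__':
    url = "https://movie.douban.com/subject/26883064/reviews?start=0"
    pageNum = getPageNum(url)
    data = getUrl(pageNum)
    datas = []
    for u in data:
        for d in getContent(u):
            # jieba.cut yields the segmented words of one review snippet.
            for i in jieba.cut(d):
                # Keep only words longer than one character.
                if len(i.strip()) > 1:
                    datas.append(i.strip())
    # Count every word in one pass, then keep words that occur more than once.
    dic = dict()
    for word, count in Counter(datas).items():
        if count > 1:
            dic[word] = count
    # A single parenthesized argument prints the same on Python 2 and 3.
    for key, values in dic.items():
        print("%s===%d" % (key, values))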
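The listing pages are addressed through the `start` query parameter, which is a review offset rather than a page number: with 20 reviews per page, page 1 starts at 0, page 2 at 20, and so on. A minimal sketch of that arithmetic follows. `build_page_urls` and `BASE_URL` are hypothetical names used only for illustration, and the page size of 20 is taken from the URL formula in the script above.

```python
# Hypothetical constant: the review listing URL used in the script above.
BASE_URL = "https://movie.douban.com/subject/26883064/reviews"


def build_page_urls(page_count, page_size=20):
    """Return one URL per listing page, using `start` as the review offset."""
    # Page index 0 maps to start=0, index 1 to start=20, and so on.
    return ["%s?start=%d" % (BASE_URL, page * page_size)
            for page in range(page_count)]


urls = build_page_urls(3)
for u in urls:
    print(u)
```

Counting page indexes from 0 makes it harder to drop the last page by accident, since `range(page_count)` covers exactly `page_count` pages.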

The reviews were scraped, segmented into words, and counted. The output:

C:\Anaconda2\python.exe D:/PycharmProjects/LearnPy/lesson01/SpriderDouBan.py

Building prefix dict from the default dictionary ...

Loading model from cache c:\users\rc\appdata\local\temp\jieba.cache

Loading model cost 0.379 seconds.

Prefix dict has been built succesfully.

结合体===2

星期一===2

出来===21

第二===2

还要===3

应该===28

刘副队===3

案件===33

发生===7

成分===3

诚然===2

惊喜===7

两天===5

正常===10

全剧===4

看似===2

关系===5

坐等===2

仿佛===2

有理有据===2
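The counts above come out in dictionary order, so frequent words such as 案件 are scattered among words that appear only twice. One possible way to list the most frequent words first is `collections.Counter.most_common()`. The sketch below runs as-is on Python 3, and its word list is a small hypothetical sample standing in for the scraped data.

```python
from collections import Counter

# Hypothetical sample of segmented words, standing in for the scraped data.
words = ["案件", "应该", "案件", "出来", "应该", "案件", "惊喜"]

counts = Counter(words)

# most_common() returns (word, count) pairs sorted by descending count.
# Words that occur only once are dropped, as in the script above.
ranked = [(w, c) for w, c in counts.most_common() if c > 1]

for w, c in ranked:
    print("%s===%d" % (w, c))
```

Sorting this way keeps the same `word===count` output format while putting the most discussed words first.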
