Scraping NetEase dynamic comments with BeautifulSoup in Python

guang_mang

Categories: BeautifulSoup in Python, regular expressions in Python. Tags: BeautifulSoup, python, html, regular expressions, json

Article link: https://blog.csdn.net/guang_mang/article/details/53746768

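1. NetEase serves its dynamic comments as JSON, so the first step is to find the URL of that JSON resource. Press F12 to open the browser's developer tools and look through the requests the page makes.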

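2. On the NetEase comment page the relevant requests are hotlist (hottest comments) and newList (latest comments). The stated goal here is the hottest comments, although the script below requests newList.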

3. Open that link in the browser, and the raw response is displayed as a page of JSON data.

4. From there, the key/value structure of the JSON can be used to pull out whatever information you want.
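
As a minimal offline sketch of step 4, the snippet below unwraps a JSONP response and reads fields from it. The sample payload is invented for illustration (its field names mirror the ones the script in this article reads; a real response may differ), and `strip_jsonp` is a hypothetical helper name, not part of any library.

```python
import json
import re

# Hypothetical sample shaped like the fields this article reads; not real data.
SAMPLE = ('getData({"commentIds": ["1000000001,1000000002"], '
          '"comments": {'
          '"1000000001": {"content": "first", "createTime": "2016-11-22 10:00:00", '
          '"user": {"nickname": "a", "location": "Beijing"}}, '
          '"1000000002": {"content": "second", "createTime": "2016-11-22 10:05:00", '
          '"user": {"nickname": "b", "location": "Shanghai"}}'
          '}});')


def strip_jsonp(text, callback='getData'):
    """Return the JSON text inside callback(...); (hypothetical helper)."""
    pattern = r'^\s*' + re.escape(callback) + r'\((.*)\)\s*;?\s*$'
    match = re.match(pattern, text, re.S)
    if match is None:
        raise ValueError('not a JSONP response for callback %r' % callback)
    return match.group(1)


data = json.loads(strip_jsonp(SAMPLE))
rows = []
for id_chain in data['commentIds']:
    # Each entry may chain several 10-digit comment IDs; each ID is a key
    # into data['comments'].
    for comment_id in re.findall(r'\d{10}', id_chain):
        comment = data['comments'][comment_id]
        rows.append((comment['user']['nickname'], comment['content']))
```

Matching the whole wrapper in one anchored pattern avoids accidentally deleting a `);` that happens to appear inside a comment's text.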

# coding: utf-8
import json  # required to parse the response body
import re
import urllib.request

# Comment API URL found with F12; only the offset changes between pages.
URL = ('http://comment.news.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/C6BUSTPO000187VI/'
       'comments/newList?offset=%d&limit=30&showLevelThreshold=72&headLimit=1&tailLimit=2&callback=getData&ibc=news'
       'pc&_=1479812321476')


def getpage(pages=2):
    # This scrapes the latest comments (newList), which span several pages.
    # Clicking "next page" fetches another newList response; the offset is
    # assumed to advance by the page size (limit=30).
    for z in range(pages):
        page = urllib.request.urlopen(URL % (z * 30))
        yield page.read().decode('utf-8')


def getItems(html, w):
    # Strip the "getData(" head and ");" tail first, so that what remains
    # is a plain dictionary of keys and values.
    data = re.sub(r'^\s*getData\(', '', html)
    data = re.sub(r'\);?\s*$', '', data)
    data = json.loads(data)
    for i in data['commentIds']:
        # Pull the 10-digit IDs out of each entry, then use each ID as the
        # key to look up its value in data['comments'].
        for n in re.findall(r'\d{10}', i):
            try:
                comment = data['comments'][n]
                fields = [comment['user']['nickname'], comment['content'],
                          comment['user']['location'], comment['createTime']]
                w.write('|'.join(str(f) for f in fields) + '|\n')
            except KeyError:
                # A missing field or comment is recorded as "null".
                w.write('null\n')


# Open the output file as UTF-8 so the Chinese text is written correctly.
with open('wypinglun.text', 'w', encoding='utf-8') as w:
    for html in getpage():
        getItems(html, w)
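
Fetching several pages depends on how `offset` changes between requests. The sketch below builds one URL per page, assuming `offset` counts comments and so advances by `limit` each page (suggested by `limit=30` in the URL, but not verified against the API). `build_url` is a hypothetical helper; the other query parameters are copied unchanged from the URL above.

```python
from urllib.parse import urlencode

BASE = ('http://comment.news.163.com/api/v1/products/'
        'a2869674571f77b5a0867c3d71db5856/threads/C6BUSTPO000187VI/comments/newList')


def build_url(page, limit=30):
    """Build the newList URL for a zero-based page number (hypothetical helper)."""
    params = [
        ('offset', page * limit),  # assumption: offset counts comments, not pages
        ('limit', limit),
        ('showLevelThreshold', 72),
        ('headLimit', 1),
        ('tailLimit', 2),
        ('callback', 'getData'),
        ('ibc', 'newspc'),
        ('_', '1479812321476'),
    ]
    return BASE + '?' + urlencode(params)


urls = [build_url(page) for page in range(3)]
```

Building the query with `urlencode` keeps the long URL readable and makes the one changing parameter explicit.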