Preface
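I have recently been working on opinion mining, where the opinions come from online comment data, so the first source that came to mind was news comments. A hot news story can collect more than ten thousand comments in a single night, so analyzing and making good use of them is an interesting problem. Opinion mining is my research goal, but to tackle it properly I first have to solve the data-source problem, so I decided to crawl the comment data from Tencent News. Below I walk through that process, which turned out to be quite fun.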
Why crawl Tencent News
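I read through many technical posts on crawling news data and found that, besides Tencent, most of them cover Sina and NetEase, but the request URLs of those two are not easy to reverse-engineer. Tencent News is somewhat simpler, and a preliminary analysis showed that the requests can be constructed programmatically to fetch the comment data. Let's start with an example link, which I also found online.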
http://coral.qq.com/article/1004703995/comment?commentid=0&reqnum=20&tag=&callback=mainComment&_=1389623278900
The link carries quite a few parameters. Here is what each of them means:
http://coral.qq.com/article/<comment page ID, i.e. cmt_id>/comment?commentid=<starting comment ID>&reqnum=<number of comments to return>&tag=&callback=mainComment&_=<timestamp + 3-digit random integer>
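As a minimal sketch of how such a request URL could be assembled in Python, assuming the parameter layout described above still holds: the helper names build_comment_url and strip_jsonp are hypothetical and are not part of any Tencent API. The second helper assumes the response is wrapped in the callback name passed in the URL and simply removes that wrapper before parsing the JSON.

```python
import json
import random
import time
from urllib.parse import urlencode

# Endpoint pattern taken from the example link; cmt_id is the comment page ID.
BASE_URL = "http://coral.qq.com/article/{cmt_id}/comment"


def build_comment_url(cmt_id, comment_id="0", reqnum=20,
                      timestamp=None, suffix=None):
    """Build a comment request URL (hypothetical helper).

    timestamp and suffix can be passed in to get a reproducible URL;
    by default the current Unix time in seconds and a random 3-digit
    integer are used.
    """
    if timestamp is None:
        timestamp = int(time.time())
    if suffix is None:
        suffix = random.randint(0, 999)
    params = [
        ("commentid", comment_id),   # ID to start from, "0" for the first page
        ("reqnum", reqnum),          # number of comments to return
        ("tag", ""),
        ("callback", "mainComment"),
        ("_", "{}{:03d}".format(timestamp, suffix)),
    ]
    return BASE_URL.format(cmt_id=cmt_id) + "?" + urlencode(params)


def strip_jsonp(text, callback="mainComment"):
    """Remove an assumed callback(...) wrapper and parse the JSON inside."""
    text = text.strip().rstrip(";")
    prefix = callback + "("
    if not (text.startswith(prefix) and text.endswith(")")):
        raise ValueError("response is not wrapped in " + callback + "(...)")
    return json.loads(text[len(prefix):-1])
```

Calling build_comment_url(1004703995, timestamp=1389623278, suffix=900) reproduces the example link. To page through comments, one plausible approach is to pass the "last" ID from a response as the next comment_id, though that is an assumption to verify against real responses.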
The random value at the end does not really matter. Opening the link and capturing one of the comment records, the returned data looks like this:
mainComment({"errCode":0,"data":{"targetid":1004703995,"display":1,"total":14000,"reqnum":20,"retnum":20,"maxid":"5990116449200978034","first":"5990116449200978034","last":"5840477226068943893","hasnext":true,"commentid":[{"id":"5990116449200978034","rootid":"0","targetid":1004703995,"parent":"0","timeDifference":"04\u670804\u65e5 21:44:12","time":1428155052,"content":"\u65e9\u8be5\u7528\u56fd\u4ea7\u7684\u8f66\u4e86\uff0c\u7279\u522b\u662f\u7ea2\u65d7\u8001\u724c\u5b50\uff0c\u6240\u6709\u7684\u516c\u8f66\u5e94\u8be5\u90fd\u7528\u56fd\u4ea7\u7684\uff0c\u4f60\u770b\u97e9\u56fd\u4eba\u6240\u6709\u7528\u7684\u90fd\u4ee5\u56fd\u4ea7\u4e3a\u4e3b","title":"","up":"0","rep":"0","type":"1","hotscale":"0","checktype":"1","checkstatus":"1","isdeleted":"0","tagself":"","taghost":"","source":"2","location":"","address":"","rank":"-1","custom":"","extend":{"at":0,"ut":0},"orireplynum":"0","richtype":0,"userid":"171498810","poke":0,"