The first prerequisite for Weibo sentiment analysis is obtaining the raw post data, so I have recently been crawling Weibo. Yesterday I hit a problem: when scraping the posts of a user at a given URL, viewing the page source only yields 15 posts. The rest are displayed with a delay; a page is not filled in one pass but by two additional loads triggered by the scrollbar. Sina implements this with the lazy_load routine in its own STK library. Packet capture shows that the two dynamic loads require the following parameters:
The annotated parameter list is shown first, followed by the code that fetches the lazy-loaded posts. That module is imported in main.py, and the relevant part of the main function is shown after the class.
body = {
    '__rnd':    the time this page is requested, a 13-digit integer (milliseconds)
    '_k':       the time of the first visit to this blogger's page in the current login session, a 16-digit integer
    '_t':       0
    'count':    15 on the second and third requests, 45 on the first
    'end_id':   the mid of the newest post on the page
    'max_id':   the mid of the oldest post already loaded, i.e. the one just above the lazy-load point
    'page':     the page number being requested
    'pagebar':  0 on the second request, 1 on the third; absent on the first
    'pre_page': the current page number on the second and third requests, the previous page number on the first
    'uid':      the blogger's uid
}
end_id and _k do not need to be refreshed while browsing different pages of the same blogger in one session.
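Assembling these parameters into a lazy-load query string can be sketched as follows. This is Python 3 syntax for brevity (the crawler itself is Python 2, where `urllib.urlencode` is used instead), and every value below is a hypothetical placeholder, with the mids taken from the commented-out examples in the crawler code:

```python
import time
from urllib.parse import urlencode  # Python 2: urllib.urlencode

# Hypothetical values for illustration only.
body = {
    '__rnd': str(int(time.time() * 1000)),  # 13-digit millisecond timestamp
    '_k': '1394011234567890',               # 16-digit first-visit timestamp
    '_t': '0',
    'count': '15',       # 15 for the two lazy-load requests, 45 for the first
    'end_id': '3490160379905732',
    'max_id': '3487344294660278',
    'page': '1',
    'pagebar': '0',      # 0 for the second request, 1 for the third
    'pre_page': '1',
    'uid': '1234567890',
}

# The encoded string is what gets appended to the page URL.
query = urlencode(body)
```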
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib2
import sys
import time

reload(sys)
sys.setdefaultencoding('utf-8')

class getWeiboPage:
    # Request parameters; a class-level dict shared and mutated by every method.
    body = {
        '__rnd': '',
        '_k': '',
        '_t': '0',
        'count': '45',
        'end_id': '',
        'max_id': '',
        'page': 1,
        'pagebar': '',
        'pre_page': '0',
        'uid': ''
    }
    uid_list = []
    charset = 'utf8'

    def get_msg(self, uid):
        getWeiboPage.body['uid'] = uid
        url = self.get_url(uid)
        self.get_firstpage(url)
        self.get_secondpage(url)
        self.get_thirdpage(url)

    def get_firstpage(self, url):
        # Initial page load: count stays at 45, pre_page is the previous page.
        getWeiboPage.body['pre_page'] = getWeiboPage.body['page'] - 1
        # '&' separator added: the base URL already carries '?from=feed&loc=nickname'.
        url = url + '&' + urllib.urlencode(getWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output1/text1.txt', text)

    def get_secondpage(self, url):
        # First lazy-load request: count drops to 15, pagebar is 0.
        getWeiboPage.body['count'] = '15'
        # getWeiboPage.body['end_id'] = '3490160379905732'
        # getWeiboPage.body['max_id'] = '3487344294660278'
        getWeiboPage.body['pagebar'] = '0'
        getWeiboPage.body['pre_page'] = getWeiboPage.body['page']
        url = url + '&' + urllib.urlencode(getWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output1/text1.txt', text)

    def get_thirdpage(self, url):
        # Second lazy-load request: pagebar switches to 1.
        getWeiboPage.body['count'] = '15'
        getWeiboPage.body['pagebar'] = '1'
        getWeiboPage.body['pre_page'] = getWeiboPage.body['page']
        url = url + '&' + urllib.urlencode(getWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output1/text1.txt', text)

    def get_url(self, uid):
        url = 'http://weibo.com/u/' + uid + '?from=feed&loc=nickname'
        return url

    def get_uid(self, filename):
        fread = open(filename)
        for line in fread:
            getWeiboPage.uid_list.append(line)
            print line
            time.sleep(1)

    def writefile(self, filename, content):
        fw = open(filename, 'a')
        fw.write(content)
        fw.close()
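One thing to note about the class above: body is a class attribute, so all three page methods mutate the same dict. Once get_secondpage sets 'count' to '15', any later first-page request will also send count=15 rather than 45 unless it is reset. A stripped-down sketch of that sharing (the names here are illustrative, not from the crawler):

```python
class Pager(object):
    # Class-level dict: shared by every instance and every method.
    body = {'count': '45', 'pagebar': ''}

    def second_page(self):
        # Mutates the shared dict, not a per-instance copy.
        Pager.body['count'] = '15'
        Pager.body['pagebar'] = '0'

p = Pager()
p.second_page()
# The mutation is visible through the class and through any other instance.
leftover = Pager.body['count']
```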
# main.py -- imports needed by the code below
import fileinput
import os
import re
import sys
import time
import getWB_Page
# Page_init, do_PersonInfo and do_WBList are parsing helpers defined
# elsewhere in the project and imported here as well.

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding("utf-8")
    WBmsg = getWB_Page.getWeiboPage()
    # parameters
    listpath = 'D:\\Python27\\wbloaddatas\\URLlist.txt'
    LastPage = 2
    dir = 'D:\\Python27\\wbloaddatas\\output1\\text1.txt'
    file_test = open('D:\\Python27\\wbloaddatas\\weibodata\\'+str(time.strftime('%Y_%m_%d_%H_%M',time.localtime(time.time())))+'.txt', 'a')
    for user_url in fileinput.input(listpath):
        try:
            # added by li 15/03/11
            print 'do with ' + user_url
            #user_url = user_url.rstrip('\n')   # <-- the fix: uncomment this line
            for page in xrange(1, LastPage + 1):
                print 'page:' + str(page) + '\n'
                WBmsg.body['page'] = page
                WBmsg.body['pre_page'] = str(page - 1)
                WBmsg.get_firstpage(user_url)
                WBmsg.get_secondpage(user_url)
                WBmsg.get_thirdpage(user_url)
            f = open('D:\\Python27\\wbloaddatas\\output1\\text1.txt', 'r')
            doc = f.read()
            doc = Page_init(doc)
            #fr2 = open('E:\\992.txt', 'w')
            #fr2.write(doc)
            perInfo = do_PersonInfo(doc)
            midfor = re.compile(r'tbinfo="ouid=(.*?)<div node-type="feed_list_repeat" class="WB_feed_repeat S_bg1" style="display:none;"></div>\\n </div>')
            rmidfor = midfor.findall(doc)
            for x in rmidfor:
                do_WBList(x, file_test, perInfo)
            f.close()
            os.remove(dir)
            time.sleep(1)
        except Exception, e:
            # Log the failure and the offending URL, then move on.
            file_log = open('runLog.txt', 'a')
            file_log.write(str(e))
            file_log.write('\n' + user_url + '\n')
            file_log.write(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '\n')
            file_log.close()
        time.sleep(5)
    file_test.close()
    print 'catch succeeded!\n'
When I ran it, something was wrong: only the last URL in the URLlist.txt file was ever crawled. It frustrated me for a whole day without finding the cause; the next day I suddenly realized it was the carriage return at the end of each line. The quickest and simplest way to process a text file line by line in Python is a for loop:
for user_url in fileinput.input(listpath):
    process          # placeholder for the per-line work
This approach leaves a "\n" at the end of every line, which corrupts the URL being fetched. The fix is to add one line at the top of the loop body:

    user_url = user_url.rstrip('\n')

To strip all trailing whitespace from each line (not just '\n'), the usual form is:

    user_url = user_url.rstrip()

In the code above, it suffices to uncomment the marked rstrip line.
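The effect of the trailing newline, and the difference between the two rstrip variants, can be checked in isolation (the URL below is a made-up example):

```python
# A line as it comes back from iterating over URLlist.txt: newline included.
line = 'http://weibo.com/u/1234567890 \n'

only_newline = line.rstrip('\n')   # drops just the '\n', keeps the trailing space
all_trailing = line.rstrip()       # drops every kind of trailing whitespace
```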