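This method of scraping Sina Weibo friend data is only a personal experiment and is not general-purpose; it just lays the groundwork for later learning how to scrape Weibo data through a simulated login.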
import requests
import re
import pandas as pd
url1 = "http://weibo.com/******************page=" # this hard-coded URL is why the approach is not general-purpose; fine for personal tinkering
url2 = "#PL**********************"
The approach here splits the URL into two parts and joins them around the page number; page=1 is the first page.
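The two-part URL scheme described above can be sketched as a small helper. The function name `build_page_url` and the example prefix and suffix are hypothetical placeholders; the real values are the elided `url1` and `url2` strings.

```python
def build_page_url(prefix, suffix, page):
    """Join the two URL halves around a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return prefix + str(page) + suffix


# Placeholder values for illustration only; this is not the real Weibo URL.
example_prefix = "http://example.com/friends?page="
example_suffix = "#list"

# range(1, 5) yields pages 1 through 4, the same range the post loops over.
page_urls = [build_page_url(example_prefix, example_suffix, p) for p in range(1, 5)]
```

Keeping the page number as the only moving part makes it easy to change how many pages are fetched without touching the rest of the URL.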
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36 QIHU 360SE',
'Accept':'text/html;q=0.9,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch',
'Accept-Language':'zh-CN,zh;q=0.8',
'Connection':'keep-alive',
'Pragma':'no-cache',
'Referer':'http://weibo.com/*************',
}
cookie = {
'ALF':'1525759072',
'Apache':'2971422490663.8267.1494222711802',
'SCF':'AsBhIJmo__OkAuhgojd0xI5JknPN0XDu-4SKjNJb-YXfggEFf_bc9opqHN4VUwI5nbYlEDOJQadZgqIAgSqEfM4.',
'SINAGLOBAL':'4931631747167.558.1492939824220',
'SSOLoginState':'1494223072',
'SUB':'_2A250FHiwDeRhGeRI6FUU9yzMyz-IHXVXYO14rDV8PUNbmtBeLRL9kW-FkGLc6LOHOHypVuJFRcsdRwWsBw..',
'SUHB':'0HTgks1rybumwV',
'ULV':'1494222711819:16:8:2:2971422490663.8267.1494222711802:1494130360115',
'login_sid_t':'bb143148183a80ae8df1969a1d362d5a',
'UM_distinctid':'15b9a25dacea8-0ea80d0c8-b0c2725-15f900-15b9a25dacf12',
'un':'18243181125',
'wvr':'6'
}
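# To find the headers and cookie values, open the page in a browser, right-click and choose Inspect Element, then look them up in the Network tab.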
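The Network tab usually shows the cookies as one long `Cookie` request header rather than as separate entries. As a rough sketch, a helper like the one below could turn that string into a dict suitable for the `cookies=` argument of `requests.get`. The name `parse_cookie_header` and the sample string are made up for illustration.

```python
def parse_cookie_header(raw):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # the value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up sample string, not a real session.
sample = "wvr=6; SUHB=abc123; token=a=b"
parsed = parse_cookie_header(sample)
```

Splitting on the first `=` only matters because some cookie values contain `=` themselves.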
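pages = []
file_sina = open("sina.txt", 'wb')  # binary mode, because r.content is raw bytes
for i in range(1, 5):  # pages 1 to 4
    url = url1 + str(i) + url2
    r = requests.get(url=url, cookies=cookie, headers=headers)
    pages.append(r.content)
    file_sina.write(r.content)
file_sina.close()
html = b"".join(pages)  # keep every page for parsing, not only the last one fetched
# The fetched HTML is saved in a txt file, so it can simply be reopened whenever it is needed.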
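# Byte patterns (br'...') are used because html holds the raw bytes from r.content.
nickname = re.findall(br'&screen_name=(.*?)&', html)  # extract nicknames
sex_weibo = re.findall(br'&screen_name=.*?&sex=(.*?)\\', html)  # extract sex
friend_name_sex = re.findall(br'&screen_name=(.*?)&sex=(.*?)\\', html)  # extract (nickname, sex) pairs
file_friend = open("sina_friend.txt", 'wb')  # save the data to a txt file
for item in friend_name_sex:
    # Writing the Chinese text directly came out garbled for me, so it is re-encoded from UTF-8 to GBK.
    file_friend.write(item[0].decode('utf-8').encode('gbk') + item[1] + b"\n")
file_friend.close()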
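To see what the extraction pattern does, here is a small sketch that runs it on a made-up fragment. The sample string only imitates the escaped markup (each attribute ends with a backslash before the quote); real pages will differ, and the names `sample_html`, `pairs`, `names` and `sexes` are illustrative.

```python
import re

# Made-up fragment imitating the escaped markup; not real Weibo output.
sample_html = r'uid=1&screen_name=Alice&sex=f\" uid=2&screen_name=Bob&sex=m\"'

# Two lazy groups: the nickname, then the sex value up to the next backslash.
pattern = re.compile(r'&screen_name=(.*?)&sex=(.*?)\\')
pairs = pattern.findall(sample_html)

# Deriving both columns from the same matches keeps them the same length.
names = [name for name, _ in pairs]
sexes = [sex for _, sex in pairs]
```

Building both lists from one pattern avoids a length mismatch when an entry has a `screen_name` but no `sex` field, which matters if the lists later become columns of one table.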
The data can also be saved to a CSV table.
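nickName = []
for item in nickname:
    nickName.append(item.decode('utf-8'))  # decode the raw bytes to text
sex = [s.decode('utf-8') for s in sex_weibo]
weibo = pd.DataFrame({'nickname': nickName,
                      'sex': sex
                      })
weibo.to_csv("weibo.csv", encoding='gbk')  # let pandas write GBK so the Chinese text is not garbled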
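If pandas is not available, roughly the same table could be written with the standard library `csv` module instead. This is a minimal Python 3 sketch, not the method used in this post; the rows and the output path are made up.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the extracted (nickname, sex) pairs.
rows = [("小明", "m"), ("Alice", "f")]
out_path = os.path.join(tempfile.gettempdir(), "weibo_example.csv")

# newline="" lets the csv module control line endings;
# gbk mirrors the encoding chosen in this post.
with open(out_path, "w", newline="", encoding="gbk") as f:
    writer = csv.writer(f)
    writer.writerow(["nickname", "sex"])
    writer.writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(out_path, newline="", encoding="gbk") as f:
    read_back = list(csv.reader(f))
```

Passing the encoding to `open` means the text never has to be encoded by hand before writing.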