Code: GitHub
Reference: blog
Crawl a given account's profile and Weibo posts with the Scrapy framework.
Weibo follower ranking as of this writing (2019-01-15):
Crawling approach: scrape the mobile-web (m.weibo.cn) Weibo API.
1. Override the start_requests method
def start_requests(self):
    weibo_id = [1195354434, ]  # seed user IDs
    for wid in weibo_id:
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(wid)
        print(url)
        yield Request(url, callback=self.parse_userInfo, dont_filter=True,
                      meta={'uid': str(wid)})
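String concatenation works, but the same getIndex URL can also be built with urllib.parse.urlencode, which handles escaping for you. A minimal sketch (the parameter names match the URLs used in this post; the helper name is mine):

```python
from urllib.parse import urlencode

def build_index_url(uid, containerid=None, page=None):
    """Build an m.weibo.cn getIndex API URL for a user, optionally paged."""
    params = {'type': 'uid', 'value': str(uid)}
    if containerid:
        params['containerid'] = containerid
    if page:
        params['page'] = str(page)
    return 'https://m.weibo.cn/api/container/getIndex?' + urlencode(params)
```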
2. Parse the user's profile and extract the containerid
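The parse_userInfo callback itself is not shown here; its core job is pulling the profile dict and the "weibo" tab's containerid out of the getIndex response. A sketch of that extraction follows (the field names userInfo, tabsInfo, tabs, and tab_type are assumptions based on the 2019-era m.weibo.cn response shape, not taken from this post):

```python
import json

def extract_user_and_containerid(body):
    """Pull the profile dict and the 'weibo' tab's containerid out of a
    getIndex response body. Field names are assumed from the 2019-era API."""
    data = json.loads(body).get('data', {})
    user_info = data.get('userInfo', {})
    containerid = None
    for tab in data.get('tabsInfo', {}).get('tabs', []):
        if tab.get('tab_type') == 'weibo':
            containerid = tab.get('containerid')
    return user_info, containerid
```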
3. Crawl the blogger's Weibo posts and the people they follow
# Parse the Weibo post list
def parse_weibo_list(self, response):
    # Carry the paging info forward so the next page can be requested
    next_page = str(int(response.meta['page']) + 1)
    uid = response.meta['uid']
    containerid = response.meta['containerid']
    content = json.loads(response.text).get('data')
    cards = content.get('cards')
    if len(cards) > 0:
        print("-----crawling page %s-----" % str(response.meta['page']))
        for j in range(len(cards)):
            card_type = cards[j].get('card_type')
            # Weibo post
            # if card_type == 9:
            #     mblog = cards[j].get('mblog')
            #     attitudes_count = mblog.get('attitudes_count')  # like count
            #     comments_count = mblog.get('comments_count')  # comment count
            #     created_at = self.date_format(mblog.get('created_at'))  # publish time
            #     reposts_count = mblog.get('reposts_count')  # repost count
            #     scheme = cards[j].get('scheme')  # post URL
            #     # Replace <br /> with newlines, then extract the plain text
            #     text = etree.HTML(str(mblog.get('text')).replace('<br />', '\n')).xpath('string()')  # post content
            #     pictures = mblog.get('pics')  # attached images, a list
            #     pic_urls = []  # image URLs
            #     if pictures:
            #         for picture in pictures:
            #             pic_url = picture.get('large').get('url')
            #             pic_urls.append(pic_url)
            #     uid = response.meta['uid']
            #     # Save the data
            #     sinaitem = SinaItem()
            #     sinaitem["uid"] = uid
            #     sinaitem["text"] = text
            #     sinaitem["scheme"] = scheme
            #     sinaitem["attitudes_count"] = attitudes_count
            #     sinaitem["comments_count"] = comments_count
            #     sinaitem["created_at"] = created_at
            #     sinaitem["reposts_count"] = reposts_count
            #     sinaitem["pictures"] = pic_urls
            #     yield sinaitem
            # Follow info
            if card_type == 11:
                # URL of the users this account follows; inspect the requests made by
                # https://m.weibo.cn/p/index?containerid=231051_-_followers_-_1195354434_-_1042015%3AtagCategory_050&luicode=10000011&lfid=1076031195354434
                fllow_url = str(cards[j]['card_group'][0]['scheme']).replace('https://m.weibo.cn/p/index?', 'https://m.weibo.cn/api/container/getIndex?')
                print(fllow_url, '----')
                yield Request(url=fllow_url, callback=self.parse_fllow)
    # Next page
    # weibo_list_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid + '&containerid=' + containerid + '&page=' + next_page
    # response.meta['page'] = next_page
    # yield Request(weibo_list_url, callback=self.parse_weibo_list, meta=response.meta)
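The URL rewrite for the follow list (swapping the /p/index page URL for the getIndex API endpoint behind it) can be isolated into a small helper so the callback stays readable:

```python
def to_api_url(scheme_url):
    """Rewrite an m.weibo.cn page URL to the JSON API endpoint behind it,
    keeping the query string (containerid, luicode, etc.) intact."""
    return scheme_url.replace('https://m.weibo.cn/p/index?',
                              'https://m.weibo.cn/api/container/getIndex?')
```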
4. Repeat the process with the IDs of the users they follow
# Parse the followed users
def parse_fllow(self, response):
    content = json.loads(response.text).get('data')
    cards = content.get('cards')
    # if len(cards) > 0:
    for card in cards:
        if card.get('title') == '他的全部关注':  # the "all of their follows" card
            for tmp in card.get('card_group'):
                user = tmp.get('user')
                # ID of a followed user
                uid = user.get('id')
                yield Request('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(uid), callback=self.parse_userInfo, dont_filter=True,
                              meta={'uid': str(uid)})
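Because parse_fllow feeds new uids back into the pipeline, the crawl walks the follow graph indefinitely. One simple control condition is a visited set with a hard cap; a sketch (the class and its limit are mine, not from the original project):

```python
class SeenFilter:
    """Track crawled uids so the follow-graph walk terminates."""

    def __init__(self, max_users=1000):
        self.seen = set()
        self.max_users = max_users

    def allow(self, uid):
        """Return True exactly once per uid, and never past the cap."""
        uid = str(uid)
        if uid in self.seen or len(self.seen) >= self.max_users:
            return False
        self.seen.add(uid)
        return True
```

Inside parse_fllow you would then guard each `yield Request(...)` with `if self.seen_filter.allow(uid):`.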
Since this process loops back on itself, you need some stopping condition for the crawl to ever finish (assuming your IP does not get banned first).
It is best to first filter down to the users you are interested in, then crawl their posts.
To avoid bans, use rotating proxy IPs, added in the downloader middleware.
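A downloader middleware that rotates proxies could look like the sketch below. PROXY_LIST is a hypothetical custom setting name (fill it with your own proxy URLs); from_crawler and process_request are the standard Scrapy middleware hooks:

```python
import random

class RandomProxyMiddleware:
    """Attach a randomly chosen proxy to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed setting, e.g. ['http://1.2.3.4:8080', ...]
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
```

Enable it via DOWNLOADER_MIDDLEWARES in settings.py, as with any other downloader middleware.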