Python crawler learning notes...
This post crawls the m.weibo.cn mobile site, extracts data by analyzing its Ajax API, and stores the results in MongoDB.
Spider file
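For orientation, the user endpoint used below returns JSON roughly of the following shape. This is an abridged, illustrative sketch with placeholder values; only the fields that the parse_user() method later in this post actually reads are listed.

# Abridged, illustrative shape of the user-endpoint response (placeholder values only)
response_json = {
    "data": {
        "userInfo": {
            "id": 1234567890,
            "screen_name": "...",
            "profile_image_url": "...",
            "cover_image_phone": "...",
            "verified_reason": "...",
            "description": "...",
            "statuses_count": 0,
        }
    }
}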
# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request
from pyquery import PyQuery as pq

from ..items import *


class WeiboSpiderSpider(scrapy.Spider):
    name = 'weibo_spider'
    allowed_domains = ['m.weibo.cn']
    # start_urls = ['http://m.weibo.cn/']

    # User profile
    user_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&value={uid}&containerid=100505{uid}'
    # Weibo posts
    weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'
    # Follows
    follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'
    # Fans. Note: the fans endpoint pages with since_id=, not with page= as the follows endpoint does
    fan_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_{uid}&since_id={page}'

    start_uids = [
        '2803301701',  # People's Daily
        '1699432410',  # Xinhua News Agency
        '1974576991',  # Global Times
        '5476386628',  # Xiake Dao
    ]

    def start_requests(self):
        for uid in self.start_uids:
            yield Request(self.user_url.format(uid=uid), callback=self.parse_user)
We first modify the Spider: configure the URL template for each Ajax endpoint, pick a few high-profile ("big V") accounts, put their IDs into a list, and override the start_requests() method so that it requests each account's profile page in turn and hands the response to parse_user() for parsing.
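The weibo_url, follow_url and fan_url templates are not used by start_requests() itself; they only come into play once a user has been parsed. As a rough sketch of how parse_user() (shown further down) could finish, first-page requests for the other endpoints might be issued like this. The parse_weibos, parse_follows and parse_fans callbacks are placeholders and are not defined in this post.

# Sketch only: possible tail of parse_user(), issuing first-page requests
# for the other endpoints. The three callbacks below are hypothetical.
uid = user_info.get('id')
yield user_item
yield Request(self.weibo_url.format(uid=uid, page=1), callback=self.parse_weibos)
yield Request(self.follow_url.format(uid=uid, page=1), callback=self.parse_follows)
yield Request(self.fan_url.format(uid=uid, page=1), callback=self.parse_fans)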
items.py
from scrapy import Item, Field


class UserItem(Item):
    collection = 'users'

    id = Field()                 # user id
    name = Field()               # screen name
    profile_image = Field()      # avatar image
    cover_image = Field()        # cover (background) image
    verified_reason = Field()    # verification reason
    description = Field()        # bio
    fans_count = Field()         # number of fans
    follows_count = Field()      # number of follows
    weibos_count = Field()       # number of weibos
    mbrank = Field()             # membership rank
    verified = Field()           # whether verified
    verified_type = Field()      # verification type
    verified_type_ext = Field()  # meaning of this and the following fields is unclear
    gender = Field()
    mbtype = Field()
    urank = Field()
    crawled_at = Field()         # crawl timestamp, set in pipelines.py


class UserRelationItem(Item):
    collection = 'UserRelation'

    id = Field()
    follows = Field()
    fans = Field()


class WeiboItem(Item):
    collection = 'weibos'

    id = Field()
    idstr = Field()
    edit_count = Field()
    created_at = Field()
    version = Field()
    thumbnail_pic = Field()
    bmiddle_pic = Field()
    original_pic = Field()
    source = Field()
    user = Field()
    text = Field()
    crawled_at = Field()
Each item defines a collection field that names the MongoDB collection it is saved to. A user's follows and fans lists are stored in a separate UserRelationItem, where id is the user's ID, follows is the list of accounts the user follows, and fans is the list of followers.
Extracting the data
Next we extract the data: to parse the user information, we implement the parse_user() method.
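Since crawled_at is noted above as being set in pipelines.py, and collection names the target MongoDB collection, a pipeline along the following lines would fit. This is a minimal sketch, assuming pymongo is installed and that MONGO_URI and MONGO_DATABASE are defined in settings.py; the class names TimePipeline and MongoPipeline are illustrative, not the exact pipeline used here.

import time

import pymongo


class TimePipeline(object):
    """Stamp every item with the time it was crawled (fills the crawled_at field)."""

    def process_item(self, item, spider):
        item['crawled_at'] = time.strftime('%Y-%m-%d %H:%M', time.localtime())
        return item


class MongoPipeline(object):
    """Save each item into the MongoDB collection named by its `collection` attribute."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed entries in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert by id so re-crawling a user or weibo updates the existing document
        self.db[item.collection].update_one(
            {'id': item.get('id')}, {'$set': dict(item)}, upsert=True
        )
        return item

Both classes would also need to be enabled under ITEM_PIPELINES in settings.py for Scrapy to run them.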
# Parse user info
def parse_user(self, response):
    self.logger.debug(response)
    result = json.loads(response.text)
    if result.get('data').get('userInfo'):
        user_info = result.get('data').get('userInfo')
        user_item = UserItem()
        user_item['id'] = user_info.get('id')                            # user id
        user_item['name'] = user_info.get('screen_name')                 # screen name
        user_item['profile_image'] = user_info.get('profile_image_url')  # avatar image
        user_item['cover_image'] = user_info.get('cover_image_phone')    # cover (background) image
        user_item['verified_reason'] = user_info.get('verified_reason')  # Weibo verification reason
        user_item['description'] = user_info.get('description')          # bio
        user_item['weibos_count'] = user_info.get('statuses_count')      # number of weibos