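Official documentation: feapder official documentation | feapder-document
Install command: pip install feapder
Command to install the full version:
pip3 install feapder[all]
For this walkthrough the two differ little; the full version additionally supports in-memory deduplication.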
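Command to create a spider project:
feapder create -p spider-douyin
This creates a spider project named spider-douyin.
Then cd into the spiders directory and create a spider with:
spiders> feapder create -s douyin_pinglun  # create a spider named douyin_pinglun
Spider is the distributed spider and AirSpider is the ordinary lightweight one; choosing the first option, the lightweight spider, is enough here.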
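The generated file is simple. The start_requests method issues the initial URLs, so all crawling starts from there. If a request does not name a handler through callback, its response is passed to the parse method by default.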
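What if you need to pass things such as cookies or headers?
For that, define a custom download middleware, download_midware:
Here you can set proxies, headers, cookies and so on, each passed as a dictionary. Every request that feapder wraps goes through this method, so encrypted parameters can be attached here as well.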
def download_midware(self, request):
    request.headers = {'User-Agent': ""}
    request.proxies = {"https": "https://12.12.12.12:6666"}
    request.cookies = {}
    return request
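Create a table in the MySQL database with the fields you want to scrape, then open the framework's settings file:
Fill in the MySQL connection there. Besides the MySQL connection, this file also lets you configure random User-Agents, spider concurrency, logging and so on.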
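As a rough illustration of the settings mentioned above, the fragment below shows what the relevant part of the project's setting.py might look like. The option names follow feapder's generated settings template as far as I know; every value (host, database name, credentials, thread count) is a placeholder and not taken from the original project.

```python
# setting.py (fragment): placeholder values, adjust to your environment

# MySQL connection used when items are written to the database
MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "douyin"            # hypothetical database name
MYSQL_USER_NAME = "root"       # placeholder account
MYSQL_USER_PASS = "change-me"  # placeholder password

# Spider behaviour
SPIDER_THREAD_COUNT = 1        # number of concurrent worker threads
RANDOM_HEADERS = True          # use a random User-Agent for each request
LOG_LEVEL = "DEBUG"            # logging verbosity
```

If an option name differs in your feapder version, the commented-out defaults in the generated setting.py are the authoritative list.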
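Once that is configured, cd into the items directory.
Create an item with the command:
feapder create -i douyin_pinlun  # douyin_pinlun is the name of the MySQL table the data is stored in
This generates an item file. Note that if the table uses an auto-increment id, you need to comment out self.id in the generated file.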
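To use the item in a spider, import it first.
from items import douyin_pinlun_item  # this import may be flagged as an error, but it does not affect execution
Then construct an instance: item = douyin_pinlun_item.DouyinPinlunItem()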
item = douyin_pinlun_item.DouyinPinlunItem()
item['text_pinglun'] = i.get('text')
item['digg_count'] = i.get('digg_count', 'null')
item['nickname'] = i.get("user").get("nickname")
yield item
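One detail of the snippet above: i.get("user").get("nickname") raises AttributeError when a comment has no "user" entry. The sketch below shows one possible way to make the field mapping tolerant of that. The helper name comment_to_fields and the sample data are made up for illustration; the function only builds a plain dict, so it runs without feapder.

```python
def comment_to_fields(comment: dict) -> dict:
    """Map one raw comment dict to the columns used by the item."""
    user = comment.get("user") or {}  # tolerate a missing or null "user"
    return {
        "text_pinglun": comment.get("text"),
        "digg_count": comment.get("digg_count", "null"),
        "nickname": user.get("nickname"),
    }


# Hypothetical sample data, shaped like the fields used in the article
sample = {"text": "nice video", "digg_count": 12, "user": {"nickname": "alice"}}
print(comment_to_fields(sample))
print(comment_to_fields({"text": "no user here"}))
```

In the spider, the returned dict could then be copied onto the item with a loop such as: for key, value in comment_to_fields(i).items(): item[key] = value.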
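A project can contain one spider file or several. A single spider can be started by running its file directly; several spiders need to be wired up in main.
The number of threads at startup is whatever you pass as thread_count.
For a simple main, it is enough to import the spider modules and start them, which lets you launch several spiders with one command. Note that loading douyin_me.js fails when main runs from a different directory, because the file is opened through a relative path; one workaround is to put an extra copy of the file in the directory you launch from.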
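As an alternative to copying douyin_me.js around, the path can be resolved relative to the spider module instead of the current working directory. This is only a sketch of that idea, not something the original project does; the helper name sibling_path is hypothetical.

```python
from pathlib import Path


def sibling_path(module_file: str, filename: str = "douyin_me.js") -> Path:
    """Return the path of `filename` located next to `module_file`."""
    return Path(module_file).parent / filename


# Inside the spider module you would pass __file__, for example:
#     js_path = sibling_path(__file__)
#     code = js_path.read_text(encoding="utf-8")
print(sibling_path("spiders/douyin_pinglun.py"))
```

Because the result depends only on where the spider file lives, it stays the same whether the spider is started directly or through main.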
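Retrying on errors and validating data:
This needs another custom hook, validate.
If the returned status code is not 200, raise an exception and the same URL is requested again; at that point you can switch proxies or check whether some other verification, such as a slider captcha, is in the way.
If there is no data, return False to discard the current request so that it goes no further. Simple checks like these can live here.
@summary: validation function, used to check whether the response is correct
If the function raises an exception, the request is retried
If it returns True or None, the response goes on to the parse function
If it returns False, the current request is discarded
Use request.callback_name to tell callbacks apart and write different validation logic for each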
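The three outcomes described above (raise, False, True or None) can be illustrated with a small simulation. The handle function below is a toy stand-in written for this example and is not feapder's real dispatch code; the response objects are SimpleNamespace stubs whose data attribute plays the role of the parsed JSON body.

```python
from types import SimpleNamespace


def validate(request, response):
    """Raise to retry, return False to drop, return None to continue."""
    if response.status_code != 200:
        raise Exception("response code not 200")  # signals "retry"
    if not response.data.get("comments"):
        return False  # signals "drop this request"
    return None  # falls through to the parse step


def handle(request, response):
    """Toy dispatcher that mirrors the documented behaviour."""
    try:
        ok = validate(request, response)
    except Exception:
        return "retry"
    return "drop" if ok is False else "parse"


req = SimpleNamespace(url="https://example.com/api")
print(handle(req, SimpleNamespace(status_code=503, data={})))  # retry
print(handle(req, SimpleNamespace(status_code=200, data={})))  # drop
print(handle(req, SimpleNamespace(status_code=200, data={"comments": [{"text": "hi"}]})))  # parse
```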
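The response returned for a request supports xpath, json, CSS selectors and so on directly.
url = response.xpath("./@href").extract()  # returns the matched text, stored as a list
Without extract(), the result is an iterable of selector objects; when looping over them, call extract_first() on each one to get its content.
Put simply, extract() returns every match of the XPath expression as a list, while extract_first() returns only the first one.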
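To make the "all matches versus first match" distinction concrete without installing anything, here is an analogy built on the standard library's ElementTree. The two helper functions are stand-ins written for this example; they imitate the behaviour of extract() and extract_first() but are not the selector API itself.

```python
import xml.etree.ElementTree as ET

html = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
root = ET.fromstring(html)


def extract(nodes, attr):
    """Analogue of .extract(): every match, as a list."""
    return [node.get(attr) for node in nodes]


def extract_first(nodes, attr):
    """Analogue of .extract_first(): the first match, or None if there is none."""
    return nodes[0].get(attr) if nodes else None


links = root.findall(".//a")
print(extract(links, "href"))        # ['/a', '/b']
print(extract_first(links, "href"))  # /a
print(extract_first([], "href"))     # None
```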
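Custom requests: the log shows that every request is sent as GET. What if I want to send a POST or some other custom request, without going through the default download_midware hook?
Import Request:
from feapder import Request
Request wraps the requests library's request method and accepts every parameter that method supports. A POST request is specified as follows:
Set method to POST on the request: method='POST'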
import feapder
from feapder import Request


class TestAirSpiders(feapder.AirSpider):
    __custom_setting__ = dict(
        USE_SESSION=True,
        TASK_MAX_CACHED_SIZE=10,
    )

    def start_requests(self):
        data = {'www': 'eee'}
        headers = {}
        yield Request('https://www.baidu.com/', method='POST', data=data, headers=headers, callback=self.start_callbacks)

    def start_callbacks(self, request, response):
        print(response.text)
        print("spider started")


if __name__ == "__main__":
    TestAirSpiders(thread_count=1).start()
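Requesting Baidu, the log shows that the parameters were sent along. Note that for a custom request, callback names the function that parses the response:
def start_callbacks(self, request, response)
Besides self, the callback must accept two more parameters, request and response; otherwise an error is raised here.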
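The reason for the error is plain Python: the framework calls the callback with two positional arguments, so a method that accepts only one fails with a TypeError. The sketch below reproduces that with a toy dispatch function; Demo and dispatch are names invented for this illustration and are not part of feapder.

```python
class Demo:
    def good_callback(self, request, response):
        return f"parsed {response} for {request}"

    def bad_callback(self, response):  # missing the `request` parameter
        return f"parsed {response}"


def dispatch(callback, request, response):
    """Toy stand-in for how a framework invokes a callback."""
    return callback(request, response)


demo = Demo()
print(dispatch(demo.good_callback, "req", "resp"))
try:
    dispatch(demo.bad_callback, "req", "resp")
except TypeError as exc:
    print("TypeError:", exc)
```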
Code example:
# -*- coding: utf-8 -*-
"""
Created on 2023-04-01 10:56:27
---------
@summary:
---------
@author: 13008
"""
from py_mini_racer import MiniRacer

import feapder
from items import douyin_pinlun_item


class DouyinPinglun(feapder.AirSpider):
    def vmrun(self, url_):
        # Load douyin_me.js and call its get_cookie() on the request path;
        # the returned value is the path that is actually requested
        ctx = MiniRacer()
        with open('./douyin_me.js', mode='r', encoding='utf-8') as f:
            code = f.read()
        ctx.eval(code)
        sig_url = ctx.call("get_cookie", url_)
        return sig_url

    def start_requests(self):
        self.aweme_id = '7177232220380286263'  # video id
        for i in range(0, 2000, 20):  # cursor advances 20 comments per page
            url = f"/aweme/v1/web/comment/list/?device_platform=webapp&aid=6383&channel=channel_pc_web&aweme_id={self.aweme_id}&cursor={str(i)}&count=20&item_type=0&insert_ids=&rcFT=&pc_client_type=1&version_code=170400&version_name=17.4.0&cookie_enabled=true&screen_width=1536&screen_height=864&browser_language=zh-CN&browser_platform=Win32&browser_name=Chrome&browser_version=97.0.4692.71&browser_online=true&engine_name=Blink&engine_version=97.0.4692.71&os_name=Windows&os_version=10&cpu_core_num=16&device_memory=8&platform=PC&downlink=10&effective_type=4g&round_trip_time=100&webid=7146895385552848424"
            url_2 = self.vmrun(url)
            yield feapder.Request("https://www.douyin.com" + url_2)

    def download_midware(self, request):
        # Headers copied from a logged-in browser session; each value must be a single-line string
        request.headers = {
'bd-ticket-guard-client-csr': 'LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0NCk1JSUJEVENCdFFJQkFEQW5NUXN3Q1FZRFZRUUdFd0pEVGpFWU1CWUdBMVVFQXd3UFltUmZkR2xqYTJWMFgyZDENCllYSmtNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVCRnJRaGw0TkxiYkxoWEJWZTFEUm9CajUNCkRzWEl5eEdiaUZCY0U3QlJDZllFR3B4OVd3RmJXVW55YjgvSmZZLzkxdkJsODkwNHI3WXBZSjVSWjlJNEk2QXMNCk1Db0dDU3FHU0liM0RRRUpEakVkTUJzd0dRWURWUjBSQkJJd0VJSU9kM2QzTG1SdmRYbHBiaTVqYjIwd0NnWUkNCktvWkl6ajBFQXdJRFJ3QXdSQUlnUWVJSWJTWVpHTVVuaWJ1REJybllCM2wyTTR2eFQ5Q2hoQTYyME1BaVNJY0MNCklFV3BaSHQ0aUJUamlJME9WL0F2MU9aNzFjTzdMSktzellwMmZrL1BJaXM1DQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0NCg==',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
'referer': f'https://www.douyin.com/discover?modal_id={self.aweme_id}',
'cookie': 'ttwid=1%7CXc3vhkHxi5GB-1tUeehhRU5p5pj3py5fOWZN5yxlJYs%7C1664016270%7C78148e57c65267c6ba64426ff64f012cf4c034afcb4a4b67f6feb91594ae73ce; douyin.com; passport_csrf_token=be7a44d9d1e987a8fc3a81632a4494d5; passport_csrf_token_default=be7a44d9d1e987a8fc3a81632a4494d5; s_v_web_id=verify_lfxcvumk_18USuvw8_iN9F_4yCB_8l5r_YUXsKsbmDgOb; csrf_session_id=a127a12b0dce1d8495284dc6af43071f; VIDEO_FILTER_MEMO_SELECT=%7B%22expireTime%22%3A1680924196256%2C%22type%22%3A1%7D; download_guide=%223%2F20230401%22; strategyABtestKey=%221680430517.569%22; __ac_nonce=064295cac00a877a95111; __ac_signature=_02B4Z6wo00f01ROBItwAAIDAcIvitEgq83UToSZAACDLM6a6ogezNWp1z0quFGNsh5aE-ETyixb3tLrKDg2jkApA2hnKLZKG82ma9Y9K2msGuSV14Eo17pYf72jZ0MRzY3zOzriUjM8hYOm6ab; home_can_add_dy_2_desktop=%221%22; msToken=TpdjVDe0UXBttERpn4pnc8ZmFhZAXrQuLSfFZ57iA94rIz3TmNTD437X-NzoafOgM4Pg6POPi55Hg8GoQKlPUO9RX1sfGbq1ohcoIQ0CttViqvKESO2rYlc=; tt_scid=S2eYwo65hsY77ld8I1ZFMIJZPhZSDJ.zDk8GP7.iS0rVmszVYpOznO6QXGEnUauxd711; d_ticket=7d056a6264e34ad92bfa1930807d65d25b97c; passport_assist_user=CkEoM6Oza0lRv9LQawD3TVktl002CQx_PvcEuM7VZuZ2-ANWXV59GXnEaX6Nu5s87ShkHw5sk8eDl7BmxtcDrPAOOhpICjwV4Jy2o-WYdhyKT7C2mvpSzx7_460hh-kfswRKYLmt8BaZmttjgWE6gMbThx8-U2wE_zEflBybuPO1FF0QybWtDRiJr9ZUIgEDTd3k5Q%3D%3D; n_mh=AVPM_2zR50Xj37sUHvJJ3kubTvbsSRblrQivnnHL0VU; sso_auth_status=b8f6f1a6bf4be96e27920baea5c8f659; sso_auth_status_ss=b8f6f1a6bf4be96e27920baea5c8f659; sso_uid_tt=43ef6c83a74358f6911aabd38408aa58; sso_uid_tt_ss=43ef6c83a74358f6911aabd38408aa58; toutiao_sso_user=1e20a827e4d6181b109ee4e16fe76ff9; toutiao_sso_user_ss=1e20a827e4d6181b109ee4e16fe76ff9; sid_ucp_sso_v1=1.0.0-KDlmMGFiYWI5ZmEwN2FjNDIwNTQ5NmJhNjA3YmNjZjA0NWMwNTNhMDAKHwjMsbDvroy_BBD8uqWhBhjvMSAMMI6_u5kGOAJA7AcaAmhsIiAxZTIwYTgyN2U0ZDYxODFiMTA5ZWU0ZTE2ZmU3NmZmOQ; ssid_ucp_sso_v1=1.0.0-KDlmMGFiYWI5ZmEwN2FjNDIwNTQ5NmJhNjA3YmNjZjA0NWMwNTNhMDAKHwjMsbDvroy_BBD8uqWhBhjvMSAMMI6_u5kGOAJA7AcaAmhsIiAxZTIwYTgyN2U0ZDYxODFiMTA5ZWU0ZTE2ZmU3NmZmOQ; 
msToken=8pGzTdrbxXaw3S8O-6eqGTQMug5sSjiGypvAG6XzcpMY1IHpTSEa3JWSoqBwFUGA3lP-KGyTAOeREl3Gnue3EbM_iC5WL4emUFG9MCQ5c3FqsE_2mHuxPjo=; odin_tt=82ccd7301c6e6437db6f1b6900696d0095901c2f46f30eef47ece0cdd5d4b1bd4db85116364cfdc6aa0b2f3d174ba18e92d528756f1a4d9b4377d3b648839c79; passport_auth_status=c8b7c174e1e776c5aed9736fd422637b%2C93cce2a85124d839b6ed0d613660d30b; passport_auth_status_ss=c8b7c174e1e776c5aed9736fd422637b%2C93cce2a85124d839b6ed0d613660d30b; uid_tt=bd3b018ebfb6b519a5ed79b11f7e2825; uid_tt_ss=bd3b018ebfb6b519a5ed79b11f7e2825; sid_tt=249042a9ff5ef895b72b1139075272b3; sessionid=249042a9ff5ef895b72b1139075272b3; sessionid_ss=249042a9ff5ef895b72b1139075272b3; bd_ticket_guard_client_data=eyJiZC10aWNrZXQtZ3VhcmQtdmVyc2lvbiI6MiwiYmQtdGlja2V0LWd1YXJkLWl0ZXJhdGlvbi12ZXJzaW9uIjoxLCJiZC10aWNrZXQtZ3VhcmQtY2xpZW50LWNlcnQiOiItLS0tLUJFR0lOIENFUlRJRklDQVRFLS0tLS1cbk1JSUNGVENDQWJxZ0F3SUJBZ0lVV0ppSEE0aDhwRW1aOFdmc0YxSVlkcE1TOUdRd0NnWUlLb1pJemowRUF3SXdcbk1URUxNQWtHQTFVRUJoTUNRMDR4SWpBZ0JnTlZCQU1NR1hScFkydGxkRjluZFdGeVpGOWpZVjlsWTJSellWOHlcbk5UWXdIaGNOTWpNd05EQXlNVEEwT0RJNFdoY05Nek13TkRBeU1UZzBPREk0V2pBbk1Rc3dDUVlEVlFRR0V3SkRcblRqRVlNQllHQTFVRUF3d1BZbVJmZEdsamEyVjBYMmQxWVhKa01Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERcbkFRY0RRZ0FFQkZyUWhsNE5MYmJMaFhCVmUxRFJvQmo1RHNYSXl4R2JpRkJjRTdCUkNmWUVHcHg5V3dGYldVbnlcbmI4L0pmWS85MXZCbDg5MDRyN1lwWUo1Ulo5STRJNk9CdVRDQnRqQU9CZ05WSFE4QkFmOEVCQU1DQmFBd01RWURcblZSMGxCQ293S0FZSUt3WUJCUVVIQXdFR0NDc0dBUVVGQndNQ0JnZ3JCZ0VGQlFjREF3WUlLd1lCQlFVSEF3UXdcbktRWURWUjBPQkNJRUlKSWZDc09EUHlZSmZFQXJhQU5DVnBxN2x2SWlCK0oxZENxR2l4dURDNEI1TUNzR0ExVWRcbkl3UWtNQ0tBSURLbForcU9aRWdTamN4T1RVQjdjeFNiUjIxVGVxVFJnTmQ1bEpkN0lrZURNQmtHQTFVZEVRUVNcbk1CQ0NEbmQzZHk1a2IzVjVhVzR1WTI5dE1Bb0dDQ3FHU000OUJBTUNBMGtBTUVZQ0lRQzJiRFNNM2tlMEFsV2NcblBBYXBjOXJwOEhLSVl0UVlPa3ZyYm0zM2wrMXhmQUloQUx2dGM1b3gvZVhROUFxKy9qRlNwcVdKczB4T0t3R1dcblRmYmNMVFQ5MnFLZ1xuLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLVxuIn0=; bd_ticket_guard_server_data=; publish_badge_show_info=%220%2C0%2C0%2C1680432510632%22; LOGIN_STATUS=1; 
store-region=cn-bj; store-region-src=uid; passport_fe_beating_status=true; sid_guard=249042a9ff5ef895b72b1139075272b3%7C1680432511%7C5183999%7CThu%2C+01-Jun-2023+10%3A48%3A30+GMT; sid_ucp_v1=1.0.0-KDAzOTE3NjhkODJlOWQ5ZWQ0ZTM2MjY0NzZiMGVmMzM0ZWZiMDJhOWYKGwjMsbDvroy_BBD_uqWhBhjvMSAMOAJA7AdIBBoCbGYiIDI0OTA0MmE5ZmY1ZWY4OTViNzJiMTEzOTA3NTI3MmIz; ssid_ucp_v1=1.0.0-KDAzOTE3NjhkODJlOWQ5ZWQ0ZTM2MjY0NzZiMGVmMzM0ZWZiMDJhOWYKGwjMsbDvroy_BBD_uqWhBhjvMSAMOAJA7AdIBBoCbGYiIDI0OTA0MmE5ZmY1ZWY4OTViNzJiMTEzOTA3NTI3MmIz',
        }
        return request  # hand the modified request back to the framework

    def validate(self, request, response):
        if response.status_code != 200:
            raise Exception("response code not 200")  # raising an exception triggers a retry
        if not response.json.get('comments'):
            return False  # no comments (missing, null or empty): discard this request

    def parse(self, request, response):
        comments_list = response.json.get('comments')
        print('eee', comments_list)
        for i in comments_list:
            item = douyin_pinlun_item.DouyinPinlunItem()
            item['text_pinglun'] = i.get('text')
            cid = i.get('cid')
            item['digg_count'] = i.get('digg_count', 'null')
            item['nickname'] = i.get("user").get("nickname")
            url2 = f"/aweme/v1/web/comment/list/reply/?device_platform=webapp&aid=6383&channel=channel_pc_web&item_id={self.aweme_id}&comment_id={cid}&cursor=0&count=3&item_type=0&pc_client_type=1&version_code=170400&version_name=17.4.0&cookie_enabled=true&screen_width=1536&screen_height=864&browser_language=zh-CN&browser_platform=Win32&browser_name=Chrome&browser_version=97.0.4692.71&browser_online=true&engine_name=Blink&engine_version=97.0.4692.71&os_name=Windows&os_version=10&cpu_core_num=16&device_memory=8&platform=PC&downlink=10&effective_type=4g&round_trip_time=100&webid=7146895385552848424"
            url = self.vmrun(url2)
            # Follow up with a request for the replies to this comment
            yield feapder.Request("https://www.douyin.com" + url, callback=self.xpath)
            yield item

    def xpath(self, request, response):  # fetch second-level comments (replies)
        print('wwwwwww', response.text)
        comments = response.json.get('comments')
        for i in comments:
            item = douyin_pinlun_item.DouyinPinlunItem()
            item['text_pinglun'] = i.get('text')
            item['digg_count'] = i.get('digg_count', 'null')
            item['nickname'] = i.get("user").get("nickname")
            yield item


if __name__ == "__main__":
    DouyinPinglun(thread_count=1).start()
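The very long URL strings in the spider could also be assembled from a dictionary with urllib.parse.urlencode, which keeps each parameter on its own line. The sketch below uses only a handful of the parameters from the article to stay short, and the helper name comment_list_path is hypothetical. Whether the signing code in douyin_me.js depends on the exact parameter set or order is not stated in the article, so treat this as a readability idea to verify rather than a drop-in replacement.

```python
from urllib.parse import urlencode


def comment_list_path(aweme_id: str, cursor: int, count: int = 20) -> str:
    """Build the comment-list path for one page (subset of parameters only)."""
    params = {
        "device_platform": "webapp",
        "aid": 6383,
        "channel": "channel_pc_web",
        "aweme_id": aweme_id,
        "cursor": cursor,
        "count": count,
    }
    # urlencode keeps the insertion order of the dict
    return "/aweme/v1/web/comment/list/?" + urlencode(params)


# Three pages, with the cursor advancing by the page size as in start_requests
paths = [comment_list_path("7177232220380286263", c) for c in range(0, 60, 20)]
for path in paths:
    print(path)
```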