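Preface
Last time I set out to turn an object-oriented spider into a general-purpose template, but I left part of the work unfinished. Today I'm filling in the rest.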
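Design
Follow the standard three steps of a spider:
Get the response body: a requests call. Switching between get and post is a one-word change, so there is little to adjust beyond adding proxy support.
Parse the data: the three static parsing approaches I use most at the moment (bs4, CSS selectors, XPath), plus JSON and regular expressions. Each gets two example lines as a reminder for when I forget the syntax, and every parser ends by producing a dict that is handed to the next step.
Save the data: get saving to csv, xlsx, json, and a Redis database working, plus a binary download function with a progress bar.
Keep the single-threaded crawling logic separate from the thread-pool crawling logic.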
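The handoff between the parse step and the save step can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library so it runs offline, the sample HTML stands in for a real response body, and the helper names `parse_items`, `save_csv`, and `save_json` are hypothetical rather than the names used in the template.

```python
import csv
import io
import json
import re

# Hypothetical response body standing in for requests.get(url).text
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
"""


def parse_items(html):
    """Parse step: yield one item dict per link matched by the regex."""
    pattern = re.compile(r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')
    for match in pattern.finditer(html):
        yield {"title": match.group("title"), "url": match.group("url")}


def save_csv(items, fileobj):
    """Save step: write the item dicts as CSV rows under a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(items)


def save_json(items, fileobj):
    """Save step: write one JSON object per line."""
    for item in items:
        fileobj.write(json.dumps(item, ensure_ascii=False) + "\n")


items = list(parse_items(SAMPLE_HTML))

# In-memory buffers replace real files so the sketch leaves nothing on disk
csv_buffer = io.StringIO()
save_csv(items, csv_buffer)
json_buffer = io.StringIO()
save_json(items, json_buffer)
```

Because every parser produces plain dicts, the save functions do not need to know which parsing method was used upstream.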
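Workflow:
1. Pass a URL to singlethread and check the status code and whether a response body came back. If it did not, copy the missing header fields from the browser's Network panel until the response body arrives.
2. Build the parsing logic so that it ends by producing an item dict.
3. Uncomment the save calls you need and store the data.
4. If conditions allow, first build the logic that collects the URL list, then crawl with the thread pool to speed things up.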
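The move from step 1 to step 4 might look like the sketch below: the same per-page function is first run in a plain loop for debugging, then handed to a thread pool. The function names `crawl_one`, `crawl_single`, and `crawl_pool` and the example.com URLs are assumptions for illustration, and `crawl_one` is a stub that makes no network request.

```python
import concurrent.futures


def crawl_one(url):
    """Stand-in for the fetch, parse, save chain of a single page.

    A real version would request the URL here; this stub only builds
    an item dict so the example runs offline.
    """
    return {"url": url, "length": len(url)}


def crawl_single(urls):
    """Single-threaded mode: convenient while debugging headers and parsing."""
    return [crawl_one(url) for url in urls]


def crawl_pool(urls, max_workers=4):
    """Thread-pool mode: the same per-page function, run concurrently."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(crawl_one, urls))


# Hypothetical URL list; example.com is a placeholder domain
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
single_results = crawl_single(urls)
pool_results = crawl_pool(urls)
```

Keeping the per-page logic in one function means the thread pool only changes how pages are scheduled, not what is done with each page.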
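Code
The code is below. Some of the logic is borrowed from scrapy, but scrapy is a full crawling framework and mine is much cruder.
Anyone fluent in crawling could write this code, so experts are welcome to skip ahead. For me this is a full review of the crawling basics, and I hope it gives beginners some ideas.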
import jsonpath
import redis
import requests
from bs4 import BeautifulSoup
import parsel
import re
import csv
import json
import openpyxl
import time
import random
import os
from retrying import retry
from fake_useragent import UserAgent
import datetime
import concurrent.futures
from pprint import pprint
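def check_old():
    # Return the lines of a previous output file, or [] if it cannot be read.
    # `filename` is expected to be defined at module level before this is called.
    try:
        with open(filename, 'r', encoding='utf8', newline='') as old_file:
            olddatas = old_file.readlines()
        return olddatas
    except Exception:
        return []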
def url_encode(key):
    # Double percent-encode the UTF-8 bytes of key: '\xe7' becomes '%25E7'
    key_encode = re.findall('b\'(.*?)\'', str(key.encode('utf-8')), re.S)[0].replace('\\x', '%25').upper()
    print('After UTF-8 encoding:', key_encode)  # After UTF-8 encoding: PYTHON%25E7%2588%25AC%25E8%2599%25AB
    return key_encode
def get_proxy():
    # Ask a local proxy-pool service for its best proxy and return the raw response text
    proxies = requests.get(url='http://127.0.0.1:5000/getbest').text
    return proxies
class Web_spider:
    @retry(stop_max_attempt_number=4)
    def get_re(self, url):
        headers = {
            'User-Agent': UserAgent().random
            # ,'referer': 'https://www.sporttery.cn/'
            # ,'authority': 'webapi.sporttery.cn'
            # ,'origin': 'https://www.sporttery.cn'
            # ,'Host': 'www.gtgqw.com'
# # ,'cookie': 'urlfrom=121122523; urlfrom2=121122523; adfbid=0; adfbid2=0; x-zp-client-id=35634343-9cfb-44a3-8371-a71e0ddb96ef; sts_deviceid=179112989af38f-07821ed204e4dc-d7e163f-1327104-179112989b033d; sts_sg=1; sts_chnlsid=121122523; zp_src_url=https%3A%2F%2Fwww.baidu.com%2Fbaidu.php%3Fsc.K60000avpXkFvm720P_DWu1e_4t0TU3D9_0sdLXDMB1OxVgTYBZB_w4qQYtaAH54jJAZ2ftR8m43YIKlMsLKxJF3DqUFKR374quLrN_zcT8xvrGQrAvpChnPOT5uLtcqz0P74bixogkFqIhMM4YR_OLEoRUh5nRzQMFEmELCIM-OSAEXDez1z5B6k1iskQY5Styzr8Hx3jZMMFdFq5h6E7LjrQ0_.7D_NR2Ar5Od669BCXgjRzeASFDZtwhUVHf632MRRt_Q_DNKnLeMX5DkgboozuPvHWdsHRy2J7jZZOlsfRymoM4EQ9JuIWxDBaurGtIKnLxKfYt_U_DY2yQvTyjtLsqT7jHzlRL5spy59OPt5gKfYtVKnv-WF_tU2lSMkl32AM-9I7fH7fmCuX8a9G4myIrP-SJFWZWlkLfYXLDkexdlShEIbOdSLOpSHOUS5zxx8zQDk_vyNtThlE-ozTVHQ8gZJyAp7W_zNe57f.U1Yz0ZDqd_xKJVgfkoWPSPx8YnQNYnp30ZKGm1Ys0Zfqd_xKJVgfkoWPSPx8YnQNYnp30A-V5HczPfKM5gK1nsKdpHdBmy-bIykV0ZKGujYkrfKWpyfqn0KVIjYknjD4g1DsnHIxnW0dnNt1nHcsg1DsPjwxnH0zndt1PW0k0AVG5H00TMfqPHns0AFG5HDdr7tznjwxnWDLg1RsnsKVm1Yknj0kg1D4njnkP10sPHFxnW0dnNtknjFxnH0zg17xn0KkTA-b5H00TyPGujYs0ZFMIA7M5H00mycqn7ts0ANzu1Ys0ZKs5HcLPHRznH0Ynjn0UMus5H08nj0snj0snj00Ugws5H00uAwETjYs0ZFJ5HD0uANv5gKW0AuY5H00TA6qn0KET1Ys0AFL5HDs0A4Y5H00TLCq0A71gv-bm1dsTzd8p6KGuAnqHbC0TA9YXHY0IA7zuvNY5Hm1g1KxnHRs0ZwdT1Y3nHR3P1nsP1Rvn10LPHmdP1bs0ZF-TgfqnHmkrHf4njR4nWDYrfK1pyfquWb3rAN9PAmsnjD1nyc4PsKWTvYqwj-7nbfYPWIjnH0znRRdP0K9m1Yk0ZK85H00TydY5H00Tyd15H00XMfqn0KVmdqhThqV5HKxn7tsg1KxnH0YP-tsg100uA78IyF-gLK_my4GuZnqn7tsg1KxnHfdrjnzndtkrj6kPWb4g1Kxn0Ksmgwxuhk9u1Ys0AwWpyfqn0K-IA-b5iYk0A71TAPW5H00IgKGUhPW5H00Tydh5H00uhPdIjYs0A-1mvsqn0KlTAkdT1Ys0A7buhk9u1Yk0Akhm1Ys0AwWmvfqP1KDPjPKPjRdnDuKnbc4nHRznHcsPHIawDR3PHKKPWD0IZF9uARqP1msnW0z0AFbpyfqnRm3PWb3n1-7wj6dPRFKfRR4wRRdPRFAnWI7PRfvrjD0UvnqnfKBIjYs0Aq9IZTqn0KEIjYk0AqzTZfqninsc1nWnBnzPH64nWnzPanznH0sc1cknj08nj0snj0sc1DWnBnsczYWna3snj0snj0Wni3snj0snj00XZPYIHYzP1RLPjTL0Z7xIWYsQWbLg108njKxna3sn7tsQWb1g108rjNxna31ndtsQWcsg1Dzr0KBTdqsThqbpyfqn0KzUv-hUA7M5H00mLmq0A-1gvPsmHYs0APs5H00ugPY5H00mLFW5HnvrHb3%26xst%3D
TjYzP1RdnWDsPj010ynqP1KDPjPKPjRdnDuKnbc4nHRznHcsPHIawDR3PHKKPWDKmWYkwW6vrH61rRNDrjRdfb7KwH-7wHRdfbmzPYRdwjm3nf715HDLrH6srjT4nHnLn10knWT1Pjbdg1czPNtk0gTqd_xKJVgfkoWPSPx8YnQNYnp30gDqd_xKJVgfkoWPSPx8YnQNYnp30gRqnWTdP1fLPs7Y5HDvnHbYrH0drHcKUgDqn0cs0BYKmv6quhPxTAnKn1TvPHDsrj6k%26word%3D%26ck%3D5976.8.71.327.382.608.140.1600%26shh%3Dwww.baidu.com%26sht%3D88013251_12_hao_pg%26us%3D2.0.1.0.0.0.0%26wd%3D%26bc%3D110101; sajssdk_2015_cross_new_user=1; acw_tc=2760828816194929702121525e7f6acc186ae5456123e1c15dbb3bc7f7d1e5; FSSBBIl1UgzbN