General-Purpose Crawler, Continued

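This post continues the work on a general-purpose crawler, following the usual three steps: get the response, parse the data, save the data. Responses are fetched with the requests library, with proxy support; data is parsed with bs4, CSS selectors, XPath and similar tools and converted into dictionaries; results can be saved as csv, xlsx or json files or to a redis database, and there is a binary download function with a progress bar. The post covers the crawl logic from single thread to thread pool and shares a simplified version of the crawler code that beginners can use as a reference.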

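Preface

Last time I tried to write an object-oriented crawler as a general-purpose template, but part of the work was left unfinished. Today I'm filling in the rest.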

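Design

Follow the standard three crawler steps:
Get the response body: a requests call. Switching between get and post is a one-word change, so there is nothing to adjust there. Proxy support is added.
Parse the data: the three static parsers I use most (bs4, css, xpath), plus json and regular expressions, each with a couple of sample lines as a reminder for when I forget the syntax. Everything ends up parsed into a dictionary that flows on to the next step.
Save the data: saving to csv, xlsx, json and a redis database is set up and working, plus a binary download function with a progress bar.
The single-thread and thread-pool crawl logic are kept separate.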
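
As a rough illustration of the parse and save steps, here is a minimal sketch that uses only the standard library (re, json, csv). The sample HTML, the field names and the helper names parse_items, save_csv and save_json are all made up for this example and are not taken from the crawler itself; the real code also offers bs4, css and xpath parsing and xlsx and redis output.

```python
import csv
import json
import os
import re
import tempfile

# Hypothetical response body, standing in for response.text
SAMPLE_HTML = '''
<ul>
  <li class="book"><a href="/b/1">Python Basics</a><span class="price">39.5</span></li>
  <li class="book"><a href="/b/2">Web Scraping</a><span class="price">52.0</span></li>
</ul>
<script>var pageInfo = {"page": 1, "total": 2};</script>
'''

ITEM_PATTERN = re.compile(
    r'<li class="book"><a href="(?P<url>.*?)">(?P<title>.*?)</a>'
    r'<span class="price">(?P<price>.*?)</span></li>',
    re.S,
)


def parse_items(html):
    """Parse step: turn the response body into a list of item dicts."""
    # JSON embedded in the page is cut out with a regex, then decoded
    raw_info = re.search(r'var pageInfo = (\{.*?\});', html, re.S).group(1)
    page_info = json.loads(raw_info)
    items = []
    for match in ITEM_PATTERN.finditer(html):
        item = match.groupdict()
        item['price'] = float(item['price'])
        item['page'] = page_info['page']
        items.append(item)
    return items


def save_csv(items, path):
    """Save step: append items to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(items)


def save_json(items, path):
    """Save step: append items as one JSON object per line."""
    with open(path, 'a', encoding='utf8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


items = parse_items(SAMPLE_HTML)
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'items.csv')
json_path = os.path.join(out_dir, 'items.json')
save_csv(items, csv_path)
save_json(items, json_path)
```

Because every parser produces plain dicts, the save functions do not need to know which parser was used, which is what lets the save calls be switched on and off independently.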

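Workflow:

1. Drop a URL into singlethread and check the status code and whether a response body comes back. If not, look up the header information in the browser's Network panel and add it until you get the response body.
2. Build the parsing logic, which ends by producing the item dictionary.
3. Uncomment the save calls you need and save the data.
4. If conditions allow, first build the logic that collects the list of URLs, then crawl with the thread pool to speed things up.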
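
Step 4 might look roughly like the sketch below, which uses concurrent.futures.ThreadPoolExecutor from the standard library. The names fetch_and_parse and crawl_all, the example.com URLs and the 0.1-second sleep are placeholders for illustration; in the real crawler the worker function would do the request and parsing that was first tested in single-thread mode.

```python
import concurrent.futures
import time


def fetch_and_parse(url):
    """Stand-in for the single-thread routine: request, parse, return an item dict."""
    time.sleep(0.1)  # simulates network latency; a real version would call requests here
    return {'url': url, 'length': len(url)}


def crawl_all(urls, max_workers=8):
    """Run fetch_and_parse over every URL in a thread pool and collect the results."""
    items, failed = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL so failures can be reported
        future_to_url = {pool.submit(fetch_and_parse, u): u for u in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                items.append(future.result())
            except Exception as exc:
                # One bad URL should not stop the rest of the crawl
                failed.append((url, exc))
    return items, failed


urls = [f'https://example.com/list?page={n}' for n in range(1, 9)]
items, failed = crawl_all(urls)
print(len(items), 'items,', len(failed), 'failed')
```

Results arrive in completion order rather than URL order, so if order matters the items need to be sorted afterwards or collected with pool.map instead.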

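Code

The code is below. Some of the logic is borrowed from scrapy, but scrapy is a full crawler framework and what I've put together is cruder.
Anyone comfortable with crawlers could write this code, so experts, pardon the mess and feel free to move on. For me this is a full review of crawler basics, and I hope it gives beginners a few ideas.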

import jsonpath
import redis
import requests
from bs4 import BeautifulSoup
import parsel
import re
import csv
import json
import openpyxl
import time
import random
import os
from retrying import retry
from fake_useragent import UserAgent
import datetime
import concurrent.futures
from pprint import pprint

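def check_old():
    # Read the lines already saved in the output file.
    # `filename` is expected to be defined as a module-level global.
    try:
        with open(filename,'r',encoding='utf8',newline='') as old_file:
            olddatas = old_file.readlines()
        return olddatas
    except Exception:
        # File missing or unreadable: treat it as having no previous data
        return []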

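def url_encode(key):
    # Turn each UTF-8 byte escape (\x..) into %25.., i.e. a percent sign that is itself URL-encoded.
    # Note that upper() also upper-cases any ASCII letters in the key.
    key_encode = re.findall('b\'(.*?)\'', str(key.encode('utf-8')), re.S)[0].replace('\\x', '%25').upper()
    print('After UTF-8 encoding:', key_encode)  # After UTF-8 encoding: PYTHON%25E7%2588%25AC%25E8%2599%25AB
    return key_encode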


def get_proxy():
    # Ask the local proxy pool service for its best proxy; the response body is returned as plain text
    proxies = requests.get(url='http://127.0.0.1:5000/getbest').text
    return proxies


class Web_spider:

    @retry(stop_max_attempt_number=4)
    def get_re(self, url):
        headers = {
            'User-Agent': UserAgent().random
            #     ,'referer': 'https://www.sporttery.cn/'
            #     ,'authority': 'webapi.sporttery.cn'
            #     ,'origin': 'https://www.sporttery.cn'
            #     ,'Host': 'www.gtgqw.com'
            #     # ,'cookie': 'urlfrom=121122523; urlfrom2=121122523; adfbid=0; adfbid2=0; ... sajssdk_2015_cross_new_user=1; acw_tc=2760828816194929702121525e7f6acc186ae5456123e1c15dbb3bc7f7d1e5; FSSBBIl1UgzbN