Crawling 58.com (58同城) used-car data with Celery, and a few open questions

Today I'd like to share how to crawl used-car listings on 58.com (hereafter "58") in a distributed way with Celery.

Anti-crawling measures

58's anti-crawling measures are mainly font obfuscation and CAPTCHA verification.

Let's start with the font obfuscation. The real font file is base64-encoded and embedded in the page source. Extract it with a regular expression, decode it, and build a dictionary that maps each glyph's coordinate data to its real value. Because the font file changes on every request, it has to be extracted each time; look up the real values through that dictionary, replace the obfuscated characters in the HTML, and only then parse the fields you need. That takes care of the font obfuscation. The CAPTCHA deserves a more detailed discussion.
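The full implementation appears in getTempdict() in download_v3.py further down; the snippet below is only a sketch of the mapping step, assuming the base64 font payload has already been extracted from the page source and that dic_font is the hand-built md5(glyph coordinates) → digit table shown in that code:

import base64
from hashlib import md5
from fontTools.ttLib import TTFont

def build_char_map(font_b64, dic_font):
    # Decode this request's font file and save it so fontTools can parse it
    # (the extra '=' re-adds padding stripped during extraction, mirroring getTempdict).
    with open('tc.ttf', 'wb') as f:
        f.write(base64.b64decode(font_b64 + '='))
    font = TTFont('tc.ttf')
    char_map = {}
    for i, name in enumerate(font.getGlyphOrder()):
        if i == 0:        # skip .notdef
            continue
        coords = font['glyf'][name].coordinates
        digit = dic_font[md5(str(coords).encode()).hexdigest()]
        # Glyph names such as 'uni958A' show up in the HTML as entities like '&#x958a;'.
        char_map[name.lower().replace('uni00', '&#x').replace('uni', '&#x')] = digit
    return char_map

# Afterwards, replace each obfuscated entity with its real digit before parsing:
#   for entity, digit in char_map.items():
#       html = html.replace(entity, str(digit))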

CAPTCHA

During crawling, requests are periodically redirected to a verification page. The main challenge is a slider CAPTCHA, and sometimes a click-based CAPTCHA pops up after the slide; that one would probably need a darknet-trained model for recognition, so it is not covered here. I originally wanted to solve the slider with the method from my previous post, but embarrassingly, after preprocessing, 58's CAPTCHA images contain too many contours that confuse the detection, and it failed every time I tried; the orientation of the small slider piece (Figure 1) is also random on each attempt.

After some thought, template matching with aircv solves it. First convert the slider piece to grayscale and binarize it, then crop a characteristic patch of it to use as the sub-image (template). The full CAPTCHA image, after grayscale conversion and binarization (threshold around 185), looks roughly like Figure 2 and almost always contains the complete outline of the slider gap, so it can serve as the search image.

The remaining question is which patch of the slider piece to crop. Since the piece's orientation and shape change on every attempt, I compared a few options; the region circled in red in Figure 1 gives the highest success rate, though it is probably still not optimal. After cropping and preprocessing it looks like Figure 3. It is hard to see here, but the patch contains both the white area outside the slider outline in Figure 2 and the black outline itself, so matching it as a sub-image almost always locates the slider gap in the CAPTCHA. The matched coordinate may sit on either the left or the right edge of the gap outline, however, so an offset still has to be applied and the slide verified more than once.
Figure 1: the small slider piece

Figure 2: the CAPTCHA image after preprocessing
Figure 3: the cropped feature patch

Code

1. CAPTCHA handling: verifycaptcha.py

from time import sleep
import random, math
import aircv as ac
import cv2
import numpy as np
from PIL import Image
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC



SCRIPT = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'


class VerifyCaptcha:

    def __init__(self, threshold=175):
        # Binarization threshold used when preprocessing the CAPTCHA screenshot.
        self.threshold = threshold

        # imgsrc = the processed CAPTCHA (search image); imgobj = the cropped slider patch (template).
        self.imgsrc = "./imgs/handled.png"
        self.imgobj = "./imgs/handleds.png"

    def getTrack(self, gap, offset):
        # Generate a human-like slide track covering (gap + offset) pixels.
        track = []
        gap = gap + offset
        # Current displacement
        current = 0
        # Deceleration threshold: accelerate over the first 4/5, decelerate over the last 1/5
        mid = gap * 4 / 5
        # Time step
        t = 0.2
        # Initial velocity
        v = random.randint(1, 4)

        while current < gap:
            if current < mid:
                a = 3   # accelerate
            else:
                a = -3  # decelerate

            # Velocity at the start of this step
            v0 = v
            # Velocity at the end of this step
            v = v0 + a * t
            # Distance moved this step (a loose variant of v0*t + 1/2*a*t^2)
            move = v0 * t + math.sin(1 / 2 * a * t * t) * 15
            # Accumulate displacement
            current += move
            # Append this step to the track
            track.append(round(move))

        return track


    # Preprocess an image: convert to grayscale, then binarize with the given threshold.
    def handle(self, path, threshold):
        img = Image.open(path)
        img = img.convert('L')
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)

        bim = img.point(table, '1')
        bim.save('./imgs/handled.png')

    def getImg(self, url):
        browser = webdriver.Chrome()
        # Inject the script that hides navigator.webdriver before any page script runs.
        browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': SCRIPT})
        wait = WebDriverWait(browser, 10)
        browser.get(url)
        # Click the code input to bring up the slider CAPTCHA.
        btn_ver = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.code_num > input')))

        btn_ver.click()

        # Grab the CAPTCHA background image and the slider handle.
        canvas = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.dvc-captcha__bgImg')))
        btn = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.dvc-slider__handler')))
        self.handle_slider(canvas, browser, btn, wait)
        sleep(3)
        browser.close()
    
    def handle_slider(self, canvas, browser, btn, wait):
        canvas.screenshot('./imgs/1.png')
        self.cropImg()
        self.handle('./imgs/croped.png', self.threshold)
        gap = self.matchImg()
        print(gap)
        # Three offsets are tried to raise the success rate, since the match may land
        # on either edge of the gap outline.
        for offset in [57, 13, 11]:
            # Build the slide track for this offset.
            track = self.getTrack(gap, offset)
            # Drag the slider along the track.
            ActionChains(browser).click_and_hold(btn).perform()
            for x in track:
                y = random.uniform(-3, 3)
                ActionChains(browser).move_by_offset(xoffset=x, yoffset=y).perform()
            ActionChains(browser).release(btn).perform()
            try:
                # If the handle is still on the page, this attempt failed; otherwise we are done.
                btn = wait.until(EC.presence_of_element_located(
                    (By.CSS_SELECTOR, '.dvc-slider__handler')))
            except TimeoutException:
                return
            if offset == 11:
                # All three offsets failed: retake the screenshot and verify again.
                return self.handle_slider(canvas, browser, btn, wait)
        
    def cropImg(self):
        # Crop the screenshot so the small preview image does not interfere with contour matching.
        base_crop = 70
        img1 = Image.open('./imgs/1.png')
        box = (base_crop, 30, img1.size[0] - 20, img1.size[1] - 25)
        img = img1.crop(box)
        img.save('./imgs/croped.png')

    # Use the cropped slider patch as a template and locate it in the processed CAPTCHA,
    # which gives the x coordinate of the slider gap.
    def matchImg(self, confidencevalue=0.5):
        imsrc = ac.imread(self.imgsrc)   # search image: the processed CAPTCHA
        imobj = ac.imread(self.imgobj)   # template: the cropped slider patch

        match_result = ac.find_template(imsrc, imobj, confidencevalue)
        if match_result is None:
            return 0
        return match_result['result'][0]

if __name__ == "__main__":
    vc = VerifyCaptcha()
    # CAPTCHA page link used for testing
    url = 'https://callback.58.com/antibot/verifycode?serialId=52e2ca88845f157e29a6d26349ef0344_6a059b1477814bb8baf7ee04e2b61764&code=22&sign=235448960cb8f4d6cd710b06c61dc57a&namespace=usdt_infolist_car&url=https%3A%2F%2Finfo5.58.com%3A443%2Ftj%2Fershouche%2F%3FPGTID%3D0d100000-0001-2e0e-cd43-870ea87150b4%26ClickID%3D8'
    # url = 'https://bj.58.com/ershouche/pn2'
    vc.getImg(url)

2. Task producer: celery_client.py

from celery_con import crawl, app
from city import generate_url

for url in generate_url():
    app.send_task('celery_con.crawl', args=(url,))
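
Since crawl is imported anyway, the same enqueue could also be written with the task's own delay() helper, which is equivalent to send_task with the task's registered name 'celery_con.crawl':

from celery_con import crawl
from city import generate_url

for url in generate_url():
    crawl.delay(url)   # same as app.send_task('celery_con.crawl', args=(url,))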

3. Celery app: celery_con.py

from celery import Celery
from city import generate_url
from download_v3 import CarSpider
import celeryconfig


app = Celery('tasks')
app.config_from_object('celeryconfig')
carspider = CarSpider()

@app.task
def crawl(url):
    carspider.run(url)
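
To consume these tasks, each crawling machine runs a Celery worker against this module. A typical invocation (assuming Celery 4.x/5.x, with celery_con.py and its dependencies importable from the working directory) would be:

celery -A celery_con worker --loglevel=info --concurrency=4

celery_client.py can then be run from any machine that can reach the broker.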


4. Celery configuration: celeryconfig.py

BROKER_URL = 'amqp://admin:yourpassword@ip:5672/'
CELERY_RESULT_BACKEND = 'amqp://'
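
The uppercase names above are the pre-4.0 configuration style, which current Celery versions still accept. A rough equivalent in the newer lowercase style, assuming the task results are never read (nothing in the code above consumes them), would be:

broker_url = 'amqp://admin:yourpassword@ip:5672/'
# No result backend: ignoring results keeps result messages from piling up in RabbitMQ.
task_ignore_result = True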

5. URL generation: city.py

CITY = ['bj', 'sh', 'tj', 'cq', 'hf', 'wuhu', 'bengbu', 'fy', 'hn', 'anqing', 'suzhou', 'la', 'huaibei', 'chuzhou', 'mas', 'tongling', 'xuancheng', 'bozhou', 'huangshan', 'chizhou', 'ch', 'hexian', 'hq', 'tongcheng', 'ningguo', 'tianchang', 'dongzhi', 'wuweixian', 'fz', 'xm', 'qz', 'pt', 'zhangzhou', 'nd', 'sm', 'np', 'ly', 'wuyishan', 'shishi', 'jinjiangshi', 'nananshi', 'longhai', 'shanghangxian', 'fuanshi', 'fudingshi', 'anxixian', 'yongchunxian', 'yongan', 'zhangpu', 'sz', 'gz', 'dg', 'fs', 'zs', 'zh', 'huizhou', 'jm', 'st', 'zhanjiang', 'zq', 'mm', 'jy', 'mz', 'qingyuan', 'yj', 'sg', 'heyuan', 'yf', 'sw', 'chaozhou', 'taishan', 'yangchun', 'sd', 'huidong', 'boluo', 'haifengxian', 'kaipingshi', 'lufengshi', 'nn', 'liuzhou', 'gl', 'yulin', 'wuzhou', 'bh', 'gg',
'qinzhou', 'baise', 'hc', 'lb', 'hezhou', 'fcg', 'chongzuo', 'guipingqu', 'beiliushi', 'bobaixian', 'cenxi', 'gy', 'zunyi', 'qdn', 'qn', 'lps', 'bijie', 'tr', 'anshun', 'qxn', 'renhuaishi', 'qingzhen', 'lz', 'tianshui', 'by', 'qingyang', 'pl', 'jq', 'zhangye', 'wuwei', 'dx', 'jinchang', 'ln', 'linxia', 'jyg', 'gn', 'dunhuang', 'haikou', 'sanya', 'wzs', 'sansha', 'qh', 'wenchang', 'wanning', 'tunchang', 'qiongzhong', 'lingshui', 'df', 'da', 'cm', 'baoting', 'baish', 'danzhou', 'zz', 'luoyang', 'xx', 'ny', 'xc', 'pds', 'ay', 'jiaozuo', 'sq', 'kaifeng', 'puyang', 'zk', 'xy', 'zmd', 'luohe', 'smx', 'hb', 'jiyuan', 'mg', 'yanling', 'yuzhou', 'changge', 'lingbaoshi', 'qixianqu', 'ruzhou', 'xiangchengshi', 'yanshiqu', 'changyuan', 'huaxian', 'linzhou', 'qinyang', 'mengzhou', 'wenxian', 'weishixian', 'lankaoxian', 'tongxuxian', 'lyxinan', 'yichuan', 'mengjinqu', 'lyyiyang', 'wugang', 'yongcheng', 'suixian', 'luyi', 'yingchixian', 'shenqiu', 'taikang', 'shangshui', 'qixianq', 'junxian', 'fanxian', 'gushixian', 'huaibinxian', 'dengzhou', 'xinye', 'hrb', 'dq', 'qqhr', 'mdj', 'suihua', 'jms', 'jixi', 'sys', 'hegang', 'heihe', 'yich', 'qth', 'dxal', 'shanda', 'shzhaodong', 'zhaozhou', 'wh', 'yc', 'xf', 'jingzhou', 'shiyan', 'hshi', 'xiaogan', 'hg', 'es', 'jingmen', 'xianning', 'ez', 'suizhou', 'qianjiang', 'tm', 'xiantao', 'snj', 'yidou', 'hanchuan', 'zaoyang', 'wuxueshi', 'zhongxiangshi', 'jingshanxian', 'shayangxian', 'songzi', 'guangshuishi', 'chibishi', 'laohekou', 'gucheng', 'yichengshi', 'nanzhang', 'yunmeng', 'anlu', 'dawu', 'xiaochang', 'dangyang', 'zhijiang', 'jiayuxian', 'suixia', 'cs', 'zhuzhou', 'yiyang', 'changde', 'hy', 'xiangtan', 'yy', 'chenzhou', 'shaoyang', 'hh', 'yongzhou', 'ld', 'xiangxi', 'zjj', 'liling', 'lixian', 'czguiyang', 'zixing', 'yongxing', 'changningshi', 'qidongxian', 'hengdong', 'lengshuijiangshi', 'lianyuanshi', 'shuangfengxian', 'shaoyangxian', 'shaodongxian', 'yuanjiangs', 'nanxian', 'qiyang', 'xiangyin', 'huarong', 'cilixian', 'zzyouxian', 'sjz', 'bd', 'ts', 'lf', 'hd', 'qhd', 'cangzhou', 'xt', 'hs', 'zjk', 'chengde', 'dingzhou', 'gt', 'zhangbei', 'zx', 'zd', 'qianan', 'renqiu', 'sanhe', 'wuan', 'xionganxinqu', 'lfyanjiao', 'zhuozhou', 'hejian', 'huanghua', 'cangxian', 'cixian', 'shexian', 'bazhou', 'xianghe', 'lfguan', 'zunhua', 'qianxixian', 'yutianxian', 'luannanxian', 'shaheshi', 'su', 'nj', 'wx', 'cz', 'xz', 'nt', 'yz', 'yancheng', 'ha', 'lyg', 'taizhou', 'suqian', 'zj', 'shuyang', 'dafeng', 'rugao', 'qidong', 'liyang', 'haimen', 'donghai', 'yangzhong', 'xinghuashi', 'xinyishi', 'taixing', 'rudong', 'pizhou', 'xzpeixian', 'jingjiang', 'jianhu', 'haian', 'dongtai', 'danyang', 'baoyingx', 'guannan', 'guanyun', 'jiangyan', 'jintan', 'szkunshan', 'sihong', 'siyang', 'jurong', 'sheyang', 'funingxian', 'xiangshui', 'xuyi', 'jinhu', 'jiangyins', 'nc', 'ganzhou', 'jj', 'yichun', 'ja', 'sr', 'px', 'fuzhou', 'jdz', 'xinyu', 'yingtan', 'yxx', 'lepingshi', 'jinxian', 'fenyi', 'fengchengshi', 'zhangshu', 'gaoan', 'yujiang', 'nanchengx', 'fuliangxian', 'cc', 'jl', 'sp', 'yanbian', 'songyuan', 'bc', 'th', 'baishan', 'liaoyuan', 'gongzhuling', 'meihekou', 'fuyuxian', 'changlingxian', 'huadian', 'panshi', 'lishu', 'sy', 'dl', 'as', 'jinzhou', 'fushun', 'yk', 'pj', 'cy', 'dandong', 'liaoyang',
'benxi', 'hld', 'tl', 'fx', 'pld', 'wfd', 'dengta', 'fengcheng', 'beipiao', 'kaiyuan', 'yinchuan', 'wuzhong', 'szs', 'zw', 'guyuan', 'hu', 'bt', 'chifeng', 'erds', 'tongliao', 'hlbe', 'bycem', 'wlcb', 'xl', 'xam', 'wuhai', 'alsm', 'hlr', 'xn', 'hx', 'haibei', 'guoluo', 'haidong', 'huangnan', 'ys', 'hainan', 'geermushi', 'qd', 'jn', 'yt', 'wf', 'linyi', 'zb', 'jining', 'ta', 'lc', 'weihai', 'zaozhuang', 'dz', 'rizhao', 'dy', 'heze', 'bz', 'lw', 'zhangqiu', 'kl', 'zc', 'shouguang', 'longkou', 'caoxian', 'shanxian', 'feicheng', 'gaomi', 'guangrao', 'huantaixian', 'juxian', 'laizhou', 'penglai', 'qingzhou', 'rongcheng', 'rushan', 'tengzhou', 'xintai', 'zhaoyuan', 'zoucheng', 'zouping', 'linqing', 'chiping', 'hzyc', 'boxing', 'dongming', 'juye', 'wudi', 'qihe', 'weishan', 'yuchengshi', 'linyixianq', 'leling', 'laiyang', 'ningjin', 'gaotang', 'shenxian', 'yanggu', 'guanxian', 'pingyi', 'tancheng', 'yiyuanxian', 'wenshang', 'liangshanx', 'lijin', 'yinanxian', 'qixia', 'ningyang', 'dongping', 'changyishi', 'anqiu', 'changle', 'linqu', 'juancheng', 'ty', 'linfen', 'dt', 'yuncheng', 'jz', 'changzhi', 'jincheng', 'yq', 'lvliang', 'xinzhou', 'shuozhou', 'linyixian', 'qingxu', 'liulin', 'gaoping', 'zezhou', 'xiangyuanxian', 'xiaoyi', 'xa', 'xianyang', 'baoji', 'wn', 'hanzhong', 'yl', 'yanan', 'ankang', 'sl', 'tc', 'shenmu', 'hancheng', 'fugu', 'jingbian', 'dingbian', 'cd', 'mianyang', 'deyang', 'nanchong', 'yb', 'zg', 'ls', 'luzhou', 'dazhou', 'scnj', 'suining', 'panzhihua', 'ms', 'ga', 'zy', 'liangshan', 'guangyuan', 'ya', 'bazhong', 'ab', 'ganzi', 'anyuexian', 'guanghanshi', 'jianyangshi', 'renshouxian', 'shehongxian', 'dazu', 'xuanhan', 'qux', 'changningx', 'xj', 'changji', 'bygl', 'yili', 'aks', 'ks', 'hami', 'klmy', 'betl', 'tlf', 'ht', 'shz', 'kzls', 'ale', 'wjq', 'tmsk', 'kel', 'alt', 'tac', 'lasa', 'rkz', 'sn', 'linzhi', 'changdu', 'nq', 'al', 'rituxian', 'gaizexian', 'km', 'qj', 'dali', 'honghe', 'yx', 'lj', 'ws', 'cx', 'bn', 'zt', 'dh', 'pe', 'bs', 'lincang', 'diqing', 'nujiang', 'milexian', 'anningshi', 'xuanwushi', 'hz', 'nb', 'wz', 'jh', 'jx', 'tz', 'sx', 'huzhou', 'lishui',
'quzhou', 'zhoushan', 'yueqingcity', 'ruiancity', 'yiwu', 'yuyao', 'zhuji', 'xiangshanxian', 'wenling', 'tongxiang', 'cixi', 'changxing', 'jiashanx', 'haining', 'deqing', 'dongyang', 'anji', 'cangnanxian', 'linhai', 'yongkang', 'yuhuan', 'pinghushi', 'haiyan', 'wuyix', 'shengzhou', 'xinchang', 'jiangshanshi', 'pingyangxian']
URL_TEMPLATE = 'https://{addr}.58.com/ershouche/pn{page}'


def city_list():
    # For testing, only yield the first ten cities.
    for i in CITY[:10]:
        yield i


def generate_url():
    # For testing, only the first ten cities and list pages pn1-pn9 for each city.
    for i in CITY[:10]:
        for p in range(1, 10):
            yield URL_TEMPLATE.format(addr=i, page=p)
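
A quick way to sanity-check the generator is to print the first few URLs it yields (the first should be https://bj.58.com/ershouche/pn1):

if __name__ == '__main__':
    # Print the first five generated list-page URLs.
    for n, u in enumerate(generate_url()):
        print(u)
        if n >= 4:
            break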

6. The spider without Celery: download_v4.py

import requests
import re
import base64
from fontTools.ttLib import TTFont
from hashlib import md5
from time import sleep
from bs4 import BeautifulSoup
import csv
import aiohttp
import asyncio
from city import city_list
from pybloom_live import BloomFilter
from log_write import SpiderLog
from threading import Thread


class CarSpider:

    def __init__(self):
        self.spiderlog = SpiderLog()
        self.bf = BloomFilter(capacity=100000, error_rate=0.01)
        self.url_template = 'https://{addr}.58.com/ershouche/pn{page}'
        self.dic_font = {'856c80c30a9c2100282e94be2ef01a1a': 3, '4c12e2ca6ab31a1832549d3a2661cee9': 2, '221ce0f06ec2094938778887f59c096c': 1, '0edc309270450f4e144f1fa90a633a72': 0, 'a06d9a83fde2ea9b2fd4b8c0e92da4d9': 7,
                    'fe91949296531c26783936c17da4c896': 6, '0d0fd3a2d04e61526662b13c2db00537': 5, '0958ad9f2976dce5451697bef0227a0f': 4, 'bf3f23b53cb12e04d67b3f141771508d': 9, '9de9732e406d7025c0005f2f9cec817a': 8}
        self.headers = {
            'Origin': 'https://tj.58.com',
            'Referer': 'https://c.58cdn.com.cn/escstatic/upgrade/zhuzhan_pc/ershouche/ershouche_list_v20200622145811.css',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
        }
        self.thread_loop = asyncio.new_event_loop()

    async def downHTML(self, url, session, count=0):
        # Download one list page; retry up to three times on a non-2xx status.
        try:
            await asyncio.sleep(5)
            async with session.get(url, headers=self.headers) as resp:
                if str(resp.status)[0] == '2':
                    if url not in self.bf:
                        self.bf.add(url)
                        self.spiderlog.info('crawling: {url}'.format(url=url))
                        await asyncio.sleep(3)
                        return await resp.text()
                elif count < 3:
                    return await self.downHTML(url, session, count + 1)
                return None
        except Exception as e:
            self.spiderlog.info(e)

    def getTempdict(self, html):

        pattern = re.compile(r"charset=utf-8;base64,(.*?)//wAP", re.S)
        try:
            ttf_url = re.search(pattern, html)
            content = base64.b64decode(ttf_url.group(1)+'=')
            with open('tc.ttf', 'wb') as f:
                f.write(content)

            font = TTFont('tc.ttf')
            temp_dict = {}
            for i, k in enumerate(font.getGlyphOrder()):
                if i == 0:
                    continue
                coor = font['glyf'][k].coordinates
                m = md5(str(coor).encode()).hexdigest()
                k = k.lower().replace('uni00', '&#x')
                k = k.replace('uni', '&#x')
                temp_dict[k.lower()] = self.dic_font[m]
            return temp_dict
        except Exception as e:
            self.spiderlog.info(e)
    # &#x2f;.&#x4e07,&#xa5;&#x65f6;.&#x2d

    def parseHtml(self, html, temp_dict):
        for k, v in temp_dict.items():
            html = html.replace(k, str(v))
        # res_dic = {}
        try:
            soup = BeautifulSoup(html, 'lxml')
            city = re.search(
                r'<title>【(.*?)二手车.*?二手车交易市场.*?58同城</title>', html).group(1)
            prices = soup.select('.info--price b')

            info = soup.select('.info_params')
            title = soup.select('.info_title>span')
            tag = soup.select('div.info--desc div:nth-of-type(1)')
        except Exception as e:
            self.spiderlog.info(e)
            return
        for p, i, t, ta in zip(prices, info, title, tag):
            item = {}
            item['城市'] = city
            item['价格'] = p.get_text().replace(';', '')
            item['车型'] = t.get_text().split('急')[0].strip()
            i = i.get_text("\n", strip=True).split('\n')
            ta = '_'.join(ta.get_text().strip().split('\n'))
            item['上牌时间'] = i[0]
            item['里程'] = i[2]
            item['tag'] = ta
            yield item

    async def save(self, item):
        with open('car.csv', 'a', encoding='utf-8', newline='') as f:
            fieldname = ['城市', '价格', '车型', '上牌时间', '里程', 'tag']

            writer = csv.DictWriter(f, fieldnames=fieldname)
            # Only write the header row once, when the file is still empty.
            if f.tell() == 0:
                writer.writerow({'城市': '城市', '价格': '价格(万)', '车型': '车型',
                                 '上牌时间': '上牌时间', '里程': '里程', 'tag': 'tag'})
            for i in item:
                writer.writerow(i)

    async def main(self, url):
        async with aiohttp.ClientSession() as session:
            html_detail = await self.downHTML(url, session)
            if html_detail:
                temp_dict = self.getTempdict(html_detail)
                item = self.parseHtml(html_detail, temp_dict)
                await self.save(item)
    
    async def add_task(self, url):
        asyncio.run_coroutine_threadsafe(self.main(url), self.thread_loop)


    def start_loop(self, loop):
        asyncio.set_event_loop(loop)
        loop.run_forever()

    def run(self, url):
        # Run the dedicated event loop in a background thread, then schedule the crawl onto it.
        athread = Thread(target=self.start_loop, args=(self.thread_loop,))
        athread.start()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.add_task(url))


if __name__ == "__main__":
    cs = CarSpider()
    

7. The spider used with Celery: download_v3.py

import requests
import re
import base64
from fontTools.ttLib import TTFont
from hashlib import md5
from time import sleep
from bs4 import BeautifulSoup
import csv
import aiohttp
import asyncio
from city import city_list
from pybloom_live import BloomFilter
from log_write import SpiderLog
from threading import Thread


class CarSpider:

    def __init__(self):
        self.spiderlog = SpiderLog()
        self.bf = BloomFilter(capacity=100000, error_rate=0.01)
        self.url_template = 'https://{addr}.58.com/ershouche/pn{page}'
        self.dic_font = {'856c80c30a9c2100282e94be2ef01a1a': 3, '4c12e2ca6ab31a1832549d3a2661cee9': 2, '221ce0f06ec2094938778887f59c096c': 1, '0edc309270450f4e144f1fa90a633a72': 0, 'a06d9a83fde2ea9b2fd4b8c0e92da4d9': 7,
                    'fe91949296531c26783936c17da4c896': 6, '0d0fd3a2d04e61526662b13c2db00537': 5, '0958ad9f2976dce5451697bef0227a0f': 4, 'bf3f23b53cb12e04d67b3f141771508d': 9, '9de9732e406d7025c0005f2f9cec817a': 8}
        self.headers = {
            'Origin': 'https://tj.58.com',
            'Referer': 'https://c.58cdn.com.cn/escstatic/upgrade/zhuzhan_pc/ershouche/ershouche_list_v20200622145811.css',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
        }
        self.thread_loop = asyncio.new_event_loop()

    async def downHTML(self, url, session, count=0):
        # Download one list page; retry up to three times on a non-2xx status.
        try:
            await asyncio.sleep(5)
            async with session.get(url, headers=self.headers) as resp:
                if str(resp.status)[0] == '2':
                    if url not in self.bf:
                        self.bf.add(url)
                        self.spiderlog.info('crawling: {url}'.format(url=url))
                        await asyncio.sleep(3)
                        return await resp.text()
                elif count < 3:
                    return await self.downHTML(url, session, count + 1)
                return None
        except Exception as e:
            self.spiderlog.info(e)

    def getTempdict(self, html):

        pattern = re.compile(r"charset=utf-8;base64,(.*?)//wAP", re.S)
        try:
            ttf_url = re.search(pattern, html)
            content = base64.b64decode(ttf_url.group(1)+'=')
            with open('tc.ttf', 'wb') as f:
                f.write(content)

            font = TTFont('tc.ttf')
            temp_dict = {}
            for i, k in enumerate(font.getGlyphOrder()):
                if i == 0:
                    continue
                coor = font['glyf'][k].coordinates
                m = md5(str(coor).encode()).hexdigest()
                k = k.lower().replace('uni00', '&#x')
                k = k.replace('uni', '&#x')
                temp_dict[k.lower()] = self.dic_font[m]
            return temp_dict
        except Exception as e:
            self.spiderlog.info(e)
    # &#x2f;.&#x4e07,&#xa5;&#x65f6;.&#x2d

    def parseHtml(self, html, temp_dict):
        for k, v in temp_dict.items():
            html = html.replace(k, str(v))
        # res_dic = {}
        try:
            soup = BeautifulSoup(html, 'lxml')
            city = re.search(
                r'<title>【(.*?)二手车.*?二手车交易市场.*?58同城</title>', html).group(1)
            prices = soup.select('.info--price b')

            info = soup.select('.info_params')
            title = soup.select('.info_title>span')
            tag = soup.select('div.info--desc div:nth-of-type(1)')
        except Exception as e:
            self.spiderlog.info(e)
            return
        for p, i, t, ta in zip(prices, info, title, tag):
            item = {}
            item['城市'] = city
            item['价格'] = p.get_text().replace(';', '')
            item['车型'] = t.get_text().split('急')[0].strip()
            i = i.get_text("\n", strip=True).split('\n')
            ta = '_'.join(ta.get_text().strip().split('\n'))
            item['上牌时间'] = i[0]
            item['里程'] = i[2]
            item['tag'] = ta
            yield item

    async def save(self, item):
        with open('car.csv', 'a', encoding='utf-8', newline='') as f:
            fieldname = ['城市', '价格', '车型', '上牌时间', '里程', 'tag']

            writer = csv.DictWriter(f, fieldnames=fieldname)
            # Only write the header row once, when the file is still empty.
            if f.tell() == 0:
                writer.writerow({'城市': '城市', '价格': '价格(万)', '车型': '车型',
                                 '上牌时间': '上牌时间', '里程': '里程', 'tag': 'tag'})
            for i in item:
                writer.writerow(i)

    async def main(self, url):
        async with aiohttp.ClientSession() as session:
            html_detail = await self.downHTML(url, session)
            if html_detail:
                temp_dict = self.getTempdict(html_detail)
                item = self.parseHtml(html_detail, temp_dict)
                await self.save(item)
    
    async def add_task(self, url):
        asyncio.run_coroutine_threadsafe(self.main(url), self.thread_loop)


    def start_loop(self, loop):
        asyncio.set_event_loop(loop)
        loop.run_forever()

    def run(self, url):
        # Run the dedicated event loop in a background thread, then schedule the crawl onto it.
        athread = Thread(target=self.start_loop, args=(self.thread_loop,))
        athread.start()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.add_task(url))


if __name__ == "__main__":
    cs = CarSpider()
    

8. Logging: log_write.py

import logging
import getpass
import sys


class SpiderLog(object):

    # The logger is a singleton so every part of the spider writes to the same log file.
    def __new__(cls):
        if not hasattr(cls, '_instance'):
            cls._instance = super(SpiderLog, cls).__new__(cls)
        return cls._instance

    def __init__(self):
        self.user = getpass.getuser()
        self.logger = logging.getLogger(self.user)
        # __init__ runs on every instantiation of the shared instance; skip re-adding handlers.
        if self.logger.handlers:
            return
        self.logger.setLevel(logging.DEBUG)
        # Log file name, derived from the running script (e.g. download_v3.py -> download_v3.log)
        self.logFile = sys.argv[0][0:-3] + '.log'
        self.formatter = logging.Formatter(
            '%(asctime)-12s %(levelname)-8s %(name)-10s %(message)-12s\r\n')

        # Write records to the log file.
        self.logHand = logging.FileHandler(self.logFile, encoding='utf8')
        self.logHand.setFormatter(self.formatter)
        self.logHand.setLevel(logging.DEBUG)

        # Attach the handler.
        self.logger.addHandler(self.logHand)

    def info(self, msg):
        self.logger.info(msg)

  

if __name__ == '__main__':
    spiderlog = SpiderLog()
    spiderlog.info("test")

Open questions

All of the code is above, but a few problems remain.
1. When the crawler reaches the verification page, even a successful slide does not redirect back to the 58 listing page, although verifying manually does.
2. 58 also uses a click-based CAPTCHA after the slide in some cases, which is not handled here.
3. After aiohttp.ClientSession.get starts sending requests, about forty threads suddenly appear (the debugger shows them as ThreadPoolExecutor threads), and in a separate test, the more tasks there were, the more threads were created. This causes a further problem: when one response requires verification, the other threads have already called session.get, so their responses also contain the "please enter the verification code" page, and once the first verification finishes they line up to verify one after another; without extra handling, every one of those threads would have to pass the slider verification. The workaround used here is to take the URL that triggered verification plus the following 39 URLs (those 39 skip verification and abandon their requests) and re-add them all to the task queue via run_coroutine_threadsafe.
What puzzles me is this: is a site that requires verification simply a poor fit for aiohttp, or is my code at fault? And if the code is fine, what would be a better way to handle this? I would be grateful for any answers, and discussion is welcome.

Closing remarks

All of the above is for learning and exchange only. I hope others will join the discussion, and suggestions are very welcome. Please credit the source if you repost.
