Python Crawler Notes (2): Multithreaded Crawlers, Regular Expressions, and Multiprocess Crawlers

Latest recommended article published on 2024-05-02 20:05:43

菜到怀疑人生

Article link: https://blog.csdn.net/dhaiuda/article/details/80980252

Regular Expressions

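First, a brief introduction to regular expressions (their theory and the algorithms behind them will be covered later, when I have time).

In Python, the following function is commonly used to return the items matched by a regular expression (import re before using it):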

# pattern is the regular expression, string is the text to search; returns a list of matches
re.findall(pattern, string, flags=0)

Regular expression syntax commonly used in crawlers:

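. matches any character except a newline (with re.DOTALL it matches a newline too)

^ matches the start of a line; as the first character inside a bracket expression it instead negates the set, so the listed characters are not accepted

$ matches the end of a line

\A matches the start of the input

The difference between \A and ^: \A matches only at the very start of the input, while ^ matches at the start of every line. Note that unless multiline mode is set (with re.MULTILINE), \n is treated as an ordinary character, so ^ also matches only at the start of the input.

* matches the preceding item zero or more times

+ matches the preceding item one or more times

? matches the preceding item zero times or once

\d matches a digit

\ escapes the next character; for example, to match a literal $, write \$

{m,n} matches the preceding item from m to n times

[] defines a character set; any one of the characters inside it matches

| means "or"; the alternatives are tried from left to right and the first one that matches is accepted, even if a later alternative would match more text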
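A minimal sketch of the syntax listed above, using re.findall on a made-up HTML fragment. The fragment and the variable names are assumptions for illustration; they are not taken from the crawler in this post.

```python
import re

# Hypothetical HTML fragment, invented for this illustration
html = '<img src="a1.jpg" title="cat"><img src="b22.png"><img src="c333.jpg" title="dog">'

# [^"] (negated set), * and \. together: capture every src value ending in .jpg
jpg_urls = re.findall(r'src="([^"]*\.jpg)"', html)
print(jpg_urls)  # ['a1.jpg', 'c333.jpg']

# \d with {m,n}: runs of two to three digits
digit_runs = re.findall(r'\d{2,3}', html)
print(digit_runs)  # ['22', '333']

# | tries its alternatives left to right and accepts the first that matches
first_alt = re.findall(r'jp|jpeg', 'x.jpeg')
print(first_alt)  # ['jp']

# ^ matches at every line start only with re.MULTILINE; \A never does
text = 'one\ntwo'
print(re.findall(r'^\w+', text))                 # ['one']
print(re.findall(r'^\w+', text, re.MULTILINE))   # ['one', 'two']
print(re.findall(r'\A\w+', text, re.MULTILINE))  # ['one']
```

When the pattern contains one capturing group, as in the first call, findall returns the group contents rather than the whole match, which is usually what a crawler wants.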

Multithreaded Crawler

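What is a thread? A process contains one or more threads, and those threads share the process's resources (memory, for example). When the work is divided sensibly and the threads are not competing for the same resource, they can run concurrently, and because they share memory, communication between threads is relatively simple. Multithreading can, however, lead to deadlocks and to values being overwritten unexpectedly (a race condition). A deadlock occurs when each thread holds one resource while waiting for a resource held by another thread, so that none of them can proceed. As an example of the overwriting problem: suppose threads A and B use the same memory location. A writes the string 菜到怀疑人生 there, then B writes a different string to the same location; when A reads the value back, it finds that it has been changed.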
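One common way to guard against the overwriting problem is to protect the shared value with a lock. The sketch below is only an illustration: run_counter and its parameters are hypothetical names, and the crawler later in this post does not use a lock.

```python
import threading


def run_counter(num_threads, increments):
    """Increment one shared counter from several threads, guarded by a lock."""
    state = {'count': 0}
    lock = threading.Lock()

    def add_many():
        for _ in range(increments):
            # Only one thread at a time may perform the read-modify-write
            with lock:
                state['count'] += 1

    threads = [threading.Thread(target=add_many) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state['count']


total = run_counter(4, 10000)
print(total)  # 40000
```

Without the lock, two threads can read the same old value and both write back old + 1, losing an update. Always acquiring locks in the same order is the usual way to avoid the deadlock case.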

This uses the threading module of Python 3.6; the more important functions are:

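# Create a thread; target is the function the thread will run, args are its arguments
thread=threading.Thread(target=demo,args=(1,2,3))
# Mark it as a daemon thread, which is stopped when the main thread exits; this must be set before start()
thread.daemon=True
# Start the thread, i.e. run the function given by target
thread.start()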

Next, a concrete example: crawling photos of Dilraba (迪丽热巴) from the win4000 (美桌) site.

First, analyze the page:

http://www.win4000.com/mt/dilireba_1.html

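You will find that the img tags holding Dilraba's photos all have a title attribute, while the other img tags do not, which makes this case pleasantly simple.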
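The filtering rule just described (keep only img tags that have a title containing the name) can be tried on a small hand-written fragment first. This sketch uses the standard library's html.parser instead of the lxml XPath used in the crawler, purely so that it runs without extra dependencies; the sample HTML, the example.com URLs and the TitledImageParser class are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class TitledImageParser(HTMLParser):
    """Collect (src, title) for img tags whose title contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr_map = dict(attrs)
        title = attr_map.get('title')
        # Skip images without a title, without a src, or with an unrelated title
        if title and self.keyword in title and 'src' in attr_map:
            self.images.append((attr_map['src'], title))


# Hypothetical page fragment, not copied from the real site
sample = '''
<img src="/logo.png">
<img src="http://example.com/1.jpg" title="迪丽热巴 photo 1">
<img src="http://example.com/2.jpg" title="someone else">
'''

parser = TitledImageParser('迪丽热巴')
parser.feed(sample)
print(parser.images)  # [('http://example.com/1.jpg', '迪丽热巴 photo 1')]
```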

import threading
from lxml import etree
from collections import deque
from pybloom_live import BloomFilter
from urllib import request
import time


class imgInfo:
    
    def __init__(self,url,title):
        self.url=url
        self.title=title

class clawer:
    
    def __init__(self,image_file_name):
        self.image_file_name=image_file_name
    
    request_header={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Connection': 'close'
        }     
    bfdownload=BloomFilter(1024*1024,0.01)
    cur_que=deque()
    
    def getImageUrl(self,url):
        req=request.Request(url,headers=self.request_header)
        response=request.urlopen(req)
        html_page=response.read()
        html=etree.HTML(html_page.lower().decode('utf-8'))
        img_list=html.xpath('//img[@title]')
        for img in img_list:
            if '迪丽热巴' in img.attrib['title'] and img.attrib['src'] not in self.bfdownload:
                self.cur_que.append(imgInfo(img.attrib['src'],img.attrib['title']))
                self.bfdownload.add(img.attrib['src'])        
        
    def getImge(self,url,title):
        req=request.Request(url,headers=self.request_header)
        response=request.urlopen(req)
        html_page=response.read()
        file=open(self.image_file_name+title+'.jpg','wb')
        file.write(html_page)
        file.close()
    
    
if __name__=='__main__':
    clw=clawer('C:\\Users\\lzy\\Desktop\\图片\\')
    clw.getImageUrl('http://www.win4000.com/mt/dilireba_2.html')
    thread_pool=[]
    max_thread=10
    start=time.time()
    # Keep going until the queue of image URLs is empty
    while clw.cur_que:
        # Multithreaded version
        # Drop finished threads by rebuilding the list (removing while iterating skips items)
        thread_pool=[t for t in thread_pool if t.is_alive()]
        if len(thread_pool)>=max_thread:
            # All slots are busy; wait briefly instead of spinning
            time.sleep(0.01)
            continue
        imgif=clw.cur_que.popleft()
        thread=threading.Thread(target=clawer.getImge,name=None,args=(clw,imgif.url,imgif.title))
        thread_pool.append(thread)
        thread.daemon=True
        thread.start()
        # Single-threaded version, for comparison:
        # clw.getImge(imgif.url,imgif.title)
    # Wait for the remaining downloads; otherwise the daemon threads are killed
    # when the main thread exits and the measured time is too short
    for t in thread_pool:
        t.join()
    print(time.time()-start)

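The code is fairly simple, so I did not comment it. Let's look at the running time.

Result:

Now the single-threaded running time:

In my run, the multithreaded version was roughly ten times faster.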

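Python multithreading is a form of pseudo-parallelism. Because of Python's GIL (global interpreter lock), threads compete for the GIL and only the thread holding it runs; since there is only one GIL, only one thread executes Python bytecode at any instant. In Python 3.x (3.2 and later) a running thread releases the GIL once its time slice (the switch interval, 5 ms by default) expires, and the threads then compete for it again (for more, see https://zhuanlan.zhihu.com/p/20953544 ). So why did our code really run faster? A crawler is an I/O-bound program, which means it spends time waiting: requesting a URL means waiting for the server to respond, and writing a file takes time as well. A thread blocked on I/O releases the GIL, and the CPU would otherwise sit idle during that wait, so the multithreaded crawler is faster because other threads make good use of the waiting time. Multithreading does not necessarily speed a program up if the program has no waiting time, as with CPU-bound programs such as machine learning (for details, see https://www.zhihu.com/question/37396742 ).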
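The effect of overlapping waits can be seen without touching the network. In this sketch time.sleep stands in for the server's response time (sleeping releases the GIL, as blocking I/O does); fake_download and the 0.05 s delay are assumptions chosen for illustration.

```python
import threading
import time


def fake_download(i):
    # Stand-in for a network request: the thread just waits
    time.sleep(0.05)
    return i * 2


def run_sequential(n):
    start = time.perf_counter()
    results = [fake_download(i) for i in range(n)]
    return results, time.perf_counter() - start


def run_threaded(n):
    results = [None] * n

    def worker(i):
        results[i] = fake_download(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


seq_results, seq_elapsed = run_sequential(8)
thr_results, thr_elapsed = run_threaded(8)
print(seq_results == thr_results)
print('sequential: %.2fs, threaded: %.2fs' % (seq_elapsed, thr_elapsed))
```

The sequential run waits eight times in a row, while the threaded run waits roughly once because all eight waits overlap. If fake_download did pure computation instead of sleeping, the GIL would remove most of that gain.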

Multiprocess Crawler

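Processes differ from threads: a thread is a unit of execution inside a process, while a process is the unit to which the operating system allocates resources, and each process has its own memory space. Deadlocks and overwritten values are still possible, because processes may access a shared area (a file or a database, for example) at the same time. On a single-core CPU only pseudo-parallelism is possible, that is, parallel at the macro level but serial at the micro level: each process runs for a short slice, say 100 ms, before the CPU switches to another one, and because the switching is fast the processes appear to run in parallel. CPUs today are nearly all multi-core, though, so true parallelism at the micro level is achievable.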

This uses the multiprocessing package of Python 3.6, which tripped me up for quite a while. We use its process pool; the more important functions are:

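# Returns a process pool object; the number of processes can be specified and defaults to the number of CPU cores (my machine has four cores, so the default is 4)
p=Pool()
'''
Runs function f with the arguments given by args. There is also a callback parameter, a function that takes a single argument (the result). The call is non-blocking: even when the pool is full, the main process does not block but returns immediately and carries on, while the task waits for its turn. There is also a blocking version, apply
'''
p.apply_async(f,args=(1,2,3))
# Closes the pool; no further tasks can be submitted to it after this
p.close()
# Waits for all worker processes to finish (call close() first); without it the main process exits right away and the workers are terminated whether or not they have finished
p.join()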

Official documentation: https://docs.python.org/2/library/multiprocessing.html
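A minimal sketch of the pool calls described above, assuming two toy task functions (square and fail) and a hypothetical report_error callback, none of which belong to the crawler. It also shows one way to find out that a task failed: apply_async returns a result object, and calling get() on it re-raises the worker's exception in the main process.

```python
from multiprocessing import Pool


def square(x):
    return x * x


def fail(x):
    # A task that always raises, to show how errors surface
    raise ValueError('bad input: %r' % (x,))


def report_error(exc):
    # Runs in the main process when a task raises
    print('worker failed:', exc)


if __name__ == '__main__':
    pool = Pool(processes=2)
    # apply_async returns immediately with a result object for each task
    pending = [pool.apply_async(square, args=(n,)) for n in range(5)]
    bad = pool.apply_async(fail, args=(1,), error_callback=report_error)
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for the workers to finish

    print([r.get() for r in pending])  # [0, 1, 4, 9, 16]
    try:
        bad.get()
    except ValueError as exc:
        print('re-raised in the main process:', exc)
```

The `if __name__ == '__main__':` guard matters on Windows, where each worker starts by importing the main module; without the guard every worker would try to create its own pool.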

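To learn about database connection pools in Python, I implemented a multiprocess crawler built on one. For the task of crawling Dilraba's photos a database is not needed at all; adapting the multithreaded code would be enough.

Database statement: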

create table image
(indexs int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
 url varchar(256) NOT NULL UNIQUE KEY,
 status varchar(16) NOT NULL DEFAULT 'new',
 title varchar(256) NOT NULL );

Database connection pool:

import mysql.connector.pooling
import mysql.connector.connection


class db_manager:  
    
    # Create the database connection pool
    def __init__(self,database,user,password,host):
        try:
            dbconfig = {
                "user":       user,
                "password":   password,
                "host":       host,
                "port":       3306,
                "database":   database,
                "charset":    "utf8"
            }
            self.conpool=mysql.connector.pooling.MySQLConnectionPool(pool_name='image',pool_size=10,**dbconfig)
        except Exception as arg:
            # Report the actual error rather than a fixed message
            print('Failed to create the connection pool:',arg)
    
    # Fetch one uncrawled URL and set its status to 'download'; remember to commit
    def getUrl(self):
        con=self.conpool.get_connection()
        cursor=con.cursor()
        try:
            sql="select url,indexs,title from image where status='new' limit 1"
            # execute() itself does not return the rows; fetch them with fetchone()
            cursor.execute(sql)
            result=cursor.fetchone()
            # Set status to 'download' (parameterized query). When no rows are left,
            # result is None, result[1] raises TypeError, and the handler returns None
            sql="update image set status='download' where indexs=%s"
            cursor.execute(sql,(result[1],))
            con.commit()
        except Exception as Arg:
            con.rollback()
            print(Arg)
            return None
        finally:
            # Close the cursor first, then return the connection to the pool
            cursor.close()
            con.close()
        return result
    
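    # Insert a URL record; used on the first pass, which collects the src of each img tag
    def inserturl(self,url,title):
        con=self.conpool.get_connection()
        cursor=con.cursor()
        try:
           # Parameterized query, so quotes in url or title cannot break the statement
           sql='insert into image(url,title) values(%s,%s)'
           cursor.execute(sql,(url,title))
           con.commit()
        except Exception as Arg:
           con.rollback()
           print(Arg)
        finally:
            # Close the cursor first, then return the connection to the pool
            cursor.close()
            con.close()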
    
    
    # After crawling is done, set the status to 'finish'
    def finishCrawle(self,index):
        con=self.conpool.get_connection()
        cursor=con.cursor()
        try:
            # Parameterized query; the connector fills in the index value
            sql="update image set status='finish' where indexs=%s"
            cursor.execute(sql,(index,))
            con.commit()
        except Exception as Arg:
            con.rollback()
            print(Arg)
        finally:
            # Close the cursor first, then return the connection to the pool
            cursor.close()
            con.close()

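To avoid the overwriting problem between processes (the database is a shared area: if processes A and B read URL c at almost the same time, both of them will crawl URL c), a single main process fetches the URLs and hands them to child processes for crawling. The code is as follows: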

from lxml import etree
from urllib import request
from multiprocessing import Pool
import dbmanager
import os
import time

class clawer:
    request_header={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Connection': 'close'
        }  
    
	
    def __init__(self,image_catalog):
        self.image_catalog=image_catalog

    # First pass: collect the addresses of all images
    def getImageUrl(self,url):
        req=request.Request(url,headers=self.request_header)
        response=request.urlopen(req)
        html_page=response.read()
        html=etree.HTML(html_page.lower().decode('utf-8'))
        img_list=html.xpath('//img[@title]')
        for img in img_list:
            try:
                dbm.inserturl(img.attrib['src'],img.attrib['title'])
            except Exception as arg:
                print(arg)
                continue
    
    def getImage(self,url,title):
        print('Current process id: '+str(os.getpid()))
        req=request.Request(url,headers=self.request_header)
        response=request.urlopen(req)
        picture=response.read()       
        file=open(self.image_catalog+title+'.jpg','wb')
        file.write(picture)
        file.close()
		
if __name__=='__main__':
    dbm=dbmanager.db_manager('imagedb','uer','12345','127.0.0.1')
    clw=clawer('C:\\Users\\lzy\\Desktop\\图片\\')
    clw.getImageUrl('http://www.win4000.com/mt/dilireba_4.html')
    p=Pool()
    indexss=[]
    start=time.time()
    while True:
        # Only the main process takes URLs from the database, so no two workers get the same one
        tar=dbm.getUrl()
        if tar is None:
            break
        # Single-process version, for comparison:
        #clw.getImage(tar[0],tar[2])
        p.apply_async(clawer.getImage,args=(clw,tar[0],tar[2],))
        indexss.append(tar[1])
    p.close()
    p.join()
    print('All worker processes have finished')
    for indexs in indexss:
        dbm.finishCrawle(indexs)
    end=time.time()
    print(end-start)
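The design above avoids the double-crawl problem by letting only the main process read from the table. A different approach, shown here only as a sketch and not what this post's code does, is to make the claim itself conditional, so that it stays safe even if several workers claim rows. The example uses the standard library's sqlite3 with an in-memory table as a stand-in for MySQL so that it can run anywhere; make_db and claim_next are hypothetical helper names.

```python
import sqlite3


def make_db(urls):
    """Create an in-memory table shaped like the image table and fill it."""
    con = sqlite3.connect(':memory:')
    con.execute(
        "create table image ("
        " indexs integer primary key autoincrement,"
        " url text not null unique,"
        " status text not null default 'new',"
        " title text not null)"
    )
    con.executemany('insert into image(url,title) values(?,?)',
                    [(u, 'title') for u in urls])
    con.commit()
    return con


def claim_next(con):
    """Return (url, indexs, title) of one claimed row, or None when none are left."""
    while True:
        row = con.execute(
            "select url,indexs,title from image"
            " where status='new' order by indexs limit 1").fetchone()
        if row is None:
            return None
        # The update succeeds only if the row is still 'new'
        cur = con.execute(
            "update image set status='download'"
            " where indexs=? and status='new'", (row[1],))
        con.commit()
        if cur.rowcount == 1:
            return row
        # Someone else claimed this row first; look for another one


con = make_db(['u1', 'u2', 'u3'])
claimed = []
while True:
    item = claim_next(con)
    if item is None:
        break
    claimed.append(item[0])
print(claimed)  # ['u1', 'u2', 'u3']
```

The status check in the WHERE clause is what prevents a row from being claimed twice: a second claimer's update would affect zero rows, and it would move on to the next candidate.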

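CPU utilization reaches 100%, because all the cores are in use.

The output is as follows:

Note the point at which the exception is printed: by then every URL has already been handed to a child process.

Next, the single-process run:

The running time is clearly longer.

So why is the multiprocess example slower than the multithreaded one? Because the multiprocess version reads from and writes to the database, which means disk operations, while the multithreaded example works directly in memory. To some extent this shows how much slower disk operations are than memory operations.

How fast is multiprocessing when it also works in memory? With four processes the speed is as follows:

It is very fast, more than ten times faster than the multithreaded version.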

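Exceptions Encountered

In the end these all came down to my not knowing Python and SQL well enough:

1. Unknown column 'new' in 'where clause': this kind of error is usually caused by leaving out the quotes around a string value in the SQL statement. Change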

select url,indexs,title from image where status=new limit 1

to

select url,indexs,title from image where status="new" limit 1

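2. tuple indices must be integers or slices, not str: tuples and lists must be accessed by integer index. For example, fetchone() returns a row as a tuple, so use result[0] rather than result['url'].

3. A useful pairing: when iterating over the elements returned by xpath, their attributes can be read through attrib:

img_list=html.xpath('//img[@title]')
for img in img_list:
    dbm.inserturl(img.attrib['src'],img.attrib['title'])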

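4. Mysqlconnection doesn't has attribute curson: this one cost me a lot of time. The IDE never suggested cursor as a method, so the error seemed unsurprising to me, until I finally realized I had simply misspelled the name. From now on, when an exception appears, check the spelling first.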

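5. When calling apply_async with a method defined in a class, I passed it as ClassName.method rather than instance.method and included the instance in args, where it fills the self parameter. Python 3 can also pickle a bound method such as clw.getImage directly, provided the instance itself can be pickled, so this form is a choice rather than a requirement there:

#clw is the instance, clawer is the class name
p.apply_async(clawer.getImage,args=(clw,tar[0],tar[2],))

Check the method run by the child process carefully for errors. If it raises, nothing is reported in the main process (even when the main process has exception-handling code): apply_async stores the exception in the result object it returns and re-raises it only when get() is called on that result, or passes it to error_callback if one was given. So if a child process seems not to run at all, first check whether your own code is at fault.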
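On point 5, the two ways of passing a method can be compared without starting any processes, because what the pool needs is for the callable and its arguments to survive pickling. This is a sketch under that assumption; the Greeter class is a made-up stand-in for clawer.

```python
import pickle


class Greeter:
    """Hypothetical stand-in for the crawler class."""

    def __init__(self, prefix):
        self.prefix = prefix

    def greet(self, name):
        return self.prefix + name


g = Greeter('hello, ')

# Form used in this post: the function from the class, with the instance in args
func, args = pickle.loads(pickle.dumps((Greeter.greet, (g, 'pool'))))
via_class = func(*args)

# Bound-method form: pickled as the instance plus the method name
bound = pickle.loads(pickle.dumps(g.greet))
via_bound = bound('pool')

print(via_class)  # hello, pool
print(via_bound)  # hello, pool
```

Both forms call the same function with the same self, so the choice between them is mostly a matter of taste in Python 3. Either way the instance is copied into the worker, so attribute changes made there are not visible in the main process.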
