Life Is Short, I Use Python (4): A Multithreaded Crawler for WikiCFP

Article link: https://blog.csdn.net/llllllyyy/article/details/81613328

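The previous posts introduced two simple crawlers; once you understand them, you will see that they involve nothing beyond basic operations. In fact, the crawler in this post is the first one I wrote after picking up Python, and it is also a multithreaded crawler I am fairly satisfied with (experts, please go easy on me _(￣▽￣)*). It grew from plain downloading and parsing to setting proxies, headers, and socket timeouts; from regular expressions to XPath; from a single thread to multiple threads. If the whole journey taught me one thing, it is this: "Searching Baidu works wonders!"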

On to the main topic. The code is neither polished nor complete, but this post serves as a summary of what I learned on my own.

Concurrent programming and Python

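Because of the GIL (Global Interpreter Lock) in CPython, Python threads are concurrent but not parallel: only one thread executes Python bytecode at any moment. For CPU-bound work, multithreading therefore does not shorten the total running time, which is why there is so much criticism online along the lines of "Python multithreading is useless, use multiprocessing instead". My view is that it exists for a reason: Python multithreading has its limits, but it is far from worthless. I do not recommend skipping threads and going straight to processes, because threads sometimes work better where processes struggle, and an IO-bound crawler is an excellent use case for them.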

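If the threads are not truly parallel, that is, only one thread is working at any given time, why does a multithreaded program still appear so much faster than a single-threaded one?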

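The key to this question is finding out which part of the time multithreading saves. Take a crawler: it spends the vast majority of its time waiting for a socket to return data, and a thread blocked on IO releases the GIL. While thread A waits for the data of the current page, thread B can go ahead and request the next page; when thread B starts waiting, thread C can send the next request, and so on.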
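The effect can be seen without touching the network. The following is a minimal sketch, written for Python 3 (the article's own code targets Python 2), in which `fake_download` is a hypothetical stand-in that simply sleeps, the way a real request waits on a socket:

```python
import threading
import time


def fake_download(page, delay=0.2):
    """Hypothetical stand-in for a request: sleeping releases the GIL,
    just like waiting for a socket to return data."""
    time.sleep(delay)
    return page


def run_sequential(pages):
    """Download the pages one after another and return the elapsed time."""
    start = time.perf_counter()
    for page in pages:
        fake_download(page)
    return time.perf_counter() - start


def run_threaded(pages):
    """Download every page in its own thread and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(page,))
               for page in pages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    pages = range(5)
    print("sequential: %.2fs" % run_sequential(pages))
    print("threaded:   %.2fs" % run_threaded(pages))
```

The sequential run pays for every wait in turn, while the threaded run overlaps the waits, so its total time is close to that of a single download.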

Some personal experience:

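Python multithreading suits IO-bound programs; CPU-bound programs still call for multiprocessing.
When writing the crawler I started with multiprocessing, where the key difficulty is sharing and synchronizing data between processes.
With multithreading, you can use Python's Queue module directly. It provides Queue, a synchronized, thread-safe queue that implements the required locking internally. Using a queue to synchronize threads means you no longer have to call acquire() and release() on a threading.Lock or threading.RLock yourself, so the code is more concise.
My multiprocess crawler for WikiCFP is still not in good shape. Because multiple processes saturate all CPU cores, running it often froze my computer and made debugging awkward (admittedly, my lack of skill is the original sin ○|￣|_). On a quad-core CPU, a multithreaded crawler saturates at most one core, so I can keep doing other work while it runs.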
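As an illustration of queue-based synchronization, here is a small sketch in Python 3, where the module is named `queue` rather than `Queue`. The `worker` and `square_all` functions and the `None` sentinel convention are assumptions made for this example, not part of the crawler:

```python
import queue
import threading


def worker(tasks, results):
    """Consume numbers until the None sentinel arrives; no explicit Lock is used."""
    while True:
        item = tasks.get()          # blocks until an item is available
        if item is None:            # sentinel: no more work for this thread
            break
        results.put(item * item)


def square_all(numbers, n_threads=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for n in numbers:
        tasks.put(n)
    for _ in threads:
        tasks.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    out = []
    while not results.empty():      # safe here: every worker has finished
        out.append(results.get())
    return sorted(out)


if __name__ == "__main__":
    print(square_all(range(10)))
```

All coordination happens inside `get()` and `put()`; the threads never touch a lock directly.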

Two ways to use threads in Python

In general, there are two ways to use multithreading:

Write the function the thread should execute and pass it to a Thread object, which runs it;
Subclass Thread to create a new class and override its __init__() and run() methods.

For details, see http://blog.csdn.net/qq_15297487/article/details/48185743; the code there is simple and easy to follow.

CrawlerWikicfp uses the second approach; the code is at the end of this post.
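The two approaches can be put side by side in a short Python 3 sketch. The names `greet`, `Greeter`, and `demo` are made up for this illustration:

```python
import threading


def greet(name, out):
    """Way 1: a plain function handed to Thread via target=."""
    out.append("hello from " + name)


class Greeter(threading.Thread):
    """Way 2: a Thread subclass that overrides __init__() and run()."""

    def __init__(self, greet_name, out):
        super().__init__()          # always initialize the parent Thread first
        self.greet_name = greet_name
        self.out = out

    def run(self):                  # executed in the new thread by start()
        self.out.append("hello from " + self.greet_name)


def demo():
    out = []
    t1 = threading.Thread(target=greet, args=("function", out))  # way 1
    t2 = Greeter("subclass", out)                                # way 2
    for t in (t1, t2):
        t.start()
    for t in (t1, t2):
        t.join()
    return sorted(out)


if __name__ == "__main__":
    print(demo())
```

The subclass form is convenient when each thread carries its own state, which is the situation in the crawler, where every thread holds the shared queues and request headers.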

Crawling strategy

With the key techniques roughly settled, everything starts, as usual, with a crawling strategy.

· URL format

Opening several pages and comparing their URLs gives the pattern:
http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid= + {1~73000}
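In code, the pattern amounts to appending the event id to a fixed prefix. A minimal Python 3 sketch follows; the helper names `build_url` and `all_urls` are assumptions for illustration:

```python
BASE_URL = "http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid="


def build_url(event_id):
    """Return the page URL for one event id."""
    return BASE_URL + str(event_id)


def all_urls(first=1, last=73000):
    """Lazily yield the URL of every event id from first to last, inclusive."""
    # range() excludes its end point, hence last + 1
    return (build_url(i) for i in range(first, last + 1))


if __name__ == "__main__":
    print(build_url(1))
```

Note that `range(1, 73000)` on its own stops at 72999, so the upper bound needs the `+ 1` if the last id is meant to be included.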

· Fields to crawl

The target fields are shown in the figure:

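Ps: From top to bottom, the fields are: conference name, Link, every field in the table, and Categories. Because the table size varies from one conference to another, the crawler cannot grab only the values in the right-hand column; it must also capture the attribute names on the left, store each pair in a dictionary, and then write the dictionaries to a file or database using the method for dictionaries of unequal length described in an earlier post.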
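The earlier post is not reproduced here, so the following is only one possible way to write dictionaries with differing keys, sketched in Python 3 with `csv.DictWriter`: collect the union of all keys as the header and let `restval` fill the gaps. The function names and the sample rows are made up for illustration:

```python
import csv
import io


def write_rows(rows, fileobj, missing="N/A"):
    """Write dicts with differing keys to CSV; return the header that was used."""
    fieldnames = []
    for row in rows:                # union of keys, in first-seen order
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval=missing)
    writer.writeheader()
    writer.writerows(rows)          # missing keys are filled with restval
    return fieldnames


def demo():
    # Made-up sample rows, only to show the unequal keys
    rows = [
        {"When": "Jan 1", "Where": "Somewhere"},
        {"When": "Feb 2", "Submission Deadline": "Dec 1"},
    ]
    buf = io.StringIO()
    write_rows(rows, buf)
    return buf.getvalue().splitlines()


if __name__ == "__main__":
    for line in demo():
        print(line)
```

This needs all rows in memory before the header is known; the crawler below sidesteps that by fixing the header list in advance.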

· Invalid pages

Some pages contain no record: the conference entry has probably expired and been deleted, so such pages can simply be ignored.

Putting the analysis together, the crawling strategy is:

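Create two queues: queue A holds the page numbers (1~73000), and queue B stores the crawled data, with each element being a dictionary;
Use multiple threads that share queue A: each one dequeues a page number from queue A, builds the full URL, downloads the page with urllib2, and parses it with XPath to extract the target fields;
Organize the extracted fields into a dictionary and put it into queue B;
Dequeue the elements of queue B and write them to a CSV file.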
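The four steps can be sketched end to end in Python 3 before looking at the real code. This is an outline under stated assumptions, not the article's implementation: `fetch_record` is a hypothetical stub that replaces the download and XPath parsing, and it pretends that every fifth page is invalid:

```python
import csv
import io
import queue
import threading


def fetch_record(page_number):
    """Hypothetical stand-in for download + XPath parsing.
    Returns None for an invalid page (here: every fifth page)."""
    if page_number % 5 == 0:
        return None
    return {"Conference Name": "conf-%d" % page_number, "Link": "N/A"}


def crawl_worker(pages, records, fetch):
    while True:
        try:
            page = pages.get_nowait()   # step 2: dequeue a page number
        except queue.Empty:
            return                      # queue A is drained
        record = fetch(page)
        if record is not None:
            records.put(record)         # step 3: store the dict in queue B


def crawl(page_numbers, fileobj, fetch=fetch_record, n_threads=4):
    pages, records = queue.Queue(), queue.Queue()   # step 1: queues A and B
    for n in page_numbers:
        pages.put(n)
    threads = [threading.Thread(target=crawl_worker, args=(pages, records, fetch))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer = csv.DictWriter(fileobj, fieldnames=["Conference Name", "Link"])
    writer.writeheader()
    count = 0
    while not records.empty():          # step 4: drain queue B into the CSV
        writer.writerow(records.get())
        count += 1
    return count


if __name__ == "__main__":
    buf = io.StringIO()
    print(crawl(range(1, 11), buf), "records written")
```

Passing `fetch` as a parameter keeps the threading logic testable without network access.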

Implementation: the CrawlerWikicfp code

# coding:utf-8
'''
Created on 2018-01-19
 
@author: li_yan
'''
from threading import Thread
from Queue import Queue, Empty
import urllib2
import time
import csv
#import re
from lxml import etree
import sys
 
reload(sys)
sys.setdefaultencoding('utf-8')
 
exitFlag = 0
 
class WIKICFP(Thread):
    def __init__(self, name, url, q):     # url: queue of page numbers    q: queue for the scraped data
        # Initialize the parent Thread before adding our own attributes
        super(WIKICFP, self).__init__()
        self.name =name
        self.url = url
        self.q = q
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',}
 
    def run(self):
        print "starting "+ self.name
        while not exitFlag:
            self.getPage(self.url)
        print "exiting "+ self.name
 
    def getPage(self, url):
        try:
            # get_nowait() never blocks. Checking empty() and then calling get()
            # could hang forever if another thread took the last page number in between.
            num = url.get_nowait()
        except Empty:
            return
        else:
            try:
                base_url = 'http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid='
                #构造url
                newurl = base_url+str(num)
                request = urllib2.Request(newurl, headers=self.headers)
                response = urllib2.urlopen(request)
                pageCode = response.read()
                response.close()
 
                # Extract the information with XPath
 
                selector = etree.HTML(pageCode)
 
                # Detect invalid pages     /html/body/div[4]/center/h3
                emptypage = selector.xpath('/html/body/div[4]/center/h3')
 
                if not emptypage:     # the page holds a record
                    # Extract the conference name
                    #/html/body/div[4]/center/table/tbody/tr[2]/td/h2/span/span[7]/text()
                    #confname = selector.xpath('//span[@property="v:description"]/text()')
 
                    try:
                        confname = selector.xpath('//span[@property="v:description"]/text()')
                    except UnicodeDecodeError, e:
                        if hasattr(e, 'code'):
                            print "Error Code:", e.code
                        if hasattr(e,"reason"):
                            print "Error Reason:", e.reason
                        confname = []
                        confname.append("page "+str(num))
 
                    # Extract the conference series    /html/body/div[4]/center/table/tbody/tr[2]/td/a
                    try:
                        confser = selector.xpath('/html/body/div/center/table/tr/td/a[starts-with(@href,"/cfp/program")]/text()')
                        if confser==[]:
                            confser.append('N/A')
                    except UnicodeDecodeError, e:
                        if hasattr(e, 'code'):
                            print "Error Code:", e.code
                        if hasattr(e,"reason"):
                            print "Error Reason:", e.reason
                        confser = []
                        confser.append("page "+str(num))
 
                    # Extract the link    /html/body/div[4]/center/table/tbody/tr[3]/td/a
                    try:
                        link = selector.xpath('/html/body/div/center/table/tr/td/a[@target="_newtab"]/@href')
                        if link==[]:
                            link.append('N/A') 
                    except UnicodeDecodeError, e:
                        if hasattr(e, 'code'):
                            print "Error Code:", e.code
                        if hasattr(e,"reason"):
                            print "Error Reason:", e.reason
                        link = []
                        link.append("page "+str(num))
 
                    # Extract When, Where, Submission Deadline, etc.
                    #/html/body/div[4]/center/table/tr[5]/td/table/tr/td/table/tr[1]/td/table/tr/th    
                    th = selector.xpath('/html/body/div/center/table/tr/td/table/tr/td/table/tr/td/table/tr/th/text()')
                    tddata = selector.xpath('/html/body/div/center/table/tr/td/table/tr/td/table/tr/td/table/tr/td')
                    tds = []
                    for td in tddata:
                        try:
                            # string(.) concatenates all the text inside the cell
                            text = td.xpath('string(.)').replace('\n', '').strip()
                            tds.append(text)
                        except UnicodeDecodeError,e:
                            if hasattr(e, 'code'):
                                print "Error Code:", e.code
                            if hasattr(e,"reason"):
                                print "Error Reason:", e.reason
                            tds.append("page "+str(num))
 
                    table = dict(zip(th,tds))
 
                    # Extract the categories   /html/body/div[4]/center/table/tbody/tr[5]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/h5/a[2]
                    #categories = selector.xpath('/html/body/div/center/table/tr/td/table/tr/td/table/tr/td/table/tr/td/h5/a/text()')
                    try:
                        categories = selector.xpath('/html/body/div/center/table/tr/td/table/tr/td/table/tr/td/table/tr/td/h5/a/text()')
                        print categories
                    except UnicodeDecodeError, e:
                        if hasattr(e, 'code'):
                            print "Error Code:", e.code
                        if hasattr(e,"reason"):
                            print "Error Reason:", e.reason
                        categories = []
                        categories.append("page "+str(num))
 
                    # Store the remaining fields in the dict table
                    table['Conference Name'] = confname[0]
                    table['Conference Series'] = confser[0]
                    table['Link'] = link[0] 
                    table['Categories'] = categories
                    print table     
                    self.q.put(table)
 
                    print str(self.name),"getpage",str(num),"success."
 
                else:
                    print str(self.name), str(num)+" is empty page."
 
            except urllib2.URLError, e:
                if hasattr(e, 'code'):
                    print "Error Code:", e.code
                if hasattr(e,"reason"):
                    print "Error Reason:", e.reason
                    return None
 
def main():
    # Queue that stores the data collected by the threads
    q = Queue()
    #index_list = range(1,73000)
    # Memory limits how much can be queued, so crawl one range of pages per run
    index_list = range(10000,20000)
    workQueue = Queue(20000)
    threadList = ['thread-1', 'thread-2', 'thread-3','thread-4', 'thread-5',
                  'thread-6', 'thread-7', 'thread-8','thread-9', 'thread-10']
    # Keep track of the threads
    Thread_list = []
 
    # Fill the work queue with page numbers
    for index in index_list:
        workQueue.put(index)
 
    # Create the threads; their number is fixed by threadList
    for tName in threadList:
        thread = WIKICFP(tName, workQueue, q)
        thread.start()
        Thread_list.append(thread)
 
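    # Wait until the work queue is empty
    while not workQueue.empty():
        time.sleep(0.5)     # sleep instead of busy-waiting, which would compete with the workers for the GIL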
 
    global exitFlag
    exitFlag = 1
 
    # Make the main thread wait for the worker threads to finish
    for i in Thread_list:
        i.join()
 
    # Write to the CSV file
    headers = ['Conference Name','Conference Series','Link','Categories',
               'When','Where','Submission Deadline','Notification Due','Final Version Due','Abstract Registration Due']
 
    with open('wikicfp.csv', 'wb',) as f:
        # The headers are passed in here and written as the first row
        writer = csv.DictWriter(f, headers)
        writer.writeheader()
        while not q.empty():
            writer.writerow(q.get())
 
        # Several rows can also be written at once
        #writer.writerows(datas)
 
if __name__=="__main__": 
 
    start = time.time()
    main()
    print '[info] elapsed: %s s'%(time.time()-start)

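Ps:
Because of memory limits, the work queue (workQueue in the code, which holds the page numbers, i.e. queue A) is capped at 20000 entries. The 70000+ records therefore cannot be crawled in a single run (once again, my lack of skill is the original sin), and the following line has to be changed between runs: index_list = range(10000,20000)

Crawling 10000 conference records per run means running the program 7 times...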
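Instead of editing `index_list` by hand before each run, the ranges could be generated in a loop. A minimal Python 3 sketch, where the `batches` helper is an assumption for illustration and each yielded range would be handed to one crawl pass:

```python
def batches(first, last, size):
    """Yield consecutive ranges of at most `size` page numbers
    covering first..last, inclusive."""
    for start in range(first, last + 1, size):
        yield range(start, min(start + size, last + 1))


if __name__ == "__main__":
    for batch in batches(1, 25, 10):
        print(batch[0], "to", batch[-1])
```

Each batch still has to be written out before the next one starts, for example by opening the CSV file in append mode and writing the header only once.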

Crawler results

Reflections and possible optimizations

The approach above did meet the requirements of the project a senior student handed me, but it leaves a lot to be desired:

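Tweaking a program and running it 7 times to get the result is rather clumsy. A fix would be to start one more thread dedicated to managing queue B: when queue B reaches a certain size, pause the current thread and write the contents of queue B to the file, which prevents running out of memory. In practice there are many details to get right, so, being lazy, I simply ran it 7 times...
The error "error: [Errno 10054]" also shows up frequently. Errno 10054 means the remote host forcibly closed the connection, most likely because the server cut the crawler off after so many downloads. Setting headers to imitate a browser did not help much. Two approaches that might work: reduce the number of concurrent threads, or download through proxies.
Techniques used in the code: downloading with urllib2, parsing with XPath, multithreaded concurrency with the threading module, FIFO queues, and so on.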
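The dedicated writer thread from the first point can be sketched as follows in Python 3. This is a variant of the idea, not the author's code: rather than pausing threads explicitly, it relies on a bounded queue, whose `put()` blocks the producers whenever queue B is full. The names `DONE`, `writer_loop`, `produce`, and `run` are hypothetical, and `produce` fabricates placeholder records in place of real crawling:

```python
import csv
import io
import queue
import threading

DONE = object()   # sentinel telling the writer that crawling has finished


def writer_loop(records, fileobj, fieldnames):
    """Only this thread touches the file, so no extra lock is needed."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames,
                            restval="N/A", extrasaction="ignore")
    writer.writeheader()
    while True:
        record = records.get()      # blocks until a record (or DONE) arrives
        if record is DONE:
            break
        writer.writerow(record)


def produce(records, start, stop):
    """Placeholder for a crawling thread."""
    for n in range(start, stop):
        # put() blocks while queue B is full, which caps memory use
        records.put({"Conference Name": "conf-%d" % n})


def run(fileobj, n_records=100, n_producers=4, max_pending=10):
    records = queue.Queue(maxsize=max_pending)   # bounded queue B
    writer_thread = threading.Thread(
        target=writer_loop, args=(records, fileobj, ["Conference Name", "Link"]))
    writer_thread.start()
    step = n_records // n_producers
    producers = [threading.Thread(target=produce,
                                  args=(records, i * step, (i + 1) * step))
                 for i in range(n_producers)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    records.put(DONE)               # every producer has finished
    writer_thread.join()


if __name__ == "__main__":
    buf = io.StringIO()
    run(buf)
    print(len(buf.getvalue().splitlines()), "lines written")
```

With this arrangement at most `max_pending` records wait in memory at any time, so the whole id range could be crawled in one run.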

Finally, the GitHub repository: https://github.com/lyandut/CrawlerWikiCFP_thread.git

C309

18/03/07, evening