ChinaHR (中华英才网) Crawler Walkthrough (2): Multithreading with the threading Module

Latest recommended article published on 2020-12-06 16:03:56

Ejasmine


Columns: 中华英才网爬虫 (ChinaHR crawler), python爬虫教程从入门到精通 (Python crawler tutorial, from beginner to expert). Tags: threading, python, web crawler

Article link: https://blog.csdn.net/weixin_42183408/article/details/88075373


Welcome to the advanced, hands-on crawler tutorial. Open your IDE and let's start the Python journey!

The threading module

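threading is Python's module for multithreading. Multithreading is the technique of running several threads concurrently.
Using multiple threads improves overall throughput, which in our case means a faster crawler.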
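
As a minimal sketch of what "several threads running concurrently" looks like in code, the example below starts three threads and waits for all of them with join(). The function `fetch_page` and the page numbers are made up for illustration; no real request is sent.

```python
import threading

results = {}                      # shared dict; each thread writes its own key
results_lock = threading.Lock()   # protects the dict while it is being updated


def fetch_page(page_number):
    """Hypothetical stand-in for downloading one page."""
    text = 'content of page {}'.format(page_number)
    with results_lock:
        results[page_number] = text


threads = []
for page_number in (1, 2, 3):
    t = threading.Thread(target=fetch_page, args=(page_number,))
    t.start()
    threads.append(t)

# Wait until every thread has finished before reading the results
for t in threads:
    t.join()

print(sorted(results))
```

The order in which the threads finish is not guaranteed, which is why the keys are sorted before printing.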

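Python is different in one respect, though: it has the GIL, the global interpreter lock, which means only one thread can execute at any given moment. The GIL works like a pass of which there is only one copy. Multithreading in Python therefore gains speed by switching quickly between threads: while one thread waits for a network response it gives up the GIL, and another thread can run in the meantime.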
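
To illustrate why threads still help despite the GIL, here is a rough sketch that compares five simulated requests run one after another with the same five run in threads. `fake_request` is a hypothetical placeholder that only sleeps; the 0.2 second delay is an arbitrary assumption, and real timings depend on the network.

```python
import threading
import time


def fake_request(delay=0.2):
    """Hypothetical stand-in for a network request: it only waits."""
    time.sleep(delay)   # the GIL is released while the thread sleeps


def run_sequential(n):
    """Run n simulated requests one after another and return the elapsed time."""
    start = time.time()
    for _ in range(n):
        fake_request()
    return time.time() - start


def run_threaded(n):
    """Run n simulated requests in n threads and return the elapsed time."""
    start = time.time()
    threads = [threading.Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


sequential_time = run_sequential(5)
threaded_time = run_threaded(5)
print('sequential: {:.2f}s, threaded: {:.2f}s'.format(sequential_time, threaded_time))
```

The waits overlap in the threaded version, so it should finish in roughly the time of a single request. A task that spends its time computing rather than waiting would not see this benefit.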

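Even with the GIL, multithreading still improves efficiency considerably, and combining it with redis, which we will cover later, will improve it further. Enough preamble; let's go through the program.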

Program walkthrough

Here is the code with commentary (the complete code is available on GitHub):

# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import re
import class_connect
import threading

# Put every URL to be requested into link_list
link_list=[]
for i in range(1,208):
    url='http://campus.chinahr.com/qz/P'+str(i)+'/?job_type=10&'
    link_list.append(url)

# Instantiate the database connection class
a = class_connect.spider()
collection = a.connect_to_mongodb()
cur, conn = a.connect_to_mysql()

# Subclass threading.Thread and override its run() method
class myThread(threading.Thread):
    def __init__(self,name,link_range):

        # Call the parent class's __init__()
        threading.Thread.__init__(self)

        # Store the thread name and the range of pages this thread will crawl
        self.name=name
        self.link_range=link_range
    def run(self):

        # crawler is the main function
        print('Starting '+self.name)
        crawler(self.name,self.link_range)
        print('Exiting '+self.name)

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Record the start time
scrapy_time = time.time()

# The main function
def crawler(threadName,link_range):

    # Connect to MySQL again inside the function (each thread calls this itself); otherwise an error is raised
    cur, conn = a.connect_to_mysql()

    # Loop over this thread's share of the link list
    for i in range(link_range[0],link_range[1]+1):

        # Take each URL in turn and request it
        link = link_list[i-1]
        r = requests.get(link, headers=headers, timeout=20)

        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(r.text, "lxml")

        # Use soup.find_all to locate the elements we need
        salary_list = soup.find_all('strong', class_='job-salary')     # salary
        city_list = soup.find_all('span', class_="job-city Fellip")     # city
        top_list = soup.find_all('div', class_="top-area")     # job title and company
        job_info = soup.find_all('div', class_='job-info')     # city, education and headcount
        type_list = soup.find_all('span', class_='industry-name')     # category

        # Loop over every job posting on the page
        for x in range(len(top_list)):

            # Use strip() to extract the text
            salary = salary_list[x].text.strip()     # salary
            city = city_list[x].text.strip()      # city
            top = top_list[x].text.strip()     # job title and company
            job_and_company = top.split('\n', 1)     # split the job title from the company
            job_information = job_info[x].text.strip()     # city, education and headcount
            city_to_people = job_information.split('\n')     # split city, education and headcount
            job_type = type_list[x].text.strip()     # category

            # Dictionary to be stored in MongoDB
            record = {"job": job_and_company[0],
                      "company": job_and_company[1],
                      "salary": salary,
                      "city": city,
                      "type": job_type}

            # Use a for loop to separate city, education and headcount
            for each in range(0, 5):

                # Clean the field with re regular expressions
                first = re.compile(r'  ')     # compile a pattern that matches the spaces to remove
                time_for_sub = first.sub('', city_to_people[each])     # replace the spaces with nothing, i.e. remove them
                another = re.compile(r'/')     # compile a pattern that matches "/"
                the_final_info = another.sub('', time_for_sub)     # replace "/" with nothing, i.e. remove it

                # Pick out the education background and headcount and add them to the dictionary
                if each == 3:
                    record['background'] = the_final_info     # education background
                    back=the_final_info
                if each == 4:
                    record['people'] = the_final_info     # headcount
                    peo=the_final_info

            # Insert into MongoDB and MySQL
            collection.insert_one(record)
            cur.execute(
                "INSERT INTO yingcaiwang(job,company,salary,city,type,background,people) VALUES(%s,%s,%s,%s,%s,%s,%s);",
                (job_and_company[0], job_and_company[1], salary, city, job_type, back, peo))     # SQL statement
            conn.commit()     # commit the change

        # Rest for 3 seconds after every fifth page
        if i % 5 == 0:
            print(threadName+' : the %s page is finished,rest for three seconds' % (i))
            time.sleep(3)

        # Otherwise rest for 1 second after each page
        else:
            print(threadName+' : the %s page is finished,rest for one second' % (i))
            time.sleep(1)

# Divide the 207 pages roughly evenly among the threads
thread_list=[]
link_range_list=[(1,40),(41,80),(81,120),(121,160),(161,207)]

# Start 5 threads with a for loop
for i in range(1,6):
    thread=myThread('Thread-'+str(i),link_range_list[i-1])
    thread.start()
    thread_list.append(thread)

# Wait for all threads to finish
for thread in thread_list:
    thread.join()

# Print the total elapsed time
scrapy_end = time.time()
scrapy_time_whole = scrapy_end - scrapy_time
print('it takes {}'.format(scrapy_time_whole))

# Commit the MySQL changes and close the connection
cur.close()
conn.commit()
conn.close()
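
The script hard-codes five page ranges in `link_range_list`. As a sketch of one way to compute such ranges for any page count and thread count, the helper below returns inclusive `(start, end)` tuples. The name `split_ranges` is hypothetical and not part of the original code.

```python
def split_ranges(total_pages, n_threads):
    """Split pages 1..total_pages into at most n_threads inclusive (start, end) ranges."""
    base, extra = divmod(total_pages, n_threads)
    ranges = []
    start = 1
    for index in range(n_threads):
        # The first `extra` ranges get one additional page
        size = base + (1 if index < extra else 0)
        if size == 0:
            continue   # more threads than pages: skip empty ranges
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


link_range_list = split_ranges(207, 5)
print(link_range_list)
# [(1, 42), (43, 84), (85, 125), (126, 166), (167, 207)]
```

The hand-written list gives the last thread 47 pages and the others 40 each; this version spreads the remainder over the first ranges, so no thread gets more than one extra page.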