I needed data for an analysis project, so I wrote a crawler for Lianjia's Wuhan second-hand housing listings. Scrapy felt too slow for this, so I wrote my own multi-process crawler instead.
1. Data volume: 20,000+ listings
2. Environment: Python 3.6 (Anaconda, with Spyder)
3. Extraction method: XPath
The idea behind the code:
Lianjia is fairly tolerant of crawlers, but you should still throttle your request interval and process count. The site's overview page shows at most 100 pages of 30 listings each, i.e. 3,000 visible records, while the full dataset exceeds 20,000, so we cannot get everything from the first page alone. Sites with large inventories rarely expose all of it at once: it reduces server load, and it doubles as an anti-crawling measure.
The data still has to be reachable somehow, though, so let's analyze the page. The filter bar splits the second-hand inventory by district. Clicking a district reveals two things: (1) only the trailing part of the URL changes, the prefix stays the same; and (2) when a district has more than 3,000 listings, only 3,000 (100 pages) are shown. Looking closer, each district subdivides again into sub-areas. Clicking a sub-area again changes only the area slug in the URL, and at that level the counts drop below 3,000, i.e. under 100 pages, which is our opening. The plan: build the list of sub-area URLs, crawl them in parallel with multiple processes, use multiple threads inside each process to fetch that sub-area's result pages, and save the data.
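The URL scheme this partitioning relies on can be sketched as follows (the slugs 'jiangan' and 'baibuting' are just example district and sub-area identifiers):

```python
# Sketch of the Lianjia Wuhan URL scheme the crawler exploits.
BASE = "https://wh.lianjia.com/ershoufang/"

def district_url(district):
    # District-level listing; shows at most 100 pages (3,000 records)
    return BASE + district + "/"

def page_url(sub_area, page):
    # Sub-area listing, paginated via the /pg<n>/ suffix
    return BASE + sub_area + "/pg" + str(page) + "/"

print(page_url("baibuting", 2))
# https://wh.lianjia.com/ershoufang/baibuting/pg2/
```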
Steps:
Starting URL: https://wh.lianjia.com/ershoufang/baibuting/
1. Preparation: collect the area slug for every URL to crawl.
Areas come in two levels: we have to fetch the districts before we can reach the sub-areas; fortunately the districts are available directly on the first page.
def getHousDict(firstUrl, housDict=None):
    # Build {'district slug': ['sub-area slug', ...]}, e.g. {'jiangan': [...]}
    try:
        # Fetch the first page and extract the district links
        firstpage = getReponse(firstUrl)
        firstpage = etree.HTML(firstpage.text)
        # Districts become the keys; then walk each key to collect its sub-areas
        area = firstpage.xpath("//div[@data-role='ershoufang']/div[1]/a/@href")  # district list
        for a in area:
            a = a.split('/')[-2]
            housDict[a] = []
        for area in housDict:
            nextUrl = 'https://wh.lianjia.com/ershoufang/' + area + '/'
            nextPage = getReponse(nextUrl)
            nextPage = etree.HTML(nextPage.text)
            # Currently selected district
            mudiArea = nextPage.xpath("//div[@data-role='ershoufang']/div[1]/a[@class='selected']/@href")[0]
            mudiArea = mudiArea.split('/')[-2]
            # Sub-areas of the current district
            ar = nextPage.xpath("//div[@data-role='ershoufang']/div[2]/a/@href")
            L = [a.split('/')[-2] for a in ar]
            if area == mudiArea:
                housDict[area] = L
    except Exception:
        print('error building the url list, retrying')
        time.sleep(2)
        getHousDict(firstUrl, housDict)
    return housDict
2. We now have housDict ---> {'district slug': ['sub-area slugs']}
Time to start crawling!
3. Even with the URL list we cannot crawl blindly: the URLs must be deduplicated first.
We keep a set ---> getq = set() ---> every slug we have crawled goes in. Processes walk the URL lists; threads fetch the pages.
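One caveat: a plain module-level set is copied into each worker process, so this dedup only holds within one process. If cross-process dedup matters, a Manager-backed dict can serve as a shared set; a minimal sketch (make_seen and mark_if_new are hypothetical helpers, not part of the crawler):

```python
from multiprocessing import Manager

def make_seen():
    # A Manager dict is visible to all Pool workers, unlike a plain set()
    return Manager().dict()

def mark_if_new(seen, url):
    # True the first time a URL is seen; the check-then-set is not atomic,
    # so rare duplicates remain possible under heavy contention.
    if url in seen:
        return False
    seen[url] = True
    return True

if __name__ == '__main__':
    seen = make_seen()
    print(mark_if_new(seen, 'baibuting'))  # True
    print(mark_if_new(seen, 'baibuting'))  # False
```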
def getInfo(lock, urllist, area):
    # For each sub-area of a district: fetch its page count, then crawl
    # every results page with one thread per page.
    print('getInfo running')
    global getq
    for url in urllist:
        if url in getq:
            continue
        print('-----------> new url', url)
        getq.add(url)
        mainurl = "https://wh.lianjia.com/ershoufang/" + url + '/'
        coutpage = getReponse(mainurl)
        coutpage = etree.HTML(coutpage.text)
        try:
            # page-data holds a dict-like string, e.g. {"totalPage":34,"curPage":1}
            count = coutpage.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            count = eval(count)
            countnum = count['totalPage']  # page count
        except Exception:
            countnum = 1
        try:
            # Thread(target=..., args=...) -- passing the call's result instead
            # would run everything sequentially in the current thread
            threads = [Thread(target=getHous, args=(url, count, area, lock))
                       for count in range(1, countnum + 1)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        except Exception as e:
            print('thread error', e)
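A side note on the eval(count) above: the page-data attribute is a JSON-shaped string (the exact value below is a made-up sample), so json.loads is a safer drop-in that cannot execute arbitrary code from a scraped page:

```python
import json

# Made-up sample of the page-data attribute's shape
sample = '{"totalPage":34,"curPage":1}'

meta = json.loads(sample)  # safer than eval() on scraped text
print(meta['totalPage'])  # 34
```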
4. Once inside a results page, grab the data! Listings vary quite a bit and I certainly have not covered every case, so if you hit an anomaly please let me know and I will fix it.
def getHouseInfo(url, count, area, lock):
    # Crawl one results page, e.g. https://wh.lianjia.com/ershoufang/baibuting/pg2/
    nowurl = "https://wh.lianjia.com/ershoufang/" + url + "/pg" + str(count) + "/"
    try:
        housepage = getReponse(nowurl)
        Page = etree.HTML(housepage.text)
        # Per listing: layout, size, floor, build year, listing date, subway, price
        for info in Page.xpath("//li[@class='clear']"):
            name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']")
            if name:
                name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']/text()")[0]
                # Comprehension instead of remove() while iterating over the list
                name = [x for x in name.split('| ') if x != ' ']
                house_type = name[0]  # layout
                house_size = name[1]  # size
            else:
                house_type = '暂无数据'
                house_size = '暂无数据'
            year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']")
            if year_type:
                year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']/text()")[0]
                year_type = year_type.split(')')
                house_num = year_type[0] + ')'  # floor
                house_year = year_type[1].split(' ')[0]  # build year / building type
            else:
                house_num = '暂无数据'
                house_year = '暂无数据'
            times = info.xpath("./div[@class='info clear']/div[@class='followInfo']")
            if times:
                times = info.xpath("./div[@class='info clear']/div[@class='followInfo']/text()")[0]
                house_times = times.split('/ ')[-1].split('以前')[0]  # listing date
            else:
                house_times = '暂无数据'
            # Subway info
            subway = info.xpath("./div[@class='info clear']/div[@class='tag']/span[@class='subway']/text()")
            if subway != []:
                subway = subway[0]
            else:
                subway = '暂无数据'
            price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span")
            if price:
                price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span/text()")[0]
            else:
                price = '暂无数据'  # numeric part of the price
            pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']")
            if pricedanwei:
                pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/text()")[0]
                houseprice = price + pricedanwei  # total price with unit
            else:
                houseprice = price
            Unit_Price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='unitPrice']/span/text()")
            if Unit_Price != []:
                Unit_Price = Unit_Price[0].split('单价')[-1]
            else:
                Unit_Price = '暂无数据'
            yield {
                '地区': area,
                '街道': url,
                '房型': house_type,
                '面积': house_size,
                '楼层': house_num,
                '建造时间': house_year,
                '发布时间': house_times,
                '房价': houseprice,
                '单价': Unit_Price,
                '地铁': subway
            }
    except Exception as e:
        print('getHouseInfo error', e)
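The string handling in getHouseInfo can be exercised on a made-up houseInfo text (the real page text may differ); a list comprehension also sidesteps the pitfall of calling remove() on a list while iterating over it:

```python
# Hypothetical houseInfo text node; field order follows the code above
raw = "2室1厅| 88.5平米| 南| 精装"

# Drop the lone-space fields without mutating during iteration
fields = [x for x in raw.split('| ') if x != ' ']
house_type, house_size = fields[0], fields[1]
print(house_type, house_size)  # 2室1厅 88.5平米
```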
5. The data cannot just sit in memory; we have to write it out before it is of any use.
def getHous(url, count, area, lock):
    for item in getHouseInfo(url, count, area, lock):
        print('got an item, writing')
        lock.acquire()
        write_info(item)
        lock.release()
One thing you must watch here: with multiple processes and threads, whether you use a process pool, async submission, a thread pool or anything else, any write to a file or database must be protected by a lock! I used multiprocessing's Manager().Lock(); it appears in the full code below.
Usage notes:
First: mind your request interval.
Second: handle the error cases. What if the status code is not 200? What if you get a captcha? What if the network is flaky?
Third: headers. Build them from what a normal browser actually sends and randomize them, especially on sites with aggressive anti-crawling.
The full code:
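For the second point, a hedged sketch of a retry wrapper (the retry counts and timeouts are arbitrary; captcha pages would still need their own detection logic):

```python
import time
import requests

def fetch(url, headers=None, retries=3, backoff=2):
    # Retry on non-200 responses and network errors with a growing delay;
    # returns None once every attempt has failed.
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp
            print('bad status', resp.status_code)
        except requests.RequestException as e:
            print('network error', e)
        time.sleep(backoff * (attempt + 1))
    return None
```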
import random
import requests
from lxml import etree
import time
from multiprocessing import Pool, Manager
from threading import Thread

def userAgent():
    # Pick a User-Agent at random
    useragent = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)'
    ]
    return random.choice(useragent)

def getReponse(url):
    time.sleep(2)
    useragent = userAgent()  # random User-Agent
    # Host can be changed per city
    headers = {
        'User-Agent': useragent,
        'Cookie': "lianjia_uuid=80d19086-77a7-4c19-a170-837a6807599d; _jzqckmp=1; UM_distinctid=1632551ec1a7dd-043124e50105a2-454c062c-1fa400-1632551ec1b103e; _ga=GA1.2.1902842581.1525339523; _gid=GA1.2.1439105168.1525339523; select_city=420100; _smt_uid=5aead5c9.24201108; _jzqx=1.1525346703.1525346703.1.jzqsr=wh%2Elianjia%2Ecom|jzqct=/ershoufang/baibuting/.-; all-lj=dafad6dd721afb903f2a315ab2f72633; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1525339589,1525350042,1525395704; CNZZDATA1255849575=327501666-1525338295-https%253A%252F%252Fwww.lianjia.com%252F%7C1525391329; CNZZDATA1254525948=671308690-1525335895-https%253A%252F%252Fwww.lianjia.com%252F%7C1525391567; CNZZDATA1255633284=62432075-1525337313-https%253A%252F%252Fwww.lianjia.com%252F%7C1525394112; CNZZDATA1255604082=137477002-1525337374-https%253A%252F%252Fwww.lianjia.com%252F%7C1525390582; _qzjc=1; _jzqa=1.3645530149229753300.1525339512.1525350043.1525395704.4; _jzqc=1; _jzqy=1.1525339512.1525395704.1.jzqsr=baidu.-; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1525395732; _qzja=1.1290139477.1525339589257.1525350042524.1525395704204.1525395723323.1525395732013.0.0.0.88.4; _qzjto=5.1.0; lianjia_ssid=dcf4ad87-0def-73d4-fe77-087b3db79de3",
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Host': 'wh.lianjia.com'
    }
    page = requests.get(url, headers=headers)
    if page:
        return page
    else:
        return getReponse(url)
def getHousDict(firstUrl, housDict=None):
    # Build {'district slug': ['sub-area slug', ...]}, e.g. {'jiangan': [...]}
    try:
        # Fetch the first page and extract the district links
        firstpage = getReponse(firstUrl)
        firstpage = etree.HTML(firstpage.text)
        # Districts become the keys; then walk each key to collect its sub-areas
        area = firstpage.xpath("//div[@data-role='ershoufang']/div[1]/a/@href")  # district list
        for a in area:
            a = a.split('/')[-2]
            housDict[a] = []
        for area in housDict:
            nextUrl = 'https://wh.lianjia.com/ershoufang/' + area + '/'
            nextPage = getReponse(nextUrl)
            nextPage = etree.HTML(nextPage.text)
            # Currently selected district
            mudiArea = nextPage.xpath("//div[@data-role='ershoufang']/div[1]/a[@class='selected']/@href")[0]
            mudiArea = mudiArea.split('/')[-2]
            # Sub-areas of the current district
            ar = nextPage.xpath("//div[@data-role='ershoufang']/div[2]/a/@href")
            L = [a.split('/')[-2] for a in ar]
            if area == mudiArea:
                housDict[area] = L
    except Exception:
        print('error building the url list, retrying')
        time.sleep(2)
        getHousDict(firstUrl, housDict)
    return housDict
def getHous(url, count, area, lock):
    for item in getHouseInfo(url, count, area, lock):
        print('got an item, writing')
        lock.acquire()
        write_info(item)
        lock.release()
def getHouseInfo(url, count, area, lock):
    # Crawl one results page, e.g. https://wh.lianjia.com/ershoufang/baibuting/pg2/
    nowurl = "https://wh.lianjia.com/ershoufang/" + url + "/pg" + str(count) + "/"
    try:
        housepage = getReponse(nowurl)
        Page = etree.HTML(housepage.text)
        # Per listing: layout, size, floor, build year, listing date, subway, price
        for info in Page.xpath("//li[@class='clear']"):
            name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']")
            if name:
                name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']/text()")[0]
                # Comprehension instead of remove() while iterating over the list
                name = [x for x in name.split('| ') if x != ' ']
                house_type = name[0]  # layout
                house_size = name[1]  # size
            else:
                house_type = '暂无数据'
                house_size = '暂无数据'
            year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']")
            if year_type:
                year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']/text()")[0]
                year_type = year_type.split(')')
                house_num = year_type[0] + ')'  # floor
                house_year = year_type[1].split(' ')[0]  # build year / building type
            else:
                house_num = '暂无数据'
                house_year = '暂无数据'
            times = info.xpath("./div[@class='info clear']/div[@class='followInfo']")
            if times:
                times = info.xpath("./div[@class='info clear']/div[@class='followInfo']/text()")[0]
                house_times = times.split('/ ')[-1].split('以前')[0]  # listing date
            else:
                house_times = '暂无数据'
            # Subway info
            subway = info.xpath("./div[@class='info clear']/div[@class='tag']/span[@class='subway']/text()")
            if subway != []:
                subway = subway[0]
            else:
                subway = '暂无数据'
            price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span")
            if price:
                price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span/text()")[0]
            else:
                price = '暂无数据'  # numeric part of the price
            pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']")
            if pricedanwei:
                pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/text()")[0]
                houseprice = price + pricedanwei  # total price with unit
            else:
                houseprice = price
            Unit_Price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='unitPrice']/span/text()")
            if Unit_Price != []:
                Unit_Price = Unit_Price[0].split('单价')[-1]
            else:
                Unit_Price = '暂无数据'
            yield {
                '地区': area,
                '街道': url,
                '房型': house_type,
                '面积': house_size,
                '楼层': house_num,
                '建造时间': house_year,
                '发布时间': house_times,
                '房价': houseprice,
                '单价': Unit_Price,
                '地铁': subway
            }
    except Exception as e:
        print('getHouseInfo error', e)
def write_info(item):
    # Append one gbk-encoded CSV row (gbk so Excel on Chinese Windows opens it)
    fields = ['地区', '街道', '房型', '面积', '楼层', '建造时间', '发布时间', '房价', '单价', '地铁']
    with open('lianjiaershou.csv', 'ab') as f:
        item = dict(item)
        row = ','.join(item[k] for k in fields)
        f.write(row.encode('gbk'))
        f.write(b'\r\n')
def getInfo(lock, urllist, area):
    # For each sub-area of a district: fetch its page count, then crawl
    # every results page with one thread per page.
    print('getInfo running')
    global getq
    for url in urllist:
        if url in getq:
            continue
        print('-----------> new url', url)
        getq.add(url)
        mainurl = "https://wh.lianjia.com/ershoufang/" + url + '/'
        coutpage = getReponse(mainurl)
        coutpage = etree.HTML(coutpage.text)
        try:
            # page-data holds a dict-like string, e.g. {"totalPage":34,"curPage":1}
            count = coutpage.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            count = eval(count)
            countnum = count['totalPage']  # page count
        except Exception:
            countnum = 1
        try:
            # Thread(target=..., args=...) -- passing the call's result instead
            # would run everything sequentially in the current thread
            threads = [Thread(target=getHous, args=(url, count, area, lock))
                       for count in range(1, countnum + 1)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        except Exception as e:
            print('thread error', e)
if __name__ == '__main__':
    import functools
    # First build the dict of second-hand housing areas
    firstUrl = 'https://wh.lianjia.com/ershoufang/baibuting/'
    housDict = {}
    print('starting up')
    housDict = getHousDict(firstUrl, housDict)
    # print(housDict)  # slugs of every address to crawl
    print('url list ready')
    getq = set()  # sub-areas already crawled
    manager = Manager()
    lock = manager.Lock()
    getInfo_Lock = functools.partial(getInfo, lock)
    p = Pool(3)
    if housDict:
        print('pool starting')
        for area in housDict:  # one async task per district
            print('submitting', area)
            urllist = housDict[area]
            # Hand the callable to apply_async; calling getInfo_Lock(...) here
            # would run it synchronously in the main process
            p.apply_async(getInfo_Lock, kwds={'urllist': urllist, 'area': area})
        p.close()
        p.join()
There is some redundancy in the code above; trim it to fit your own needs.