I needed data for an analysis project, so I wrote a crawler for Lianjia's Wuhan second-hand housing listings. Scrapy felt too slow for this, so I wrote my own multi-process crawler instead.
1. Data volume: 20,000+ listings
2. Environment: Python 3.6 (Anaconda, with Spyder)
3. Extraction method: XPath
The idea behind the code:
Lianjia is fairly tolerant of crawlers, but you should still throttle your request interval and process count. The site's overview page shows at most 100 pages of 30 listings each, i.e. 3,000 visible records, while the full dataset exceeds 20,000, so we cannot get everything from the first page alone. Sites with large inventories rarely expose all of it at once: it reduces server load, and it doubles as an anti-crawling measure.
The data still has to be reachable somehow, though, so let's analyze the page. The filter bar splits the second-hand inventory by district. Clicking a district reveals two things: (1) only the trailing part of the URL changes, the prefix stays the same; and (2) when a district has more than 3,000 listings, only 3,000 (100 pages) are shown. Looking closer, each district subdivides again into sub-areas. Clicking a sub-area again changes only the area slug in the URL, and at that level the counts drop below 3,000, i.e. under 100 pages, which is our opening. The plan: build the list of sub-area URLs, crawl them in parallel with multiple processes, use multiple threads inside each process to fetch that sub-area's result pages, and save the data.
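The URL scheme this partitioning relies on can be sketched as follows (the slugs 'jiangan' and 'baibuting' are just example district and sub-area identifiers):

```python
# Sketch of the Lianjia Wuhan URL scheme the crawler exploits.
BASE = "https://wh.lianjia.com/ershoufang/"

def district_url(district):
    # District-level listing; shows at most 100 pages (3,000 records)
    return BASE + district + "/"

def page_url(sub_area, page):
    # Sub-area listing, paginated via the /pg<n>/ suffix
    return BASE + sub_area + "/pg" + str(page) + "/"

print(page_url("baibuting", 2))
# https://wh.lianjia.com/ershoufang/baibuting/pg2/
```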
Steps:
Starting URL: https://wh.lianjia.com/ershoufang/baibuting/
1. Preparation: collect the area slug for every URL to crawl.
Areas come in two levels: we have to fetch the districts before we can reach the sub-areas; fortunately the districts are available directly on the first page.
def getHousDict(firstUrl, housDict=None):
    # Build {'district slug': ['sub-area slug', ...]}, e.g. {'jiangan': [...]}
    try:
        # Fetch the first page and extract the district links
        firstpage = getReponse(firstUrl)
        firstpage = etree.HTML(firstpage.text)
        # Districts become the keys; then walk each key to collect its sub-areas
        area = firstpage.xpath("//div[@data-role='ershoufang']/div[1]/a/@href")  # district list
        for a in area:
            a = a.split('/')[-2]
            housDict[a] = []
        for area in housDict:
            nextUrl = 'https://wh.lianjia.com/ershoufang/' + area + '/'
            nextPage = getReponse(nextUrl)
            nextPage = etree.HTML(nextPage.text)
            # Currently selected district
            mudiArea = nextPage.xpath("//div[@data-role='ershoufang']/div[1]/a[@class='selected']/@href")[0]
            mudiArea = mudiArea.split('/')[-2]
            # Sub-areas of the current district
            ar = nextPage.xpath("//div[@data-role='ershoufang']/div[2]/a/@href")
            L = [a.split('/')[-2] for a in ar]
            if area == mudiArea:
                housDict[area] = L
    except Exception:
        print('error building the url list, retrying')
        time.sleep(2)
        getHousDict(firstUrl, housDict)
    return housDict
2. We now have housDict ---> {'district slug': ['sub-area slugs']}
Time to start crawling!
3. Even with the URL list we cannot crawl blindly: the URLs must be deduplicated first.
We keep a set ---> getq = set() ---> every slug we have crawled goes in. Processes walk the URL lists; threads fetch the pages.
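One caveat: a plain module-level set is copied into each worker process, so this dedup only holds within one process. If cross-process dedup matters, a Manager-backed dict can serve as a shared set; a minimal sketch (make_seen and mark_if_new are hypothetical helpers, not part of the crawler):

```python
from multiprocessing import Manager

def make_seen():
    # A Manager dict is visible to all Pool workers, unlike a plain set()
    return Manager().dict()

def mark_if_new(seen, url):
    # True the first time a URL is seen; the check-then-set is not atomic,
    # so rare duplicates remain possible under heavy contention.
    if url in seen:
        return False
    seen[url] = True
    return True

if __name__ == '__main__':
    seen = make_seen()
    print(mark_if_new(seen, 'baibuting'))  # True
    print(mark_if_new(seen, 'baibuting'))  # False
```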
def getInfo(lock, urllist, area):
    # For each sub-area of a district: fetch its page count, then crawl
    # every results page with one thread per page.
    print('getInfo running')
    global getq
    for url in urllist:
        if url in getq:
            continue
        print('-----------> new url', url)
        getq.add(url)
        mainurl = "https://wh.lianjia.com/ershoufang/" + url + '/'
        coutpage = getReponse(mainurl)
        coutpage = etree.HTML(coutpage.text)
        try:
            # page-data holds a dict-like string, e.g. {"totalPage":34,"curPage":1}
            count = coutpage.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            count = eval(count)
            countnum = count['totalPage']  # page count
        except Exception:
            countnum = 1
        try:
            # Thread(target=..., args=...) -- passing the call's result instead
            # would run everything sequentially in the current thread
            threads = [Thread(target=getHous, args=(url, count, area, lock))
                       for count in range(1, countnum + 1)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        except Exception as e:
            print('thread error', e)
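A side note on the eval(count) above: the page-data attribute is a JSON-shaped string (the exact value below is a made-up sample), so json.loads is a safer drop-in that cannot execute arbitrary code from a scraped page:

```python
import json

# Made-up sample of the page-data attribute's shape
sample = '{"totalPage":34,"curPage":1}'

meta = json.loads(sample)  # safer than eval() on scraped text
print(meta['totalPage'])  # 34
```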
4. Once inside a results page, grab the data! Listings vary quite a bit and I certainly have not covered every case, so if you hit an anomaly please let me know and I will fix it.
def getHouseInfo(url, count, area, lock):
    # Crawl one results page, e.g. https://wh.lianjia.com/ershoufang/baibuting/pg2/
    nowurl = "https://wh.lianjia.com/ershoufang/" + url + "/pg" + str(count) + "/"
    try:
        housepage = getReponse(nowurl)
        Page = etree.HTML(housepage.text)
        # Per listing: layout, size, floor, build year, listing date, subway, price
        for info in Page.xpath("//li[@class='clear']"):
            name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']")
            if name:
                name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']/text()")[0]
                # Comprehension instead of remove() while iterating over the list
                name = [x for x in name.split('| ') if x != ' ']
                house_type = name[0]  # layout
                house_size = name[1]  # size
            else:
                house_type = '暂无数据'
                house_size = '暂无数据'
            year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']")
            if year_type:
                year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']/text()")[0]
                year_type = year_type.split(')')
                house_num = year_type[0] + ')'  # floor
                house_year = year_type[1].split(' ')[0]  # build year / building type
            else:
                house_num = '暂无数据'
                house_year = '暂无数据'
            times = info.xpath("./div[@class='info clear']/div[@class='followInfo']")
            if times:
                times = info.xpath("./div[@class='info clear']/div[@class='followInfo']/text()")[0]
                house_times = times.split('/ ')[-1].split('以前')[0]  # listing date
            else:
                house_times = '暂无数据'
            # Subway info
            subway = info.xpath("./div[@class='info clear']/div[@class='tag']/span[@class='subway']/text()")
            if subway != []:
                subway = subway[0]
            else:
                subway = '暂无数据'
            price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span")
            if price:
                price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span/text()")[0]
            else:
                price = '暂无数据'  # numeric part of the price
            pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']")
            if pricedanwei:
                pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/text()")[0]
                houseprice = price + pricedanwei  # total price with unit
            else:
                houseprice = price
            Unit_Price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='unitPrice']/span/text()")
            if Unit_Price != []:
                Unit_Price = Unit_Price[0].split('单价')[-1]
            else:
                Unit_Price = '暂无数据'
            yield {
                '地区': area,
                '街道': url,
                '房型': house_type,
                '面积': house_size,
                '楼层': house_num,
                '建造时间': house_year,
                '发布时间': house_times,
                '房价': houseprice,
                '单价': Unit_Price,
                '地铁': subway
            }
    except Exception as e:
        print('getHouseInfo error', e)
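The string handling in getHouseInfo can be exercised on a made-up houseInfo text (the real page text may differ); a list comprehension also sidesteps the pitfall of calling remove() on a list while iterating over it:

```python
# Hypothetical houseInfo text node; field order follows the code above
raw = "2室1厅| 88.5平米| 南| 精装"

# Drop the lone-space fields without mutating during iteration
fields = [x for x in raw.split('| ') if x != ' ']
house_type, house_size = fields[0], fields[1]
print(house_type, house_size)  # 2室1厅 88.5平米
```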
5. The data cannot just sit in memory; we have to write it out before it is of any use.
def getHous(url, count, area, lock):
    for item in getHouseInfo(url, count, area, lock):
        print('got an item, writing')
        lock.acquire()
        write_info(item)
        lock.release()
One thing you must watch here: with multiple processes and threads, whether you use a process pool, async submission, a thread pool or anything else, any write to a file or database must be protected by a lock! I used multiprocessing's Manager().Lock(); it appears in the full code below.
Usage notes:
First: mind your request interval.
Second: handle the error cases. What if the status code is not 200? What if you get a captcha? What if the network is flaky?
Third: headers. Build them from what a normal browser actually sends and randomize them, especially on sites with aggressive anti-crawling.
The full code:
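For the second point, a hedged sketch of a retry wrapper (the retry counts and timeouts are arbitrary; captcha pages would still need their own detection logic):

```python
import time
import requests

def fetch(url, headers=None, retries=3, backoff=2):
    # Retry on non-200 responses and network errors with a growing delay;
    # returns None once every attempt has failed.
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp
            print('bad status', resp.status_code)
        except requests.RequestException as e:
            print('network error', e)
        time.sleep(backoff * (attempt + 1))
    return None
```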
import random
import requests
from lxml import etree
import time
from multiprocessing import Pool, Manager
from threading import Thread

def userAgent():
    # Pick a User-Agent at random
    useragent = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)'
    ]
    return random.choice(useragent)

def getReponse(url):
    time.sleep(2)
    useragent = userAgent()  # random User-Agent
    # Host can be changed per city
    headers = {
        'User-Agent': useragent,
        'Cookie': "lianjia_uuid=80d19086-77a7-4c19-a170-837a6807599d; _jzqckmp=1; UM_distinctid=1632551ec1a7dd-043124e50105a2-454c062c-1fa400-1632551ec1b103e; _ga=GA1.2.1902842581.1525339523; _gid=GA1.2.1439105168.1525339523; select_city=420100; _smt_uid=5aead5c9.24201108; _jzqx=1.1525346703.1525346703.1.jzqsr=wh%2Elianjia%2Ecom|jzqct=/ershoufang/baibuting/.-; all-lj=dafad6dd721afb903f2a315ab2f72633; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1525339589,1525350042,1525395704; CNZZDATA1255849575=327501666-1525338295-https%253A%252F%252Fwww.lianjia.com%252F%7C1525391329; CNZZDATA1254525948=671308690-1525335895-https%253A%252F%252Fwww.lianjia.com%252F%7C1525391567; CNZZDATA1255633284=62432075-1525337313-https%253A%252F%252Fwww.lianjia.com%252F%7C1525394112; CNZZDATA1255604082=137477002-1525337374-https%253A%252F%252Fwww.lianjia.com%252F%7C1525390582; _qzjc=1; _jzqa=1.3645530149229753300.1525339512.1525350043.1525395704.4; _jzqc=1; _jzqy=1.1525339512.1525395704.1.jzqsr=baidu.-; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1525395732; _qzja=1.1290139477.1525339589257.1525350042524.1525395704204.1525395723323.1525395732013.0.0.0.88.4; _qzjto=5.1.0; lianjia_ssid=dcf4ad87-0def-73d4-fe77-087b3db79de3",
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Host': 'wh.lianjia.com'
    }
    page = requests.get(url, headers=headers)
    if page:
        return page
    else:
        return getReponse(url)
def getHousDict(firstUrl, housDict=None):
    # Build {'district slug': ['sub-area slug', ...]}, e.g. {'jiangan': [...]}
    try:
        # Fetch the first page and extract the district links
        firstpage = getReponse(firstUrl)
        firstpage = etree.HTML(firstpage.text)
        # Districts become the keys; then walk each key to collect its sub-areas
        area = firstpage.xpath("//div[@data-role='ershoufang']/div[1]/a/@href")  # district list
        for a in area:
            a = a.split('/')[-2]
            housDict[a] = []
        for area in housDict:
            nextUrl = 'https://wh.lianjia.com/ershoufang/' + area + '/'
            nextPage = getReponse(nextUrl)
            nextPage = etree.HTML(nextPage.text)
            # Currently selected district
            mudiArea = nextPage.xpath("//div[@data-role='ershoufang']/div[1]/a[@class='selected']/@href")[0]
            mudiArea = mudiArea.split('/')[-2]
            # Sub-areas of the current district
            ar = nextPage.xpath("//div[@data-role='ershoufang']/div[2]/a/@href")
            L = [a.split('/')[-2] for a in ar]
            if area == mudiArea:
                housDict[area] = L
    except Exception:
        print('error building the url list, retrying')
        time.sleep(2)
        getHousDict(firstUrl, housDict)
    return housDict
def getHous(url, count, area, lock):
    for item in getHouseInfo(url, count, area, lock):
        print('got an item, writing')
        lock.acquire()
        write_info(item)
        lock.release()
def getHouseInfo(url, count, area, lock):
    # Crawl one results page, e.g. https://wh.lianjia.com/ershoufang/baibuting/pg2/
    nowurl = "https://wh.lianjia.com/ershoufang/" + url + "/pg" + str(count) + "/"
    try:
        housepage = getReponse(nowurl)
        Page = etree.HTML(housepage.text)
        # Per listing: layout, size, floor, build year, listing date, subway, price
        for info in Page.xpath("//li[@class='clear']"):
            name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']")
            if name:
                name = info.xpath("./div[@class='info clear']/div[@class='address']/div[@class='houseInfo']/text()")[0]
                # Comprehension instead of remove() while iterating over the list
                name = [x for x in name.split('| ') if x != ' ']
                house_type = name[0]  # layout
                house_size = name[1]  # size
            else:
                house_type = '暂无数据'
                house_size = '暂无数据'
            year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']")
            if year_type:
                year_type = info.xpath("./div[@class='info clear']/div[@class='flood']/div[@class='positionInfo']/text()")[0]
                year_type = year_type.split(')')
                house_num = year_type[0] + ')'  # floor
                house_year = year_type[1].split(' ')[0]  # build year / building type
            else:
                house_num = '暂无数据'
                house_year = '暂无数据'
            times = info.xpath("./div[@class='info clear']/div[@class='followInfo']")
            if times:
                times = info.xpath("./div[@class='info clear']/div[@class='followInfo']/text()")[0]
                house_times = times.split('/ ')[-1].split('以前')[0]  # listing date
            else:
                house_times = '暂无数据'
            # Subway info
            subway = info.xpath("./div[@class='info clear']/div[@class='tag']/span[@class='subway']/text()")
            if subway != []:
                subway = subway[0]
            else:
                subway = '暂无数据'
            price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span")
            if price:
                price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/span/text()")[0]
            else:
                price = '暂无数据'  # numeric part of the price
            pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']")
            if pricedanwei:
                pricedanwei = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='totalPrice']/text()")[0]
                houseprice = price + pricedanwei  # total price with unit
            else:
                houseprice = price
            Unit_Price = info.xpath("./div[@class='info clear']/div[@class='priceInfo']/div[@class='unitPrice']/span/text()")
            if Unit_Price != []:
                Unit_Price = Unit_Price[0].split('单价')[-1]
            else:
                Unit_Price = '暂无数据'
            yield {
                '地区': area,
                '街道': url,
                '房型': house_type,
                '面积': house_size,
                '楼层': house_num,
                '建造时间': house_year,
                '发布时间': house_times,
                '房价': houseprice,
                '单价': Unit_Price,
                '地铁': subway
            }
    except Exception as e:
        print('getHouseInfo error', e)
def write_info(item):
    # Append one gbk-encoded CSV row (gbk so Excel on Chinese Windows opens it)
    fields = ['地区', '街道', '房型', '面积', '楼层', '建造时间', '发布时间', '房价', '单价', '地铁']
    with open('lianjiaershou.csv', 'ab') as f:
        item = dict(item)
        row = ','.join(item[k] for k in fields)
        f.write(row.encode('gbk'))
        f.write(b'\r\n')
def getInfo(lock, urllist, area):
    # For each sub-area of a district: fetch its page count, then crawl
    # every results page with one thread per page.
    print('getInfo running')
    global getq
    for url in urllist:
        if url in getq:
            continue
        print('-----------> new url', url)
        getq.add(url)
        mainurl = "https://wh.lianjia.com/ershoufang/" + url + '/'
        coutpage = getReponse(mainurl)
        coutpage = etree.HTML(coutpage.text)
        try:
            # page-data holds a dict-like string, e.g. {"totalPage":34,"curPage":1}
            count = coutpage.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            count = eval(count)
            countnum = count['totalPage']  # page count
        except Exception:
            countnum = 1
        try:
            # Thread(target=..., args=...) -- passing the call's result instead
            # would run everything sequentially in the current thread
            threads = [Thread(target=getHous, args=(url, count, area, lock))
                       for count in range(1, countnum + 1)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        except Exception as e:
            print('thread error', e)
if __name__ == '__main__':
    import functools
    # First build the dict of second-hand housing areas
    firstUrl = 'https://wh.lianjia.com/ershoufang/baibuting/'
    housDict = {}
    print('starting up')
    housDict = getHousDict(firstUrl, housDict)
    # print(housDict)  # slugs of every address to crawl
    print('url list ready')
    getq = set()  # sub-areas already crawled
    manager = Manager()
    lock = manager.Lock()
    getInfo_Lock = functools.partial(getInfo, lock)
    p = Pool(3)
    if housDict:
        print('pool starting')
        for area in housDict:  # one async task per district
            print('submitting', area)
            urllist = housDict[area]
            # Hand the callable to apply_async; calling getInfo_Lock(...) here
            # would run it synchronously in the main process
            p.apply_async(getInfo_Lock, kwds={'urllist': urllist, 'area': area})
        p.close()
        p.join()
There is some redundancy in the code above; trim it to fit your own needs.