Python Web Scraping Series: Fetching Ziroom (自如) Rental Listings

数据观察

Tags: python, web scraping, Ziroom, rental prices, image recognition

Article link: https://blog.csdn.net/qq_42257125/article/details/99636367

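Reposted from the WeChat public account “数据观察”.

This article explores using Python to collect rental listings for all visible shared and whole-unit rentals on Ziroom (自如) in Beijing. Scraping and parsing rely mainly on selenium and BeautifulSoup.
I. Overview
The workflow has three main steps:

Using the subway stations offered in the search bar, scrape each station and the link to its listings, and save them
Open each link and parse all the rental information on it
Store the parsed text data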

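II. Data Acquisition

2.1 Fetching all subway stations and their listing pages
First, scrape every selectable subway station and its link from the search page. The data is saved as a CSV file containing the subway line, the station name, and the corresponding link. This page is static, so for convenience it is fetched directly with Requests.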

import csv

import requests
from bs4 import BeautifulSoup as bs

# `url` is the search page URL; the original article does not show its value.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
resp = requests.get(url, headers=headers)
html_data = resp.text
soup = bs(html_data, 'html.parser')
# The second 'clearfix filterList' block is taken to be the subway filter.
comment_div_lits = soup.find_all('ul', class_='clearfix filterList')
item = comment_div_lits[1]
subway_list = item.find_all('li', class_='')
station_line_list = []
# Index 0 is skipped; each remaining <li> is treated as one subway line.
for i in range(1, len(subway_list)):
    subway = subway_list[i].find_all('a')
    station_list = subway_list[i].find_all('span', class_='tag')
    for station in station_list:
        stations = station.find_all('a')
        station_name = stations[0].string
        for lines in stations:
            line_href = lines.get('href')
            # Row layout: [subway line, station name, listing link]
            station_line_list.append([subway[0].string, station_name, line_href])
# The with-block closes the file so every row is flushed to disk.
with open('station_list.csv', 'w', newline='') as out:
    csv_write = csv.writer(out, dialect='excel')
    for station in station_line_list:
        if station[1] == '全部':  # skip the "All" entry
            continue
        csv_write.writerow([station[0], station[1], station[2]])
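
The saved rows have to be read back before the listing pages can be scraped. A minimal sketch, assuming the three-column layout written above (subway line, station name, link); the helper name `load_station_rows` is illustrative and does not come from the original article:

```python
import csv


def load_station_rows(path):
    """Read station_list.csv back as [line, station, href] rows.

    Hypothetical helper: blank rows and rows without a link are skipped.
    """
    rows = []
    with open(path, newline='') as f:
        for row in csv.reader(f, dialect='excel'):
            if len(row) >= 3 and row[2]:
                rows.append(row[:3])
    return rows
```

The returned list can then be passed as the `dec_file` argument of the `get_info` function in section 2.2.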

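2.2 Scraping the rental listings
Because the site obfuscates rental prices, scraping is split into two parts: general information first, then prices.
(1) General information
Each station's link leads to a page that shows the basic information for every rental directly. The fields scraped are: residential community, floor area, floor, layout, distance to the subway station, and the other listing tags. Because the page is dynamic, it is loaded with selenium and then parsed with BeautifulSoup.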

import time

from bs4 import BeautifulSoup as bs
from selenium import webdriver


def get_info(dec_file):
    """Scrape every listing page for each [line, station, href] row."""
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    for row in dec_file:
        print(row[1])
        url = row[2]
        all_house = []
        start = time.time()
        for i in range(1, 50):
            # Selenium 4 takes the options object via `options=`.
            browser = webdriver.Chrome(options=chrome_options)
            urls = 'http:' + url + '?p=' + str(i)
            browser.get(urls)
            print('Page:', i)
            html_data2 = browser.page_source.encode('utf-8')
            browser.quit()  # quit() also ends the driver process
            soup = bs(html_data2, 'html.parser')
            # A 'nomsg area' div means there are no more results.
            nomsg = soup.find_all('div', class_='nomsg area')
            if len(nomsg) != 0:
                break
            house_lits = soup.find_all('li', class_='clearfix')
            # get_price_list, recognition_img and get_price are defined in part (2).
            img_path = get_price_list(soup)
            num_dict = recognition_img(img_path)
            for house in house_lits:
                house_name = house.find_all('a', target='_blank', class_='t1')[0].string
                if house_name is None:
                    continue
                house_detail = house.find('div', class_='detail')
                house_details = [d.get_text() for d in house_detail.select('span')]
                house_characteristic = house.find('p', class_='room_tags clearfix')
                characteristic_list = [c.get_text() for c in house_characteristic.select('span')]
                # Decode the price from the obfuscated price spans (see part (2)).
                price_spans = house.find('div', class_='priceDetail').select('span')
                house_price = [span.get('style') for span in price_spans]
                price = get_price(house_price[1:-1], num_dict)
                all_house.append([house_name, price, house_details, characteristic_list])
        end = time.time()
        print('time:', end - start)
        # csv_write1 is a csv.writer created by the caller (not shown in the article).
        for house in all_house:
            csv_write1.writerow([row[0], row[1], house])
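
The paging logic in `get_info` (request `?p=1`, `?p=2`, ... and stop at the first page with no results) can be separated from the browser code. The sketch below is one way to express that pattern; `iter_pages`, `fetch_page` and `is_empty` are names introduced here for illustration, and the two callables stand in for the selenium page load and the `nomsg area` check:

```python
def iter_pages(fetch_page, href, is_empty, max_pages=49):
    """Yield (page_number, html) until the first empty page.

    fetch_page(url) returns the page source and is_empty(html) reports
    whether the page has no listings. Both are injected (assumptions of
    this sketch) so the loop can be exercised without a browser.
    """
    for page in range(1, max_pages + 1):
        html = fetch_page('http:' + href + '?p=' + str(page))
        if is_empty(html):
            break
        yield page, html
```

Injecting the two callables keeps the stop condition in one place and lets the loop be checked with a fake fetcher before any real request is made.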

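(2) Rent prices
1. Page analysis

[Image: the price element in the page source]

Inspecting the page shows that the rent is rendered by a stack of image-position tags: the first row is the RMB (yuan) symbol, the last row is the units digit of the rent, the second-to-last row is the tens digit, and so on. This is the obfuscation the site applies against price scrapers. The background-position values are fixed multiples of 30px, each standing for one of the digits 0-9 (the original writes the range as 【-210:30:0】, while the mapping code later in this article uses the ten offsets from -0px to -270px). Comparing pages shows that within one page the same value always means the same digit, but the mapping differs from page to page.
Further inspection shows that a style block defined at the bottom of the page sets several properties of the price element. Its image link points to a picture like the one below, which is the key to decoding the prices on that page: the digits 6532148907 are, in order, the values for the successive offsets.

[Image: the digit sprite, reading 6532148907]

The price can therefore be decoded by saving that image, reading its digits with image recognition to learn which digit each background-position value stands for, and then reassembling the rent.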
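
To make the decoding step concrete, here is a small worked example using the digit string from the image above. It is a sketch under the assumption that the k-th digit of the sprite sits at offset -30*k px, which is how the mapping code later in the article keys it; `decode_price` and the sample offsets are hypothetical:

```python
# Digits read from the sprite image in the article's example page.
sprite_digits = '6532148907'

# Assumption: the k-th digit in the sprite sits at offset -30*k px.
offset_to_digit = {-30 * k: d for k, d in enumerate(sprite_digits)}


def decode_price(offsets):
    """Turn a sequence of background-position offsets into a price string."""
    return ''.join(offset_to_digit[o] for o in offsets)


# Hypothetical listing whose four digit spans use these offsets:
print(decode_price([-90, -120, -60, -240]))  # 2130
```

Because the sprite changes from page to page, the mapping has to be rebuilt for every page before its prices are decoded.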
2. Data scraping
Collect the background-position value of each price digit:

house_price = []
get_price_code = house.find('div', class_='priceDetail')
for price_code in get_price_code.select('span'):
    house_price.append(price_code.get('style'))
# The first span is the currency symbol; the last span is dropped as well.
house_price_list = house_price[1:-1]

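Save the image referenced in the page's style block, which fixes the digit order for that page. To improve recognition accuracy, the saved image is placed on a white background.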

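import os
import re

import requests
from PIL import Image


def get_price_list(soup):
    # The second <style> block carries the CSS for the price sprite.
    price = soup.find_all('style')[1].text
    startStr = 'background-image:'
    endStr = '.png'
    s = re.search('%s.*%s' % (startStr, endStr), price)
    # Drop the first 21 characters (the length of 'background-image:url(').
    str_http = s.group()[21:]
    result = requests.get('http:' + str_http)
    save_folder = 'E:/spyder/ziru/img/'
    save_path = os.path.join(save_folder, '1.png')
    with open(save_path, 'wb') as fw:
        fw.write(result.content)
    im = Image.open(save_path)
    x, y = im.size
    # Paste the sprite onto a white canvas, using the sprite itself as the mask.
    p = Image.new('RGBA', (320, 40), (255, 255, 255))
    p.paste(im, (10, 5, 10 + x, 5 + y), im)
    p.save(save_path)
    return save_path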
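
The fixed 21-character slice above depends on the exact spelling of the CSS rule. A capture group is a less brittle way to pull out the link. The following is only a sketch: it assumes the rule has the shape `background-image:url(//host/path.png)`, optionally with spaces or quotes, and `extract_sprite_url` is a name introduced here:

```python
import re

# Assumed CSS shape: background-image: url(//host/path.png), optionally quoted.
SPRITE_URL_RE = re.compile(r"background-image:\s*url\(\s*['\"]?([^'\")\s]+\.png)")


def extract_sprite_url(css_text):
    """Return the sprite URL found in css_text, or None if absent."""
    match = SPRITE_URL_RE.search(css_text)
    if match is None:
        return None
    url = match.group(1)
    # Protocol-relative links (//host/...) need a scheme before requesting.
    return 'http:' + url if url.startswith('//') else url
```

Returning `None` when nothing matches lets the caller skip a page whose style block has changed, rather than failing on `s.group()`.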

Call Baidu OCR to recognize the digits in the image.

from aip import AipOcr


def recognition_img(save_path):
    APP_ID = '********'
    API_KEY = '********'
    SECRET_KEY = '********'
    aipOcr = AipOcr(APP_ID, API_KEY, SECRET_KEY)
    options = {'detect_direction': 'true', 'language_type': 'CHN_ENG'}
    with open(save_path, 'rb') as f:
        img = f.read()
    result = aipOcr.basicGeneral(img, options)
    content = result['words_result']
    # The first recognized line is the ten-digit string shown in the sprite.
    num_list = content[0]['words']
    # The k-th digit maps to offset -30*k px: '-0px', '-30px', ..., '-270px'.
    num_dict = {}
    for k in range(10):
        num_dict['background-position:-%dpx' % (30 * k)] = num_list[k]
    return num_dict
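
OCR can misread or drop a character, and a single wrong digit corrupts every price on the page. One possible safeguard is to check the recognized text before building the mapping. This sketch assumes each sprite shows every digit 0-9 exactly once, as the example 6532148907 does; `validate_digit_string` is a hypothetical helper:

```python
def validate_digit_string(words):
    """Return the OCR text if it is a permutation of 0-9, else raise ValueError."""
    digits = ''.join(words.split())  # drop any whitespace the OCR inserted
    if sorted(digits) != list('0123456789'):
        raise ValueError('unexpected OCR result: %r' % words)
    return digits
```

A failed check could trigger a retry of the OCR call, or cause the page to be skipped.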

Finally, decode the price.

def get_price(house_price_list, num_dict):
    # Look up the digit for each style value and join the digits in order.
    price = ''.join(str(num_dict[price_code]) for price_code in house_price_list)
    print(price)
    return price

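This article is intended for technical exchange only. When scraping, comply with the relevant agreements and respect basic internet ethics.

For questions about scraping techniques, follow the WeChat public account “数据观察”.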
