Data Scraping and Cleaning: Crawling Lianjia Second-Hand Housing Listings and Floor Plans

m0_53427580

Tags: python

Article link: https://blog.csdn.net/m0_53427580/article/details/121737840

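This article describes how to scrape second-hand housing listings from the Lianjia website with a crawler, including each listing's details and floor plan, and how to speed up the image downloads with asynchronous coroutines.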

0. Approach

Get the detail-page URLs from the listing page, then scrape each detail page for the listing's details and floor plan.

1. Import the required packages

# Import the required packages
import requests
import re
from bs4 import BeautifulSoup
import time
from lxml import etree
import pandas as pd
import aiohttp
import asyncio

2. Check the robots.txt file

# robots.txt is served from the site root
res = requests.get("https://bj.lianjia.com/robots.txt")
print(res.text)
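
Printing robots.txt leaves the interpretation to the reader. As a minimal sketch, the standard-library `urllib.robotparser` can answer whether a given path may be fetched. The rules in `SAMPLE_ROBOTS` are made up for illustration and are not Lianjia's actual robots.txt; `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text, url, user_agent="*"):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "https://example.com/ershoufang/pg1/"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

In the real crawler, the text passed in would be `res.text` from the request above.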

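3. Choose the city and the pages to crawl

The URL has the following format:

https://{city abbreviation}.lianjia.com/ershoufang/pg{page number}/

So changing the city abbreviation and the page number is enough to control which city and which pages we crawl.

# Choose the city to query
q = input("Enter the pinyin initials of the city you want to query (e.g. bj): ")

# List that stores the listing details
htmldata = []

# Image download links, used later by the async coroutines
hreff = []

# Crawl pages 1 and 2
for i in range(1, 3):
    pg = "pg" + str(i)
    url = f"https://{q}.lianjia.com/ershoufang/{pg}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    print(q)

    resp = requests.get(url, headers=headers)
    resp.encoding = "utf-8"

We only crawl the first two pages here, to avoid putting too much load on the server.

q selects the city to crawl, and pg controls the page number.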
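
The URL pattern above can be captured in a small helper. This is only a sketch: `build_page_urls` and its parameter names are hypothetical and do not appear in the original code.

```python
def build_page_urls(city, first_page, last_page):
    """Build listing-page URLs for pages first_page..last_page (inclusive)."""
    return [
        f"https://{city}.lianjia.com/ershoufang/pg{page}/"
        for page in range(first_page, last_page + 1)
    ]


for page_url in build_page_urls("bj", 1, 2):
    print(page_url)
```

Keeping the page range as explicit arguments makes it harder to request more pages than intended by accident.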

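4. Get the detail-page URLs and titles

To get a listing's details we have to open its detail page, so the first step is to collect the detail-page URLs.

The developer tools show that both the title and the detail-page URL sit inside a div tag whose class is title.

Going further up, we find <ul class="sellListContent">, and this tag is unique on the page.

    if resp.status_code == 200:
        print("Request OK!")

        main_page = BeautifulSoup(resp.text, "html.parser")
        # Use BeautifulSoup to find every <div class="title"> under <ul class="sellListContent">
        alist = main_page.find("ul", class_="sellListContent").find_all("div", class_="title")

        # Regular expressions that match the detail-page URL (href) and the title (name)
        obj = re.compile(r'.*?href="(?P<href>.*?)"', re.S)
        obj2 = re.compile(r'.*?blank">(?P<name>.*?)</a>', re.S)

        # Convert the list to a string so the regular expressions can search it
        alist2 = str(alist)

        result = obj.finditer(alist2)
        result2 = obj2.finditer(alist2)
        a = []

So we use BeautifulSoup to get the contents of each <div class="title">,

then use regular expressions to match the URL and the title inside it.

The text after href=" is the detail-page URL we want,

and the text after blank"> is the title.

result and result2 hold the match results.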
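
As an alternative to running regular expressions over the stringified list, the link and title can also be read directly from the parsed tags. The sketch below is not the article's method; it uses a made-up HTML sample shaped like the structure described above, and `extract_listings` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the structure described in this section.
SAMPLE_HTML = """
<ul class="sellListContent">
  <li><div class="title"><a href="https://example.com/ershoufang/1.html" target="_blank">Listing A</a></div></li>
  <li><div class="title"><a href="https://example.com/ershoufang/2.html" target="_blank">Listing B</a></div></li>
</ul>
"""


def extract_listings(html_text):
    """Return (href, title) pairs from each <div class="title"> in the list."""
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("ul", class_="sellListContent")
    if container is None:  # layout changed or the request was blocked
        return []
    pairs = []
    for div in container.find_all("div", class_="title"):
        link = div.find("a")
        if link is not None and link.get("href"):
            pairs.append((link["href"], link.get_text(strip=True)))
    return pairs


listings = extract_listings(SAMPLE_HTML)
print(listings)
```

Reading each URL and title from the same tag keeps the two values paired, so no separate counter is needed to line them up later.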

5. Get the listing details and floor plans

        # Store the image file names (name.jpg)
        for it2 in result2:
            name = it2.group("name")
            a.append(name + ".jpg")

        k = 0
        # Path of the CSV file that stores the listing details
        filename = 'E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\bj二手房.csv'

        for it in result:
            ##############################################
            # part1: extract the listing details

            href = it.group("href")
            html_resp = requests.get(url=href, headers=headers)

            # Extract the fields with XPath
            html = etree.HTML(html_resp.text)

            # Title
            bt = html.xpath("/html/body/div[3]/div/div/div[1]/h1/text()")
            # Total price
            zj = html.xpath("/html/body/div[5]/div[2]/div[3]/span[1]/text()")
            # Unit price per square meter
            dj = html.xpath("/html/body/div[5]/div[2]/div[3]/div[1]/div[1]/span/text()")
            # Location
            dd = html.xpath("//div[@class='overview']//div/span/a/text()")
            ## Basic attributes
            # Labels of the basic attributes
            bt2 = html.xpath("//div[@class='base']//span/text()")
            # Values of the basic attributes
            nr2 = html.xpath("//div[@class='base']//li/text()")
            ## Transaction attributes
            # Labels of the transaction attributes
            bt3 = html.xpath("//div[@class='transaction']//span[1]//text()")
            # Values of the transaction attributes
            nr3 = html.xpath("//div[@class='transaction']//span[2]//text()")
            ## Special features
            # Labels of the special features
            bt4 = html.xpath("//div[@class='baseattribute clear']/div[@class='name']/text()")
            # Values of the special features
            nr4 = html.xpath("//div[@class='baseattribute clear']/div[@class='content']/text()")
            # Put everything into the htmldict dictionary
            # (fixed keys: title, total price, unit price, location)
            htmldict = dict(zip(['标题', '总价格', '单价', '地段'] + bt2 + bt3 + bt4, [bt, zj, dj, dd] + nr2 + nr3 + nr4))

            # Append to htmldata
            htmldata.append(htmldict)

            ##############################################
            # part2: extract the floor plan

            href = it.group("href")

            child_page_resp = requests.get(href, headers=headers)
            child_page_resp.encoding = "utf-8"
            child_page_text = child_page_resp.text

            child_page = BeautifulSoup(child_page_text, "html.parser")

            img = child_page.find("div", class_="m-content").find("div", class_="layout").find("img")
            # print(img.get("src"))
            src = img.get("src")
            hreff.append(src)
            img_resp = requests.get(src, headers=headers)

            # Some titles contain "XXX元/月"; the "/" breaks the path, so remove it
            img_name = a[k].replace("/", "")
            path = "E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\户型图bj1\\"
            with open(path + img_name, mode="wb") as f:
                f.write(img_resp.content)
            print("over", img_name)

            # Pause between requests
            time.sleep(0.5)

            # k selects which name to use (a[k])
            k = k + 1

        print("over" + pg)

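5.1 Store the titles

Because finditer returns an iterator, we need a loop to collect the matched titles.

They are stored in the list a in the form title.jpg, which is later used to name the floor-plan images.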
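
The reason the titles are copied into a list is that the iterator returned by `finditer` can be consumed only once. A small illustration on a made-up string (the `sample` markup below is invented, and the pattern is a simplified form of the one used in the article):

```python
import re

# Made-up markup shaped like the title links on the listing page.
sample = (
    '<a href="u1" target="_blank">Title 1</a>'
    '<a href="u2" target="_blank">Title 2</a>'
)
pattern = re.compile(r'blank">(?P<name>.*?)</a>', re.S)

matches = pattern.finditer(sample)
names = [m.group("name") + ".jpg" for m in matches]  # consumes the iterator
leftover = list(matches)  # the iterator is already exhausted

print(names)
print(leftover)
```

After the list comprehension, `matches` yields nothing more, which is why the names must be saved in `a` before the main loop runs.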

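5.2 Get the listing details

Locate the title with the developer tools and copy its XPath, which gives:

/html/body/div[3]/div/div/div[1]/h1

We change it to

/html/body/div[3]/div/div/div[1]/h1/text()

to get its text content, that is, the title.

The other attributes we need are obtained the same way:

Basic attributes

Transaction attributes

Special features

Finally, the scraped content is stored in htmldata.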
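
Each XPath query returns a list, and the labels and values are paired by position with `zip`. The sketch below shows that pairing on made-up values; `first_or_none` is a hypothetical helper that is not part of the original code, and the English keys are placeholders for the scraped labels.

```python
def first_or_none(values):
    """XPath queries return lists; take the first item, stripped, or None."""
    return values[0].strip() if values else None


# Made-up values shaped like XPath results (lists of strings).
bt = ["Sample listing title"]
zj = ["500"]
bt2 = ["Layout", "Floor area"]
nr2 = ["2 bedrooms 1 living room", "89.5 sqm"]

record = dict(
    zip(
        ["title", "total_price"] + bt2,
        [first_or_none(bt), first_or_none(zj)] + nr2,
    )
)
print(record)
```

Note that `zip` stops at the shorter sequence, so if a page yields fewer values than labels, the trailing labels are dropped without any error.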

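5.3 Get the floor plan

Use the developer tools to locate the image download link, then work upward to <div class="m-content">, from there down to <div class="layout">, and finally to the img tag.

The code is:

 img = child_page.find("div",class_="m-content").find("div",class_="layout").find("img")

The image download link we need is in the src attribute of the img tag,

so we use get to read the download address.

Finally, img_resp.content gives us the content, that is, the image we want.

To download the images later with async coroutines, the image download links are also stored in hreff.

src = img.get("src")
hreff.append(src)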
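6. Save the floor plans and the listing details

            # Some titles contain "XXX元/月"; the "/" breaks the path, so remove it
            img_name = a[k].replace("/", "")
            path = "E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\户型图bj1\\"
            with open(path + img_name, mode="wb") as f:
                f.write(img_resp.content)
            print("over", img_name)

6.1 Save the floor plans to a folder

The list a holds the titles collected earlier; here we use those titles to name the images.

However, some titles contain text such as "xxx元/月" or "xxx元/平", and the "/" in them breaks the output path, so replace is used to change "/" to "".

The images are then written to the folder we set up (户型图bj1).

We also print over plus the title, to show which image has just been downloaded.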
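
The slash is not the only character that can break a file name: on Windows, the characters `\ / : * ? " < > |` are all disallowed. A slightly more general cleanup could look like this sketch, where `safe_filename` and the `"untitled"` fallback are assumptions for illustration:

```python
import re

# Characters that Windows does not allow in file names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, extension=".jpg"):
    """Strip characters that are illegal in Windows file names."""
    cleaned = _ILLEGAL.sub("", title).strip()
    return (cleaned or "untitled") + extension


print(safe_filename("Two-bedroom, 5000 yuan/month"))
```

Like the `replace` call above, this removes the offending characters rather than substituting them, so two different titles can still collapse to the same name.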

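6.2 Save the listing details

filename = 'E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\bj二手房.csv'
pd.DataFrame(htmldata).to_csv(filename, encoding='GB18030')
print("csv over")

The listing details collected earlier are stored in htmldata; here they are written to bj二手房.csv.

Finally, "csv over" is printed to show that the export has finished.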

7. Close all the requests

    res.close()
    resp.close()
    html_resp.close()
    child_page_resp.close()

    if resp.status_code == 404:
        print("Page not found")

    if resp.status_code == 403:
        print("Access forbidden")

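8. Speed up the floor-plan downloads with async coroutines

async def aiodownload(urlhref):
    path1 = "E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\户型图bj2\\"
    # Use the last segment of the URL as the file name
    name1 = urlhref.rsplit("/", 1)[1]
    async with aiohttp.ClientSession() as session:
        async with session.get(urlhref) as resp1:
            with open(path1 + name1, mode="wb") as f1:
                f1.write(await resp1.content.read())
    print(urlhref, "done")

async def main():
    tasks = []
    for urlhref in hreff:
        tasks.append(aiodownload(urlhref))
    # asyncio.wait() no longer accepts bare coroutines in recent Python versions
    await asyncio.gather(*tasks)


if __name__ == '__main__':
    time1 = time.time()
    # In a plain script use asyncio.run(); inside Jupyter, which already
    # runs an event loop, use "await main()" instead
    asyncio.run(main())
    time2 = time.time()
    print(time2 - time1)

As the output shows, the run took only a little over 1 second, much faster than the sequential downloads above.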
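
Launching every download at once is fast, but it also sends all the requests to the server at the same moment. One common way to keep the speed while bounding the load is an `asyncio.Semaphore`. The sketch below is an assumption-laden illustration: `fake_download` sleeps instead of touching the network, and `download_all` and its `limit` parameter are hypothetical names.

```python
import asyncio


async def fake_download(url, semaphore, state):
    """Stand-in for an HTTP download; sleeps instead of touching the network."""
    async with semaphore:  # at most `limit` downloads run inside this block
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
    return url.rsplit("/", 1)[1]


async def download_all(urls, limit=3):
    """Run all downloads concurrently, but never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    names = await asyncio.gather(
        *(fake_download(u, semaphore, state) for u in urls)
    )
    return names, state["peak"]


urls = [f"https://example.com/img/{i}.jpg" for i in range(10)]
names, peak = asyncio.run(download_all(urls))
print(names[:3], peak)
```

In the real crawler, the body of `fake_download` would be replaced by the `aiohttp` request from `aiodownload`, with the semaphore wrapped around it in the same way.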

9. Complete code

## Final version
# Import the required packages
import requests
import re
from bs4 import BeautifulSoup
import time
from lxml import etree
import pandas as pd
import aiohttp
import asyncio


# Check the robots.txt file (served from the site root)
res = requests.get("https://bj.lianjia.com/robots.txt")
print(res.text)

# Choose the city to query
q = input("Enter the pinyin initials of the city you want to query (e.g. bj): ")

# List that stores the listing details
htmldata = []

# Image download links, used later by the async coroutines
hreff = []

# Crawl pages 1 and 2
for i in range(1, 3):
    pg = "pg" + str(i)
    url = f"https://{q}.lianjia.com/ershoufang/{pg}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    print(q)

    resp = requests.get(url, headers=headers)
    resp.encoding = "utf-8"

    if resp.status_code == 200:
        print("Request OK!")

        main_page = BeautifulSoup(resp.text, "html.parser")
        # Use BeautifulSoup to find every <div class="title"> under <ul class="sellListContent">
        alist = main_page.find("ul", class_="sellListContent").find_all("div", class_="title")

        # Regular expressions that match the detail-page URL (href) and the title (name)
        obj = re.compile(r'.*?href="(?P<href>.*?)"', re.S)
        obj2 = re.compile(r'.*?blank">(?P<name>.*?)</a>', re.S)

        # Convert the list to a string so the regular expressions can search it
        alist2 = str(alist)

        result = obj.finditer(alist2)
        result2 = obj2.finditer(alist2)
        a = []
        # Store the image file names (name.jpg)
        for it2 in result2:
            name = it2.group("name")
            a.append(name + ".jpg")

        k = 0
        # Path of the CSV file that stores the listing details
        filename = 'E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\bj二手房.csv'

        for it in result:
            ##############################################
            # part1: extract the listing details

            href = it.group("href")
            html_resp = requests.get(url=href, headers=headers)

            # Extract the fields with XPath
            html = etree.HTML(html_resp.text)

            # Title
            bt = html.xpath("/html/body/div[3]/div/div/div[1]/h1/text()")
            # Total price
            zj = html.xpath("/html/body/div[5]/div[2]/div[3]/span[1]/text()")
            # Unit price per square meter
            dj = html.xpath("/html/body/div[5]/div[2]/div[3]/div[1]/div[1]/span/text()")
            # Location
            dd = html.xpath("//div[@class='overview']//div/span/a/text()")
            ## Basic attributes
            # Labels of the basic attributes
            bt2 = html.xpath("//div[@class='base']//span/text()")
            # Values of the basic attributes
            nr2 = html.xpath("//div[@class='base']//li/text()")
            ## Transaction attributes
            # Labels of the transaction attributes
            bt3 = html.xpath("//div[@class='transaction']//span[1]//text()")
            # Values of the transaction attributes
            nr3 = html.xpath("//div[@class='transaction']//span[2]//text()")
            ## Special features
            # Labels of the special features
            bt4 = html.xpath("//div[@class='baseattribute clear']/div[@class='name']/text()")
            # Values of the special features
            nr4 = html.xpath("//div[@class='baseattribute clear']/div[@class='content']/text()")
            # Put everything into the htmldict dictionary
            # (fixed keys: title, total price, unit price, location)
            htmldict = dict(zip(['标题', '总价格', '单价', '地段'] + bt2 + bt3 + bt4, [bt, zj, dj, dd] + nr2 + nr3 + nr4))

            # Append to htmldata
            htmldata.append(htmldict)

            ##############################################
            # part2: extract the floor plan

            href = it.group("href")

            child_page_resp = requests.get(href, headers=headers)
            child_page_resp.encoding = "utf-8"
            child_page_text = child_page_resp.text

            child_page = BeautifulSoup(child_page_text, "html.parser")

            img = child_page.find("div", class_="m-content").find("div", class_="layout").find("img")
            # print(img.get("src"))
            src = img.get("src")
            hreff.append(src)
            img_resp = requests.get(src, headers=headers)

            # Some titles contain "XXX元/月"; the "/" breaks the path, so remove it
            img_name = a[k].replace("/", "")
            path = "E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\户型图bj1\\"
            with open(path + img_name, mode="wb") as f:
                f.write(img_resp.content)
            print("over", img_name)

            # Pause between requests
            time.sleep(0.5)

            # k selects which name to use (a[k])
            k = k + 1

        print("over" + pg)

    # Write the listing details to the CSV file
    pd.DataFrame(htmldata).to_csv(filename, encoding='GB18030')
    print("csv over")

    # Close all the requests
    res.close()
    resp.close()
    html_resp.close()
    child_page_resp.close()

    if resp.status_code == 404:
        print("Page not found")

    if resp.status_code == 403:
        print("Access forbidden")

############################# Method 2: download the images with async coroutines
async def aiodownload(urlhref):
    path1 = "E:\\大三上\\课程\\数据抓取与数据清洗\\期末作业\\期末大作业\\户型图bj2\\"
    # Use the last segment of the URL as the file name
    name1 = urlhref.rsplit("/", 1)[1]
    async with aiohttp.ClientSession() as session:
        async with session.get(urlhref) as resp1:
            with open(path1 + name1, mode="wb") as f1:
                f1.write(await resp1.content.read())
    print(urlhref, "done")

async def main():
    tasks = []
    for urlhref in hreff:
        tasks.append(aiodownload(urlhref))
    # asyncio.wait() no longer accepts bare coroutines in recent Python versions
    await asyncio.gather(*tasks)


if __name__ == '__main__':
    time1 = time.time()
    # In a plain script use asyncio.run(); inside Jupyter, which already
    # runs an event loop, use "await main()" instead
    asyncio.run(main())
    time2 = time.time()
    print(time2 - time1)