Fixing a regular expression problem when scraping floor-area information from Lianjia rental pages

Article link: https://blog.csdn.net/giggle_heihei/article/details/134753907

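Project scenario:

Scraping average rental prices from Lianjia

Through a search engine I found a code package for scraping Lianjia second-hand housing data. Once it ran successfully on second-hand listings, I reused it directly to scrape rental data.
It uses BeautifulSoup to scrape the floor area (in square metres) and the price of each listing on the Lianjia rental site.
I re-identified the page elements and adapted the code to match.

Problem description

The regular expression was used incorrectly, and the wrapped code had not been taken apart for debugging.

First, the HTML snippet that contains the floor area: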

<p class="content__list--item--des">
                <a target="_blank" href="/zufang/dongcheng/">东城</a>-<a href="/zufang/dongzhimen/" target="_blank">东直门</a>-<a title="海运仓胡同3号院" href="/zufang/c1111027375431/" target="_blank">海运仓胡同3号院</a>
        <i>/</i>
        64.00㎡
        <i>/</i>东 西        <i>/</i>
          3室1厅1卫        <span class="hide">
          <i>/</i>
          中楼层                        （5层）
                  </span>
      </p>

We want to extract the value 64.00㎡.

The original code for scraping second-hand listings, which we used as a reference, is:

title = li.find('div', class_='title').a.get_text()

The initial idea was to capture 64.00㎡ with a regular expression, so the extraction code was written by imitating the line above:

square = div.find('p',class_="content__list--item--des").get_text().find(text=re.compile(r'\b(\d+\.\d+)s*㎡\b'))

However, the result came back empty: 64.00㎡ was not captured.
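The empty result can be reproduced offline. The following is a minimal sketch, assuming the variable text stands in for what get_text() returns for the snippet above (the exact whitespace is an assumption). get_text() yields a plain str, and str.find only takes a substring plus optional start and end positions, so passing text= raises a TypeError. Inside a try/except that only prints the error, the row is then never appended.

```python
import re

# Stand-in for div.find('p', ...).get_text(); the exact page text is an assumption.
text = "东城-东直门-海运仓胡同3号院\n/\n        64.00㎡\n        /东 西"

error = None
try:
    # Same shape as the failing call: str.find() is handed a keyword argument.
    text.find(text=re.compile(r'\b(\d+\.\d+)s*㎡\b'))
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
```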

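The result is empty because the regular expression is applied the wrong way: the pattern has to be compiled and then searched against the text. The pattern itself also has two slips: s* should be \s*, and the trailing \b can never match here, because ㎡ is not a word character and is followed by whitespace. A corrected version:

import re

# \s* permits optional whitespace before the unit; no \b after ㎡ (it is a symbol, not a word character)
square_pattern = re.compile(r'\b(\d+\.\d+)\s*㎡')
square_match = square_pattern.search(div.find('p', class_="content__list--item--des").get_text())

if square_match:
    square = square_match.group(1)
else:
    square = None  # or handle the case where the pattern is not found

In short, compile a pattern and call its search (or findall) method on the string.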
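To check the pattern-based approach in isolation, here is a small sketch that runs against a hard-coded string shaped like the get_text() output of the snippet above. The names text and area_pattern are illustrative, and the sample whitespace is an assumption. It also runs the pattern exactly as first written, to show that it finds nothing in the same text.

```python
import re

# Text shaped like get_text() output for the snippet above (whitespace kept on purpose).
text = "\n        /\n        64.00㎡\n        /东 西        /\n          3室1厅1卫"

# \s* allows optional whitespace before the unit. There is no trailing \b:
# ㎡ is a symbol, not a word character, so \b cannot match between ㎡ and a newline.
area_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*㎡')

match = area_pattern.search(text)
area = match.group(1) if match else None

# The pattern as first written, applied to the same text.
original_match = re.search(r'\b(\d+\.\d+)s*㎡\b', text)

print(area)            # 64.00
print(original_match)  # None
```

Making the decimal part optional is a design choice in this sketch so that a whole-number area would also be captured.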

Adapting the chained format of the original second-hand-listing code also turns out to work.
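One way a chained call can work, offered as a sketch and not as the code this post ends up using: call find(string=...) on the Tag itself, not on the str returned by get_text(). It assumes bs4 is installed, uses a trimmed copy of the snippet above, and parses with html.parser so that lxml is not needed.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the snippet above; the trimming is for illustration only.
html = '''<p class="content__list--item--des">
  <a href="/zufang/dongcheng/">东城</a>
  <i>/</i>
  64.00㎡
  <i>/</i>东 西
</p>'''

p = BeautifulSoup(html, 'html.parser').find('p', class_='content__list--item--des')

# Called on the Tag, find(string=...) returns the first text node
# whose content matches the compiled pattern.
node = p.find(string=re.compile(r'㎡'))
square = node.strip().rstrip('㎡') if node else None

print(square)  # 64.00
```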
The first step is to take the wrapped code apart and trace the problem back to its source.
The wrapped code:

import re
from bs4 import BeautifulSoup

def extract_info(html, district):
    soup = BeautifulSoup(html, 'lxml')
    data = []
    for div in soup.find_all('div', class_='content__list--item'):
        try:
            title = div.find('p', class_='content__list--item--title').get_text(strip=True)
            # Failing line under investigation: get_text() returns a str, and str.find() rejects text=
            square = div.find('p', class_='content__list--item--des').get_text().find(text=re.compile(r'\b(\d+\.\d+)s*㎡\b'))
            price = div.find('span', class_='content__list--item-price').em.get_text(strip=True)
            data.append([district, title, square, price])
        except Exception as e:
            # The error is only printed, so a failing row never reaches data
            print('extract_info: ', e)
            print(title)
    return data

Take html out of the function, build soup on its own, and inspect a single div element to see where the area information actually sits.

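import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined earlier in the script (request headers such as a User-Agent)
html = requests.get('https://bj.lianjia.com/zufang/', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
divs = soup.find_all('div', class_='content__list--item')
square = divs[0].find('p', class_='content__list--item--des')
square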

Evaluating square gives:

<p class="content__list--item--des">
                  精选          <i>/</i>
<a href="/zufang/chaoyang/" target="_blank">朝阳</a>-<a href="/zufang/liangmaqiao/" target="_blank">亮马桥</a>-<a href="/zufang/c1111027379671/" target="_blank" title="三源里南小街">三源里南小街</a>
<i>/</i>
        59.85㎡
        <i>/</i>东南        <i>/</i>
          2室1厅1卫        <span class="hide">
<i>/</i>
          高楼层                        （6层）
                  </span>
</p>

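Suppose the regular expression is not working; how else can 59.85㎡ be extracted?

Look closely at the output, then fall back on the basics.

First use ㎡ as the delimiter and split the string on it, keeping the part before it. Checking square for divs[1] and divs[2] as well shows that the number is always preceded by whitespace, so split that part again on whitespace and take the last item. That item is the number we want: the floor area.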

square = divs[0].find('p',class_='content__list--item--des').get_text().split("㎡")[0].split()[-1]
square

'79.09'
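The two split steps can be followed one at a time in the sketch below. It uses a hard-coded string shaped like the get_text() output of the tag printed earlier (59.85㎡), so no network access is needed. The sample whitespace is an assumption, and area_or_none is a hypothetical helper added here to show one way of guarding against listings without an area.

```python
# Text shaped like get_text() output for the <p> tag printed earlier.
text = ("\n    精选          /\n朝阳-亮马桥-三源里南小街\n/\n"
        "        59.85㎡\n        /东南        /\n          2室1厅1卫")

before_unit = text.split("㎡")[0]  # everything to the left of the first ㎡
tokens = before_unit.split()       # split on any run of whitespace
square = tokens[-1]                # the last token is the number

print(square)  # 59.85


def area_or_none(raw):
    """Hypothetical helper: return the area as a float, or None if it is missing."""
    if "㎡" not in raw:
        return None
    parts = raw.split("㎡")[0].split()
    try:
        return float(parts[-1]) if parts else None
    except ValueError:
        return None
```

Calling split() with no argument splits on any run of whitespace and drops empty strings, which is why the indentation and newlines around the number do not matter.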

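Root cause analysis:

Shaky fundamentals
Did not take the wrapped code apart
Unfamiliar with how regular expressions and split are used

Solution:

Take notes and summarize promptly
Study and practice often, and think things through
Keep the right attitude: don't be both slow and lazy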
