Scraping Lianjia Rental Listings

Latest recommended article published 2021-05-11 07:47:27

b1gx

Article link: https://blog.csdn.net/qq_40727267/article/details/89049195

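Inspecting the site shows that the listing pages follow the URL pattern https://nb.lianjia.com/zufang/pg{}/, where {} is the page number (1, 2, 3, 4, ...).
There are 100 pages in total, so a loop is used to generate these URLs: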

for x in range(1, 101):
    url = 'https://nb.lianjia.com/zufang/pg%d/' % x
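If the URLs are needed more than once, it can help to collect them in a list up front. The following is a minimal sketch of that idea; the names `BASE_URL` and `page_urls` are illustrative and do not come from the original post.

```python
# Pattern taken from the post; {} is replaced by the page number.
BASE_URL = "https://nb.lianjia.com/zufang/pg{}/"


def page_urls(first=1, last=100):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(page) for page in range(first, last + 1)]


urls = page_urls()
print(len(urls), urls[0], urls[-1])
```

Calling `page_urls(1, 3)` instead is a convenient way to test the rest of the scraper on a few pages before running all 100.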

The browser's developer tools show that this URL is requested with the GET method,
so a GET request is sent to fetch the page source:

response = requests.get(url, headers=headers)
result = response.text  # page source, parsed in the next step
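The post does not show how `headers` is defined. The sketch below assumes it is a dictionary carrying a browser-like User-Agent, which is a common choice but not something the original confirms. The helper name `fetch_html`, the header value, and the 10-second timeout are all illustrative.

```python
import requests

# Assumed header values: the original post does not list them.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_html(url, timeout=10):
    """Return the page source for url, raising on HTTP error statuses."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of parsing an error page as listings.
    response.raise_for_status()
    return response.text
```

A timeout is worth setting because `requests.get` otherwise waits indefinitely on a stalled connection, which matters when looping over 100 pages.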

Use lxml to parse the source into a well-formed HTML tree, then select the div element that holds each listing:

html = etree.HTML(result, etree.HTMLParser())
divs = html.xpath("//div[@class='content__list--item']")
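To see what this XPath selects without hitting the site, it can be tried on a small hand-written fragment. The HTML below is a made-up sample that only mimics the class names used in the post; the real page is larger and may differ.

```python
from lxml import etree

# Made-up sample markup that reuses the class names from the post.
sample = """
<div class="content__list">
  <div class="content__list--item">
    <p class="content__list--item--title"><a href="/zufang/A.html"> Listing A </a></p>
  </div>
  <div class="content__list--item">
    <p class="content__list--item--title"><a href="/zufang/B.html"> Listing B </a></p>
  </div>
</div>
"""

html = etree.HTML(sample)
# The exact-match predicate skips the outer container, whose class is "content__list".
divs = html.xpath("//div[@class='content__list--item']")

titles = [
    "".join(d.xpath(".//p[contains(@class,'content__list--item--title')]/a/text()")).strip()
    for d in divs
]
links = [
    "".join(d.xpath(".//p[contains(@class,'content__list--item--title')]/a/@href"))
    for d in divs
]
print(titles, links)
```

Note that `@class='...'` matches the whole attribute value, while `contains(@class, '...')` also matches elements that carry additional classes.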

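Because there is a lot of data, there is no guarantee that every listing contains every field. To keep the code robust, the extraction is wrapped in a try block and any field that cannot be extracted is left as None.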

    title = None
    house_url = None
    local = None
    area = None
    home = None
    price = None
    label = None
    orientation = None
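One possible variant of the same idea, not the approach used in the post, is to guard each positional lookup individually so that a single missing field does not prevent the remaining ones from being read. The helper `pick` and the sample list are hypothetical.

```python
def pick(items, index):
    """Return items[index] with whitespace stripped, or None if it does not exist."""
    try:
        return items[index].strip()
    except IndexError:
        return None


# Made-up text nodes standing in for one listing's description line.
info = ["", "District", "-", "Block", "/", "", " 60㎡ "]

area = pick(info, 6)          # present
orientation = pick(info, 8)   # index out of range, so None
print(area, orientation)
```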

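After analyzing the page source, XPath expressions with suitable matching rules are used to extract the required content. Because the layout differs from listing to listing, each extracted value is checked and, where it does not look right, re-read from another position to obtain more complete data.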

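    for div in divs:
        # Reset every field so a failed extraction cannot reuse the previous listing's values.
        title = house_url = local = area = home = price = label = orientation = None
        try:
            title = "".join(div.xpath(".//p[contains(@class,'content__list--item--title')]/a/text()")).strip()
            house_url = "https://nb.lianjia.com" + "".join(
                div.xpath(".//p[contains(@class,'content__list--item--title')]/a/@href"))
            # Text nodes of the description line; field positions vary between listings.
            info = div.xpath(".//p[@class='content__list--item--des']//text()")
            local = "".join(info[1:4])
            if "/" in local:
                # The joined text contains a separator, so read the brand line instead.
                local = "".join(div.xpath(".//p[contains(@class,'content__list--item--brand')]/text()")).strip()
            area = info[6].strip()
            if "㎡" not in area:  # no square-metre sign, so this is not the area
                area = info[4].strip()
            orientation = info[8].strip()
            if "室" in orientation or "厅" in orientation:  # this is the room layout, not the orientation
                orientation = info[6].strip()
            home = info[-1].strip()
            if "室" not in home and "厅" not in home:  # "室" = room, "厅" = living room
                home = info[10].strip()
            price = "".join(div.xpath(".//span[@class='content__list--item-price']//text()")).strip()
            label = "/".join(div.xpath(".//p[contains(@class,'content__list--item--bottom')]//text()")).strip()
            label = re.sub(r"\s", "", label)  # drop all whitespace
            label = re.sub("//", "/", label)  # collapse doubled separators
        except Exception:
            # Fields not reached before the error keep the value None.
            pass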
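The label clean-up at the end of the loop is easier to follow on a concrete input. The sketch below reproduces those three steps in a function and compares them with a variant that drops empty text nodes before joining. The function names and the sample text nodes are invented for illustration.

```python
import re


def clean_label(texts):
    """Same steps as in the loop: join with '/', strip, remove whitespace, merge '//'."""
    label = "/".join(texts).strip()
    label = re.sub(r"\s", "", label)
    label = re.sub("//", "/", label)
    return label


def clean_label_filtered(texts):
    """Variant: discard whitespace-only nodes first, so no stray separators remain."""
    return "/".join(t.strip() for t in texts if t.strip())


# Made-up text nodes: tags separated by whitespace-only nodes.
texts = ["\n  ", "subway", "\n  ", "furnished", "\n "]

print(clean_label(texts))           # /subway/furnished/
print(clean_label_filtered(texts))  # subway/furnished
```

On this input the original steps leave a leading and trailing `/`, which is harmless for a spreadsheet cell. The filtered variant avoids them, though unlike the original it keeps spaces inside a tag.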

Save the data to an Excel file:

f = xlwt.Workbook(encoding='utf_8')
sheet01 = f.add_sheet(u'sheet1', cell_overwrite_ok=True)
# Header row
sheet01.write(0, 0, '标题')  # title
sheet01.write(0, 1, '地区')  # district
sheet01.write(0, 2, '面积')  # area
sheet01.write(0, 3, '朝向')  # orientation
sheet01.write(0, 4, '厅室')  # room layout
sheet01.write(0, 5, '价格')  # price
sheet01.write(0, 6, '标签')  # tags
sheet01.write(0, 7, '链接')  # link


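# num is the row index of the current listing; row 0 holds the header, so listings should start at row 1
sheet01.write(num, 0, title)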
sheet01.write(num, 1, local)
sheet01.write(num, 2, area)
sheet01.write(num, 3, orientation)
sheet01.write(num, 4, home)
sheet01.write(num, 5, price)
sheet01.write(num, 6, label)
sheet01.write(num, 7, house_url)

f.save('info.xls')
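The post does not show how `num` advances. One way to handle it, sketched here under the assumption that each listing has been collected as a dictionary keyed by the variable names used above, is a small helper that writes one row per listing. `COLUMNS` and `write_rows` are hypothetical names. `sheet` can be any object with a `write(row, col, value)` method, such as the `sheet01` worksheet created above.

```python
# Column order matches the header row: title, district, area, orientation,
# room layout, price, tags, link.
COLUMNS = ("title", "local", "area", "orientation", "home", "price", "label", "house_url")


def write_rows(sheet, rows, first_row=1):
    """Write each dict in rows to sheet, one listing per row; return the next free row."""
    num = first_row  # row 0 is reserved for the header
    for row in rows:
        for col, key in enumerate(COLUMNS):
            # Missing keys are written as None, mirroring the defaults used during extraction.
            sheet.write(num, col, row.get(key))
        num += 1
    return num
```

With this helper, the scraper would append a dictionary to a `rows` list at the end of each loop iteration and call `write_rows(sheet01, rows)` once before `f.save(...)`.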