Notes on Apartment Hunting with a Python Web Scraper

Latest recommended article published on 2024-06-05 21:52:54

爷一隐居青楼

Category: Scripts. Tags: Python, web scraping

Article link: https://blog.csdn.net/u011144214/article/details/94581731

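My lease was about to expire, so I started looking for a place nearby. There were far too many listings, scattered across too many sites, to browse comfortably. As a programmer, how could I not do something about it?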

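A web scraper came to mind: crawl the rental listings on Lianjia, 58, Anjuke, and similar sites, then save them to an Excel spreadsheet where all the rental information can be read at a glance. Lianjia serves as the example below: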

First, open the Lianjia site and choose a district. The resulting listing page looks like this:

https://sh.lianjia.com/zufang/minhang/pg1/#contentList

Notice that the page number corresponds to the highlighted part of the URL (the "pg1" segment).

In code, this becomes a loop over all the pages (70 in total) that builds each URL and fetches its HTML.
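A minimal sketch of that loop, assuming page numbers start at 1 as in the URL above. The helper name build_page_urls and its district parameter are illustrative and do not appear in the original script.

```python
def build_page_urls(total_pages, district="minhang"):
    """Return the listing URL for each page, numbered from 1 to total_pages."""
    base = "https://sh.lianjia.com/zufang/{district}/pg{page}/#contentList"
    # range(1, total_pages + 1) because the site's first page is pg1, not pg0
    return [base.format(district=district, page=page)
            for page in range(1, total_pages + 1)]


urls = build_page_urls(70)
print(urls[0])    # first page
print(len(urls))  # number of pages to crawl
```

Building the list up front keeps the URL logic separate from the fetching, so the page range is easy to shrink for a quick test run.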

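Set the request headers before sending the request. Without them Lianjia responds with a 403 error (some sites, such as 58, do not require them).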
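As a rough sketch of the header setup, the snippet below prepares a request without sending it, so the headers can be inspected offline. build_request is a hypothetical helper, and the header set is a trimmed-down assumption; the User-Agent string is the one used in the full script.

```python
import requests

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Prepare (but do not send) a GET request carrying browser-like headers."""
    headers = {
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "User-Agent": user_agent,
    }
    return requests.Request("GET", url, headers=headers).prepare()


prepared = build_request("https://sh.lianjia.com/zufang/minhang/pg1/")
print(prepared.headers["User-Agent"])

# To actually send it (network access required):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     print(response.status_code)
```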

Next, open the browser's developer tools (F12). Each listing sits in its own div, and the divs follow a fixed pattern.

The details are as follows:

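The Python code below reads the attributes and text of those DOM elements, applies some simple filtering, loops over the results, and writes them to the Excel sheet.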
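To illustrate the extraction and filtering step in isolation, here is a small sketch that runs on a hand-written HTML sample instead of the live site. The sample markup and its listings are made up and only imitate the class names used in the full script; extract_listings is a hypothetical helper, and the built-in "html.parser" is used here in place of lxml to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the listing structure; the listings are invented.
SAMPLE_HTML = """
<div class="content__list">
  <div class="content__list--item">
    <a href="/zufang/SH0001.html"><img alt="整租·示例小区A 2室1厅 南"></a>
    <span class="content__list--item-price"><em>5000</em> 元/月</span>
    <p class="content__list--item--time">3天前发布</p>
  </div>
  <div class="content__list--item">
    <a href="/zufang/SH0002.html"><img alt="整租·示例小区B 1室1厅 南"></a>
    <span class="content__list--item-price"><em>3500</em> 元/月</span>
    <p class="content__list--item--time">1天前发布</p>
  </div>
</div>
"""


def extract_listings(html, keyword="2室"):
    """Return one dict per listing whose title contains keyword."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".content__list--item"):
        # The listing title is read from the thumbnail's alt attribute
        img = item.find("img")
        title = img.get("alt", "") if img else ""
        if keyword not in title:
            continue  # simple filter: skip listings that do not match
        rows.append({
            "title": title,
            "url": "https://sh.lianjia.com" + item.find("a").get("href"),
            "price": item.find(class_="content__list--item-price").get_text().split(" ")[0],
            "posted": item.find(class_="content__list--item--time").get_text().strip(),
        })
    return rows


rows = extract_listings(SAMPLE_HTML)
for row in rows:
    print(row)
```

Returning plain dicts keeps the parsing independent of the output format, so the same rows could be written to Excel, a database, or anything else.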

That is the overall approach. Here is the complete code, followed by the results.

# -*- coding:utf-8 -*-
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from time import sleep
import pymongo
import xlwt

# Excel cell styles: style0 (red, bold) for the header row, style1 (green, bold) for data rows
style0 = xlwt.easyxf('font: name Times New Roman, color-index red, bold on',num_format_str='#,##0.00')
style1 = xlwt.easyxf('font: name Times New Roman, color-index green, bold on',num_format_str='#,##0.00')
# Create the workbook and the worksheet that will hold the results
wb = xlwt.Workbook()
ws = wb.add_sheet('A Test Sheet')

def get_one_page(url):
    try:
        headers = {
            "Accept": "*/*",
            # "br" is left out: requests can only decode Brotli if a Brotli package is installed
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "sh.lianjia.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
        }
        # A timeout keeps the crawl from hanging on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        else:
            return None
    except RequestException:
        return None


def parse_one_page(html, id):
    soup = BeautifulSoup(html, 'lxml')
    prefix = "https://sh.lianjia.com"
    for item in soup.select('.content__list--item'):
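        # The listing title is stored in the thumbnail's alt attribute
        img = item.find("img")
        houseInfo = img.get("alt", "") if img else ""
        # Keep two-bedroom listings only ("2室" means "2 bedrooms")
        if "2室" in houseInfo: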
            houseUrl = prefix + item.find("a").get("href")
            housePrice = item.find(class_="content__list--item-price").get_text().split(" ")[0]
            houseTime = item.find(class_="content__list--item--time").get_text()
            print("Address:  " + houseInfo + "   Price:" + housePrice + "   Posted:" + houseTime + "   url:" + houseUrl)
            # Write one spreadsheet row per listing
            ws.write(id+1, 0, houseInfo, style1)
            ws.write(id+1, 1, housePrice, style1)
            ws.write(id+1, 2, houseTime, style1)
            ws.write(id+1, 3, houseUrl, style1)
            id += 1
            yield {
                '_id': id,
                'houseUrl': houseUrl,
                'houseInfo': houseInfo,
                'housePrice': housePrice,
                'houseTime': houseTime
            }, id
    # Once the page has been processed, save the workbook to house.xls
    wb.save('house.xls')


if __name__ == '__main__':
    client = pymongo.MongoClient('mongodb://localhost:27017')
    db_name = 'lianjia_zufang_shanghai'
    db = client[db_name]
    collection_set01 = db['set01']
    index = 0
    # Write the header row of the sheet
    ws.write(0, 0, "Address", style0)
    ws.write(0, 1, "Price", style0)
    ws.write(0, 2, "Posted", style0)
    ws.write(0, 3, "url", style0)
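    # Total number of pages to crawl (70 when this was written); lower it for a quick test
    total_pages = 70
    # Lianjia page numbers start at 1, so iterate from pg1 to pg<total_pages>
    for page in range(1, total_pages + 1):
        sleep(1)  # pause between requests to go easy on the server
        # Build the URL for this page
        url = 'https://sh.lianjia.com/zufang/minhang/pg'+str(page)+'/#contentList'
        html = get_one_page(url)
        if html is None:
            # The request failed or returned a non-200 status; skip this page
            continue
        for item, index in parse_one_page(html, index):
            # Uncomment to also store each listing in MongoDB
            #collection_set01.replace_one({'_id': item['_id']}, item, upsert=True)
            pass
    print("Done")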

Run output

The generated data as stored in the Excel file
