Scraping housing data with Python

李哒哒哒

Article link: https://blog.csdn.net/weixin_45829562/article/details/112159707

Scraping housing listings from 恋家网 (www.ljia.net) with Python
Without further ado, here is the code:

import requests
from lxml import etree
import re

def fangwu():
    with open(r'D:\python数据采集与可视化\fangwu.csv', 'a') as f:
        # CSV header: region, detailed address, phone, status, price, housing type
        f.write('地区,详细地址,联系电话,房屋状态,价格,房屋类型'+'\n')
        url1 = 'http://www.ljia.net/new/p-{}.html'
        # Loop over the page numbers to scrape multiple pages
        for i in range(100):
            url=url1.format(i+1)
            # Pretend to be a browser
            headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" }
            # Request the formatted page URL (not the url1 template)
            r=requests.get(url=url,headers=headers)
            res=r.content.decode('utf-8')
            conlist=etree.HTML(res)
            # Select the tags that hold all listings on the page
            results=conlist.xpath("//*[@id='FyList']/div[2]//div")
            if results:
                for result in results:
                    # Region
                    region1=result.xpath(".//ul[@class='boxText']/h3/a/text()")
                    region=region1[0]
                    # Detailed address
                    # The address is in the first li tag
                    address1=result.xpath(".//ul[@class='boxText']/li[1]/span/text()")
                    # Rest of the address (the text after the span)
                    address2=result.xpath(".//ul[@class='boxText']/li[1]/text()")
                    # Strip newlines and tabs with a regular expression
                    address2=re.sub(r'\n|\t','',address2[1])
                    # Join the two parts of the address
                    address=address1[0]+address2
                    # print(address)
                    # Phone number
                    phone1=result.xpath(".//ul[@class='boxText']/li[2]/b/text()")
                    phone=phone1[0]
                    # Housing type
                    type1=result.xpath(".//ul[@class='boxText']/li[3]/span/text()")
                    type2=result.xpath(".//ul[@class='boxText']/li[3]/text()")
                    type2[1]=re.sub(r'\t','',type2[1])
                    # If the text here is '认购' (subscription), record the type as '未知' (unknown)
                    if type2[1]=='认购':
                        house_type='未知'
                    else:
                        house_type=type1[0]+type2[1]
                    # Listing status
                    state=type1[1]+type2[-1]
                    # Price
                    price1=result.xpath(".//p[@class='price']/b/text()")
                    price2=result.xpath(".//p[@class='price']/text()")
                    price2[-1]=re.sub(r'\t|\n|\r','',price2[-1])
                    # price1 is a list, so test its first element; '待定' means "to be determined"
                    if not price1 or price1[0]=='待定':
                        price='待定'
                    else:
                        price=price1[0]+price2[-1]
                    f.write(region+','+address+','+phone+','+state+','+price+','+house_type+'\n')


if __name__=="__main__":
    fangwu()
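
One thing to watch in the script above: each row is built by joining the fields with commas, so an address that itself contains a comma would shift the columns. Below is a minimal sketch of writing the same columns with the standard csv module, which quotes such fields automatically. The function name write_listings and the sample row are hypothetical; only the header comes from the script.

```python
import csv
import io

# Column names copied from the header line the script writes.
HEADER = ['地区', '详细地址', '联系电话', '房屋状态', '价格', '房屋类型']

def write_listings(f, rows):
    """Write the header and one CSV row per listing to the open file f."""
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow(row)

# Hypothetical sample row; the address contains a comma on purpose.
sample = [('District A', 'No. 1 Example Rd, Bldg 2', '000-0000',
           'On sale', '8000', 'Residential')]

buf = io.StringIO()  # stands in for the real output file
write_listings(buf, sample)
print(buf.getvalue())
```

Reading the output back with csv.reader returns the address as a single field, comma included.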

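房天下 scraping task: scrape the housing information from every page of the "new homes for sale" section

Scrape the region
Scrape the detailed address
Scrape the development's phone number
Scrape the development's price
Scrape the development's status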
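
Covering every page comes down to filling the page number into the URL template. Here is a small sketch, assuming (as the loop in the script does) that pages are numbered from 1 and that 100 of them exist; page_urls is a hypothetical helper name.

```python
# URL template taken from the script; {} is replaced by the page number.
URL_TEMPLATE = 'http://www.ljia.net/new/p-{}.html'

def page_urls(first=1, last=100):
    """Return the listing-page URLs for pages first..last, inclusive."""
    return [URL_TEMPLATE.format(n) for n in range(first, last + 1)]

urls = page_urls()
print(urls[0])
print(len(urls))
```

Building the list up front makes it easy to check that each request goes to a formatted URL rather than to the raw template.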

Libraries used

1. The requests library, used to send requests
2. The etree module (from lxml)
3. The re library

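This scraper uses XPath expressions. Personally I find them easier to use than the Beautiful Soup library: I have used Beautiful Soup before and found it somewhat difficult. That is just my opinion!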
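An XPath expression can be copied straight from the page's code and used as-is. For example, to scrape a listing's detailed address:
[Image]
In the page source (the browser's developer tools), select the data you want to scrape, right-click, and under Copy choose Copy XPath; the XPath is then copied and ready to paste.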
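
To illustrate how a copied path relates to the relative paths used in the script, here is a minimal sketch on a made-up HTML fragment. The markup is hypothetical and only loosely modelled on the selectors in the script; the real page may be structured differently.

```python
from lxml import etree

# Hypothetical markup: a container with id FyList whose second div holds the listings.
HTML = """
<div id="FyList">
  <div>filters</div>
  <div>
    <div><ul class="boxText"><li><span>[Area 1]</span> 1 Example Road</li></ul></div>
    <div><ul class="boxText"><li><span>[Area 2]</span> 2 Sample Street</li></ul></div>
  </div>
</div>
"""

tree = etree.HTML(HTML)

# A path of the kind "Copy XPath" produces: it points at one fixed element.
copied = tree.xpath('//*[@id="FyList"]/div[2]/div[1]/ul/li[1]/span/text()')

# Select every listing first, then evaluate a relative path (leading ".") per listing.
listings = tree.xpath("//*[@id='FyList']/div[2]/div")
areas = [d.xpath(".//ul[@class='boxText']/li[1]/span/text()")[0] for d in listings]

print(copied)
print(areas)
```

The copied path is a handy starting point; turning its tail into a relative path lets the same expression run once per listing.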
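Some of the scraped results come back as lists and contain non-text characters such as tabs and newlines; a regular expression can replace those characters with an empty string.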
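
A minimal sketch of that cleanup step, using made-up strings of the kind an XPath text() query tends to return. The helper name clean and the sample values are hypothetical.

```python
import re

def clean(text):
    """Remove tab, newline and carriage-return characters from a scraped string."""
    return re.sub(r'[\t\n\r]', '', text)

# Hypothetical raw results: whitespace-only pieces around the useful text.
raw = ['\n\t\t', '1 Example Road\n\t', '\r\n']

cleaned = [clean(s) for s in raw]
non_empty = [s for s in cleaned if s]  # drop the pieces that were only whitespace

print(cleaned)
print(non_empty)
```

Filtering out the empty strings afterwards avoids having to rely on a fixed index into the list.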
The code is commented; if anything is unclear, leave a comment and we can discuss it.
Scraping result:
[Image]
