Python web scraping: crawling Baletu (巴乐兔) rental listings with BeautifulSoup

Open Baletu's Shanghai rental listings page (巴乐兔上海) and note its URL.
Paging through the results shows:
Page 1 URL: 'http://sh.baletu.com/zhaofang/?entrance=14'
Page 2 URL: 'http://sh.baletu.com/zhaofang/p2o1a1/?seachId=0&is_rec_house=0&entrance=14&solr_house_cnt=28156'
Page 3 URL: 'http://sh.baletu.com/zhaofang/p3o1a1/?seachId=0&is_rec_house=0&entrance=14&solr_house_cnt=28159'
Only the path portion differs from page to page; the query string after it is generated automatically and can be omitted. So the URL for each page can be built with:

if page == 1:
    url = 'http://sh.baletu.com/zhaofang/?entrance=14'
else:
    url = 'http://sh.baletu.com/zhaofang/p'+str(page)+'o1a1/'
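
Wrapped in a small helper (the function name page_url is my own, not from the original post), this rule can be checked without touching the network:

```python
def page_url(page: int) -> str:
    """Build the listing URL for a given page, following the pattern above."""
    if page == 1:
        return 'http://sh.baletu.com/zhaofang/?entrance=14'
    return 'http://sh.baletu.com/zhaofang/p' + str(page) + 'o1a1/'
```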

Then fetch the page with requests and parse it with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# send a browser User-Agent so the site serves the normal page (value is an example)
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content.decode('utf-8'), 'html.parser')

Inspecting the page shows that all the information we need sits inside the div whose class is list-center.
Within that div, each li tag is one rental listing, so:

outer_div = soup.find('div', class_="list-center")
houses = outer_div.find_all('li', attrs={"class": "listUnit-date clearfix PBA_list_house"})
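
To see how find/find_all behave here, a minimal self-contained snippet; the HTML below is a simplified stand-in I wrote for illustration, not Baletu's actual markup:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real listing page
html = """
<div class="list-center">
  <ul>
    <li class="listUnit-date clearfix PBA_list_house" num="1001" price="3200"></li>
    <li class="listUnit-date clearfix PBA_list_house" num="1002" price="4500"></li>
    <li class="other"></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
outer_div = soup.find('div', class_="list-center")
# Only the <li> tags whose class attribute matches the full string are kept
houses = outer_div.find_all('li', attrs={"class": "listUnit-date clearfix PBA_list_house"})
print(len(houses))             # 2
print(houses[0].attrs['num'])  # '1001'
```

Note that searching class with a space-separated string matches the attribute as written in the HTML, which is exactly what we want here.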

Likewise, by analyzing the child tags under each li we can extract the listing name, district, community, and so on:

import re

houses_info = []
for house in houses:
    try:
        house_id = house.attrs['num']

        name = house.h3.a.attrs['title']

        # the title looks like "<community>-<district>…"; the district is two characters
        address = re.search(r"(.*?)-(.{2})", name)
        community = address.group(1)
        area = address.group(2)

        url = house.find('a', attrs={"target": "_blank"}).attrs['href']

        price = house.attrs['price']

        rent_type = house.attrs['variant']

        size_info = house.find('p', attrs={"class": "list-pic-ps"}).find("span", attrs={"class": False}).text
        size = re.search(r"(\d+)", size_info).group()

        traffic_1 = house.find("div", attrs={"class": "list-pic-ad"}).text

        # "距离N号线<station>M米" -> "N_<station>_M"; missing groups become ""
        traffic_2 = re.search(r"距离(?:(\d+)号线)?(.*?)(?:(\d+)米)", traffic_1)
        if traffic_2 is None:
            traffic_info = '__'
        else:
            traffic_info = '_'.join(g if g is not None else "" for g in traffic_2.groups())

        release_time = house.find("span", attrs={"class": "room-time"}).text.replace(" 发布", "")

        grade = house.find("span", attrs={"class": "lan-ratedetail"}).text

        comment = house.find("span", attrs={"class": "lan-rate-people"}).text
        comment_num = re.search(r"(\d+)", comment).group()

    except Exception:
        continue  # skip a listing with missing fields instead of appending stale values

    houses_info.append({"id": house_id, "name": name, "area": area, "community": community,
                        "url": url, "price": price, "rent_type": rent_type, "size": size,
                        "traffic_info": traffic_info, "release_time": release_time,
                        "grade": grade, "comment_num": comment_num})
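
The two regular expressions above can be exercised offline; the sample strings below are made-up examples in the same shape as the page text:

```python
import re

# Splitting "community-district" out of a listing title (made-up sample)
name = "康城小区-闵行 两室一厅"
m = re.search(r"(.*?)-(.{2})", name)
print(m.group(1), m.group(2))  # 康城小区 闵行

# Parsing "距离N号线<station>M米" from the traffic blurb (made-up sample)
traffic_1 = "距离9号线漕河泾开发区站500米"
m2 = re.search(r"距离(?:(\d+)号线)?(.*?)(?:(\d+)米)", traffic_1)
info = '_'.join(g or "" for g in m2.groups())
print(info)  # 9_漕河泾开发区站_500
```

The line number group is optional, so a blurb without "N号线" still matches and its slot collapses to an empty string between the underscores.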

Wrap the code above into a function, then define another one that saves the results to a CSV file:

def render_to_file(houses_info):
    with open("巴乐兔_上海.csv", "a", encoding='utf-8') as file:
        for house in houses_info:
            file.write("::".join(house.values()) + "\n")

Finally, loop over the page numbers and call these functions in turn to scrape every page of Baletu's rental listings.
