[Python Crawler] Scraping Housing Listings into Excel with BeautifulSoup + requests

Author: 王同学在这

Last modified: 2022-03-06 01:26:27

First published: 2022-03-05 17:02:10

Article link: https://blog.csdn.net/flyskymood/article/details/123295586

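Preface:
I am learning Python from scratch and set myself small tasks, because hands-on practice is the fastest way to learn to code. This project is one of those exercises, and I recommend that anyone learning Python write and practice as much as possible.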

Development environment

Python 3.8
PyCharm
Modules used
requests, BeautifulSoup, xlwt (saves to Excel), plus lxml as the parser BeautifulSoup uses in the full code

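Analyzing the page

First determine whether the page is dynamic or static, which decides whether to scrape its Ajax endpoints or request the page in the ordinary way. Open the URL, right-click and view the page source, then search the source for a keyword that is visible on the page. If the keyword appears in the source, the page is static.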
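The same check can be done in code once the page source has been downloaded. This is only a minimal sketch: `appears_in_source` is a hypothetical helper name, and both HTML strings are made up for illustration.

```python
def appears_in_source(html: str, keyword: str) -> bool:
    """Return True if text that is visible on the page is present in the raw HTML.

    If it is, the content was rendered on the server (static page);
    if not, it is probably loaded afterwards by JavaScript/Ajax.
    """
    return keyword in html


# Made-up page sources, not taken from the real site
static_html = '<ul class="sellListContent"><li><div class="title">Sunny 3-room flat</div></li></ul>'
dynamic_html = '<div id="app"></div><script src="bundle.js"></script>'

print(appears_in_source(static_html, "Sunny 3-room flat"))   # keyword found in the source
print(appears_in_source(dynamic_html, "Sunny 3-room flat"))  # keyword missing from the source
```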

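Now that we know the page is static, keep examining it for anything else we need. For example, turn the page and watch what changes: only the number after pg in the URL changes. So to walk through pages 1-100 we only need to vary that number.
/pg2/
/pg3/
/pg4/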
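The page URLs can be generated from that pattern. A minimal sketch, assuming a placeholder base URL (`https://example.com/ershoufang/` is not the real site) and a hypothetical helper name `page_urls`:

```python
from typing import List


def page_urls(base_url: str, first: int, last: int) -> List[str]:
    """Build the listing URLs for pages first..last (inclusive)."""
    return [f"{base_url}pg{page}/" for page in range(first, last + 1)]


BASE_URL = "https://example.com/ershoufang/"  # placeholder, replace with the real listing URL

urls = page_urls(BASE_URL, 1, 100)
print(urls[0])    # first page
print(urls[-1])   # last page
print(len(urls))  # number of pages
```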

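Goal

From each entry on the listing page, collect the highlighted fields and the link to its detail page, then open the detail page and extract the remaining information.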

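Approach

Since the page is static, I use requests + BeautifulSoup + xlwt to extract the data. First, send a request to the listing URL with spoofed request headers to get past basic anti-scraping checks. Once the request succeeds, locate the needed content in the page source and parse it with BeautifulSoup to obtain each detail page URL. Then request each detail page, extract the remaining information, and finally write everything to Excel with xlwt.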

Implementation

Step 1: send the request and return the page source.
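One detail worth checking in this step is how the headers are passed. In `requests.get(url, params=None, **kwargs)` the second positional argument is `params`, so the headers dict has to be passed as `headers=...`. The sketch below builds both requests without sending them, so the difference can be inspected offline; the URL is a placeholder and the User-Agent string is shortened.

```python
import requests

URL = "https://example.com/ershoufang/pg1/"  # placeholder URL, not the real site
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}

# Prepare the requests without sending anything over the network.
as_headers = requests.Request("GET", URL, headers=HEADERS).prepare()
as_params = requests.Request("GET", URL, params=HEADERS).prepare()

print(as_headers.headers["User-Agent"])  # the spoofed browser string is sent as a header
print(as_headers.url)                    # the URL is unchanged
print(as_params.url)                     # the User-Agent ends up in the query string instead
```

In a real run the equivalent call would be `requests.get(URL, headers=HEADERS, timeout=10)`.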
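Step 2: with the page source in hand, instantiate BeautifulSoup to filter it. Inspecting the page shows that every listing sits in its own li tag inside a ul tag.

So we first use BeautifulSoup to find all the <li> tags, then loop over them and extract each field by its tag attributes.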
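The loop over the li tags can be tried on a small HTML fragment first. This is a sketch, not the exact code of this post: the class names match those used in the full code, but the sample HTML, the `parse_listing` name, and the example.com links are made up, and it uses Python's built-in `html.parser` instead of lxml so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure of the listing page
SAMPLE_HTML = """
<ul class="sellListContent">
  <li>
    <div class="title"><a href="https://example.com/ershoufang/1001.html">Bright two-bedroom near the park</a></div>
    <div class="houseInfo">2 rooms | 89 m2 | South</div>
    <div class="positionInfo"><a>Garden Estate</a></div>
    <div class="totalPrice totalPrice2"><span>120</span></div>
    <div class="unitPrice"><span>13,483 per m2</span></div>
  </li>
  <li>
    <div class="title"><a href="https://example.com/ershoufang/1002.html">Quiet studio</a></div>
    <div class="houseInfo">1 room | 40 m2 | East</div>
    <div class="positionInfo"><a>River Court</a></div>
    <div class="totalPrice totalPrice2"><span>65</span></div>
    <div class="unitPrice"><span>16,250 per m2</span></div>
  </li>
</ul>
"""


def parse_listing(html):
    """Return one dict per <li> in the listing."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for li in soup.find(class_="sellListContent").find_all("li"):
        link = li.find(class_="title").find("a")
        records.append({
            "title": link.get_text(strip=True),
            "detail_url": link.get("href"),  # followed later to reach the detail page
            "total_price": li.find(class_="totalPrice").get_text(strip=True),
            "unit_price": li.find(class_="unitPrice").find("span").get_text(strip=True),
            "location": li.find(class_="positionInfo").find("a").get_text(strip=True),
            "house_info": li.find(class_="houseInfo").get_text(strip=True),
        })
    return records


records = parse_listing(SAMPLE_HTML)
for record in records:
    print(record["title"], record["detail_url"])
```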

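With the detail page URL in hand, request the detail page, instantiate another BeautifulSoup object, and extract the information in the same way as above.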
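The full code reads the detail fields by their position in the list. As an alternative sketch, the fields can be looked up by their label text, which keeps working if an entry is missing or the order shifts. The labels are the ones used in the full code; the sample HTML and the `parse_base_info` name are assumptions for illustration, and `html.parser` stands in for lxml.

```python
from bs4 import BeautifulSoup

# Labels used in the full code: floor area, layout structure, building structure, renovation status
LABELS = ("建筑面积", "户型结构", "建筑结构", "装修情况")

# Made-up fragment; the real detail page may be laid out differently
DETAIL_HTML = """
<div class="base">
  <ul>
    <li>建筑面积89㎡</li>
    <li>户型结构平层</li>
    <li>建筑结构钢混结构</li>
  </ul>
</div>
"""


def parse_base_info(html):
    """Map each known label to the text that follows it in its <li>."""
    soup = BeautifulSoup(html, "html.parser")
    info = {}
    base = soup.find(class_="base")
    if base is None:
        return info
    for li in base.find_all("li"):
        text = li.get_text(strip=True)
        for label in LABELS:
            if text.startswith(label):
                info[label] = text[len(label):]
    return info


info = parse_base_info(DETAIL_HTML)
print(info.get("建筑面积"))
print(info.get("装修情况", "NOT"))  # falls back to 'NOT' when the field is absent
```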
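Step 3: once the detail page has been parsed, call two functions and pass them the extracted values, one to store the information and one to download the listing's cover image. The first writes the information to Excel with xlwt.

The second downloads the cover image.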
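The Excel-writing part of step 3 can be sketched on its own. Note that xlwt writes the legacy .xls format, so the output file should be named with an .xls extension. The column names, rows, and the `build_workbook` helper below are made up for illustration; the workbook is saved to an in-memory buffer here only so the sketch leaves no file behind.

```python
import io

import xlwt

# Made-up columns and rows for illustration
COLUMNS = ["title", "total_price", "unit_price"]
ROWS = [
    ("Bright two-bedroom near the park", "120", "13,483"),
    ("Quiet studio", "65", "16,250"),
]


def build_workbook(columns, rows):
    """Create a workbook with a header row followed by one row per record."""
    book = xlwt.Workbook(encoding="utf-8")
    sheet = book.add_sheet("listings")
    for col, name in enumerate(columns):
        sheet.write(0, col, name)
    for row_index, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            sheet.write(row_index, col, value)
    return book


book = build_workbook(COLUMNS, ROWS)

buffer = io.BytesIO()
book.save(buffer)  # save() accepts a file name or a writable stream
print(len(buffer.getvalue()) > 0)
```

In a real run, `book.save("listings.xls")` would write the file to disk once, after all rows have been added.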

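Pagination: loop over the page numbers and insert each one into the URL.

Print the results to check.

That completes the work. To scrape another city, just switch the URL.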

Result

[Screenshot of the result]

Full code

# @Author : 王同学
import requests
import xlwt
from bs4 import BeautifulSoup
import threading


def get_content(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    try:
        # Pass headers by keyword: the second positional argument of requests.get is params
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        print('Unexpected status code:', response.status_code, url)
    except requests.RequestException as e:
        print(e)
    return None


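# Workbook shared by the whole run; row 0 holds the column headers
car_data = xlwt.Workbook(encoding='utf-8', style_compression=0)

sheet = car_data.add_sheet('二手房', cell_overwrite_ok=True)  # sheet name: "second-hand housing"
# Headers: title, price, price per m², info, district, condition, structure, area, renovation status
column_names = ['标题', '价格', '每平米', '信息', '地区', '情况', '结构', '面积', '装修状况']
for col, name in enumerate(column_names):
    sheet.write(0, col, name)
n = 1  # index of the next row to write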


def get_data(response):
    """Parse one listing page, visit each detail page, and save every record."""
    if not response:  # get_content returns None when the request failed
        return
    # Instantiate the parser
    soup = BeautifulSoup(response, 'lxml')
    all_data = soup.find(class_="sellListContent").find_all('li')   # every <li> in the listing
    print('Listings found on this page:', len(all_data))
    for i in all_data:  # loop over each <li>
        title = i.find(class_="title").find('a').string
        price = i.find(class_="totalPrice totalPrice2").text
        size = i.find(class_="unitPrice").find('span').text
        location = i.find(class_="positionInfo").find('a').text
        host = i.find(class_="houseInfo").text.replace('|', '')
        detail_url = i.find('a').get('href')    # link to the detail page

        # Request the detail page with the same headers as the listing page
        detail_html = get_content(detail_url)
        if not detail_html:
            continue
        html = BeautifulSoup(detail_html, 'lxml')
        for it in html.find_all(class_="base"):
            items = it.find_all('li')
            house_type = items[3].text.replace('户型结构', '')   # layout structure
            built_up = items[2].text.replace('建筑面积', '')     # floor area
            structure = items[7].text.replace('建筑结构', '')    # building structure
            try:
                renovation = items[8].text.replace('装修情况', '')  # renovation status
            except IndexError:
                renovation = 'NOT'
            save_csv(title, price, size, host, location, structure, house_type, built_up, renovation)



def save_csv(title, price, size, host, location, structure, house_type, built_up, renovation):
    """Write one record to the next free row, then save the workbook."""
    global n
    row = (title, price, size, host, location, structure, house_type, built_up, renovation)
    for col, value in enumerate(row):
        sheet.write(n, col, value)
    n = n + 1
    print('Saved to Excel ===>>', title)
    # xlwt writes the legacy .xls format, so the file must not be named .xlsx
    car_data.save('二手房.xls')




def main():
    for i in range(1, 11):   # loop over pages 1-10
        url = f'https://bh..com/ershoufang/pg{i}/'
        print(f'=========================== Scraping page {i} ===========================================')
        response = get_content(url)
        get_data(response)



if __name__ == '__main__':
    thread = threading.Thread(target=main)
    thread.start()