Python Crawler Practice 3: Scraping Second-Hand Housing Listings with Python

VIXCITY

Category: python  Tags: python, crawler, xpath

Article link: https://blog.csdn.net/Vixcity/article/details/109293405

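Scraping Second-Hand Housing Listings with a Crawler

Preface
Step 1: Analyze the data structure
Step 2: Write the code
- 1. Import the libraries
- 2. Spoof the User-Agent
Step 3: Fetch the data with three different libraries
Source code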

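Preface

Target site: 58.com (58同城) second-hand housing listings

I have been learning web scraping for a while and have picked up requests along with a few HTML parsing libraries. In this post I bring together several of the libraries I have tried.

The steps below walk through writing the code.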

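Step 1: Analyze the data structure

First, let's take a look at the target page.

Press F12 and open the Elements panel. The data we need is inside the ul with class house-list-wrap.

Within it, each li.sendsoj holds a div.list-info, which holds an h2.title, and the listing name is the text of the a tag inside that heading.

Similarly, the price we need is in the b tag inside p.sum.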
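
To make the nesting concrete, here is a minimal sketch that runs one path expression over a small hand-written fragment. The fragment and its listing titles are made up for illustration (the real page is far larger and may have changed), and the sketch uses the standard library's xml.etree.ElementTree, which understands only a subset of XPath and needs well-formed markup, rather than the libraries used later in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing markup described above.
SAMPLE = """
<body>
  <ul class="house-list-wrap">
    <li class="sendsoj">
      <div class="list-info">
        <h2 class="title"><a href="#1">Listing A</a></h2>
      </div>
    </li>
    <li class="sendsoj">
      <div class="list-info">
        <h2 class="title"><a href="#2">Listing B</a></h2>
      </div>
    </li>
  </ul>
</body>
"""

root = ET.fromstring(SAMPLE.strip())
# ul.house-list-wrap -> h2.title at any depth below it -> its direct child a
links = root.findall(".//ul[@class='house-list-wrap']//h2[@class='title']/a")
titles = [a.text for a in links]
print(titles)
```

The intermediate li and div levels do not have to be spelled out, because the double slash matches descendants at any depth.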

Step 2: Write the code

1. Import the libraries

import requests
from lxml import etree
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
from fake_useragent import UserAgent

2. Spoof the User-Agent

ua = UserAgent()

url = 'https://hz.58.com/xihuqu/ershoufang/?utm_source=sem-sales-baidu-pc&spm=62851881867.16537920592&utm_campaign=sell&utm_medium=cpc&showpjs=pc_fg&PGTID=0d30000c-0004-f0a2-62bf-52886ad31056&ClickID=1'
headers = {
    "User-Agent": ua.chrome
}
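
fake_useragent can be missing, or can fail to load its browser data, in which case the lines above raise before any request is sent. One possible way to guard against that is sketched here. The helper name build_headers and the fallback string are assumptions made for illustration; they are not part of the original code, and the site does not require this exact string.

```python
# A fixed User-Agent used only if fake_useragent is unavailable.
# The exact string is an illustrative assumption.
FALLBACK_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers():
    """Return request headers with a Chrome-like User-Agent (hypothetical helper)."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().chrome
    except Exception:  # package not installed, or its browser data failed to load
        user_agent = FALLBACK_UA
    return {"User-Agent": user_agent}


headers = build_headers()
```

The resulting headers dictionary has the same shape as the one above, so it can be passed to requests.get unchanged.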

Step 3: Fetch the data with three different libraries

1: XPath

The code is as follows:

def Xpath():
    # Request the page and take the response body as text
    respon = requests.get(url=url, headers=headers).text
    # Parse the HTML into an element tree
    tree = etree.HTML(respon)
    # XPath syntax: '//' selects matching descendants at any depth (a leading '//' searches the whole document),
    # '/' selects direct children, and '[...]' is a predicate that filters nodes, here by attribute value
    # A fuller reference: https://www.w3school.com.cn/xpath/index.asp
    # Select the a tags holding the listing names and the b tags holding the prices
    name = tree.xpath('//ul[@class="house-list-wrap"]//h2[@class="title"]/a')
    price = tree.xpath('//p[@class="sum"]/b')
    # Open the file once, outside the loop, so it is not reopened on every iteration;
    # the with block also closes it if an error occurs
    with open('58二手房源和价格.txt', 'a', encoding='utf-8') as f:
        for index, i in enumerate(name):
            # xpath() returns a list, so take the first text node: i.xpath('./text()')[0]
            homeName = i.xpath('./text()')[0]
            howMuch = price[index].xpath('./text()')[0]
            # '名称' means name and '价格' means price
            f.write('名称:' + homeName + '\t')
            f.write('价格:' + howMuch + '\n')
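
The loop above pairs names and prices by position, so a single listing without a price would either raise an IndexError or shift every later price onto the wrong name. A minimal sketch of one alternative is to walk the listings one li at a time and look up both fields inside it. This assumes the p.sum price block sits inside the same li as the title, which this post does not confirm. The fragment is made up, and the sketch uses the standard library's xml.etree.ElementTree so it runs without a network request; with lxml the same relative paths should work through li.xpath(...).

```python
import xml.etree.ElementTree as ET

# Made-up fragment: the second listing has no price on purpose.
SAMPLE = """
<ul class="house-list-wrap">
  <li class="sendsoj">
    <h2 class="title"><a>Listing A</a></h2>
    <p class="sum"><b>300</b></p>
  </li>
  <li class="sendsoj">
    <h2 class="title"><a>Listing B</a></h2>
  </li>
</ul>
"""

root = ET.fromstring(SAMPLE.strip())
rows = []
for li in root.findall("./li"):
    name = li.find(".//h2[@class='title']/a")
    price = li.find(".//p[@class='sum']/b")
    if name is None:
        continue  # not a listing entry, skip it
    # Keep the row even when the price is missing, with an empty price field
    rows.append((name.text, price.text if price is not None else ""))

print(rows)
```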

2: PyQuery

The code is as follows:

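def Pyquery():
    # pq(url=url) can fetch the page by itself, but called that way it does not send our headers,
    # so fetch with requests and hand the HTML text to pyquery instead
    html = pq(requests.get(url=url, headers=headers).text)
    # pyquery uses CSS selectors:
    # '.' matches a class name, a space matches descendants, '>' matches direct children
    # Select the listing-name elements and the price elements
    names = html('ul.house-list-wrap h2.title a')
    prices = html('p.sum b')
    # Open the file once; the with block closes it when done
    with open('58二手房源和价格.txt', 'a', encoding='utf-8') as f:
        for index, name in enumerate(names):
            # Iterating a PyQuery object yields the underlying elements, so read their text with .text
            # Write each name and price to the file ('名称' means name, '价格' means price)
            f.write('名称:' + name.text + '\t')
            f.write('价格:' + prices[index].text + '\n')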
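
One detail worth knowing about name.text: iterating a PyQuery object yields lxml elements, and an element's .text holds only the text before its first child tag. If a title link ever wraps part of its text in a nested tag, .text can come back as None and the string concatenation fails. lxml elements follow the ElementTree API, so the behaviour can be illustrated with the standard library alone. The markup here is a made-up example, not taken from the target page.

```python
import xml.etree.ElementTree as ET

# Hypothetical title link whose text starts with a nested tag.
a = ET.fromstring("<a><em>Sunny</em> two-bedroom flat</a>")

first_text = a.text                # only the text before the first child element
full_text = "".join(a.itertext())  # all text inside the element, in document order

print(repr(first_text))
print(full_text)
```

If titles like this turn up, joining itertext() is one way to collect the whole string.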

3: BeautifulSoup

The code is as follows:

def BeautifulSoups():
    # For BeautifulSoup usage in detail, see my two previous posts; I won't repeat it here
    respon = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(respon, 'lxml')
    # select() takes the same CSS selectors used in the pyquery version
    names = soup.select('ul.house-list-wrap h2.title a')
    prices = soup.select('p.sum b')
    with open('58二手房源和价格.txt', 'a', encoding='utf-8') as f:
        for index, name in enumerate(names):
            # '名称' means name and '价格' means price
            f.write('名称:' + name.text + '\t')
            f.write('价格:' + prices[index].text + '\n')
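
All three functions end with the same file-writing loop. As a rough sketch, that part could be pulled into one helper that takes the extracted pairs and appends them to a file. The name save_listings and the CSV output format are assumptions made for illustration; the original code writes tab-separated text instead.

```python
import csv


def save_listings(rows, path):
    """Append (name, price) pairs to a CSV file (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)
```

A call such as save_listings([("Listing A", "300")], "listings.csv") would append one row. Using the csv module means names containing commas or quotes are escaped automatically.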

Without further ado, here is the full source code.

Source code

import requests
from lxml import etree
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
from fake_useragent import UserAgent

ua = UserAgent()

url = 'https://hz.58.com/xihuqu/ershoufang/?utm_source=sem-sales-baidu-pc&spm=62851881867.16537920592&utm_campaign=sell&utm_medium=cpc&showpjs=pc_fg&PGTID=0d30000c-0004-f0a2-62bf-52886ad31056&ClickID=1'
headers = {
    "User-Agent": ua.chrome
}


def Xpath():
    respon = requests.get(url=url, headers=headers).text
    tree = etree.HTML(respon)
    name = tree.xpath('//ul[@class="house-list-wrap"]//h2[@class="title"]/a')
    price = tree.xpath('//p[@class="sum"]/b')
    with open('58二手房源和价格.txt', 'a', encoding='utf-8') as f:
        for index, i in enumerate(name):
            homeName = i.xpath('./text()')[0]
            howMuch = price[index].xpath('./text()')[0]
            f.write('名称:' + homeName + '\t')
            f.write('价格:' + howMuch + '\n')


def Pyquery():
    # Fetch with requests so the User-Agent header is actually sent
    html = pq(requests.get(url=url, headers=headers).text)
    names = html('ul.house-list-wrap h2.title a')
    prices = html('p.sum b')
    with open('58二手房源和价格.txt', 'a', encoding='utf-8') as f:
        for index, name in enumerate(names):
            f.write('名称:' + name.text + '\t')
            f.write('价格:' + prices[index].text + '\n')


def BeautifulSoups():
    respon = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(respon, 'lxml')
    names = soup.select('ul.house-list-wrap h2.title a')
    prices = soup.select('p.sum b')
    with open('58二手房源和价格.txt', 'a', encoding='utf-8') as f:
        for index, name in enumerate(names):
            f.write('名称:' + name.text + '\t')
            f.write('价格:' + prices[index].text + '\n')


def main():
    # Xpath()
    # Pyquery()
    BeautifulSoups()


if __name__ == "__main__":
    main()

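If you are interested, have a look at my two previous posts:
Python Crawler Practice 1: Scraping Funny Images from Qiushibaike with Python
Python Crawler Practice 2: Scraping Novels with Python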
