Focused Crawlers: Regex, bs4, and XPath

一个小猴子｀

Category: Other. Tags: crawler, regex, xpath, bs4

Article link: https://blog.csdn.net/m0_50127633/article/details/113796772

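Focused crawler: a crawler that extracts only specified content from a page.
Types of data parsing:

regex
bs4
xpath
How data parsing works:

The text to be extracted is stored either between a tag's opening and closing tags or in one of the tag's attributes.
1) Locate the target tag
2) Extract the data stored in the tag or in its attributes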
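
The two steps above can be tried on a small string before touching a real site. This is a minimal sketch, assuming a made-up HTML snippet (the tag names and values are hypothetical); it shows one value stored between tags and one stored in an attribute.

```python
import re

# Hypothetical sample page; the tags and values are made up for illustration
html = '''
<div class="post">
  <h2 class="title">First post</h2>
  <img src="/images/a.jpg" alt="picture A">
</div>
'''

# Step 1: locate the tag. Step 2: capture the value stored between the tags.
title = re.findall(r'<h2 class="title">(.*?)</h2>', html, re.S)

# Same two steps, but here the value lives in an attribute of the located tag.
img_src = re.findall(r'<img src="(.*?)"', html, re.S)

print(title)    # ['First post']
print(img_src)  # ['/images/a.jpg']
```

The non-greedy `.*?` stops at the first closing delimiter, so each match covers a single tag rather than everything up to the last one on the page.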

Regex parsing example: scraping images from Qiushibaike (糗图)

import os
import re

import requests

if __name__ == '__main__':
    # Create a folder for the downloaded images
    os.makedirs('./qiutu_imgs', exist_ok=True)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'
    }
    for page in range(1, 5):
        url = 'https://www.qiushibaike.com/imgrank/page/' + str(page) + '/'
        # General crawl: fetch the whole page behind the URL
        page_text = requests.get(url=url, headers=headers).text
        # Focused crawl: pull every image URL out of the page
        ex = '<img src="(//pic.qiushibaike.com/system/pictures.*?)"'
        img_src_list = re.findall(ex, page_text, re.S)

        for src in img_src_list:
            # The captured URL is protocol-relative, so prepend the scheme
            src = 'https:' + src
            # Request the image as binary data
            img_data = requests.get(url=src, headers=headers).content
            # Use the last path segment as the file name
            img_name = src.split('/')[-1]
            img_path = './qiutu_imgs/' + img_name

            with open(img_path, 'wb') as fp:
                fp.write(img_data)
            print(img_name + ' downloaded successfully!')

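bs4 parsing

How data parsing works

1) Locate the tag
2) Extract the data stored in the tag or in its attributes
How bs4 parsing works

1) Instantiate a BeautifulSoup object and load the page source into it
2) Call the object's attributes and methods to locate tags and extract data
Environment setup

1) pip install bs4
2) pip install lxml
How to instantiate a BeautifulSoup object: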

from bs4 import BeautifulSoup

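Instantiating the object

1) Load the contents of a local HTML file into the object
fp = open('./test.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
2) Load page source fetched from the internet into the object
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')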
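
Both loading paths can be tried without a network connection. This is a minimal sketch, assuming a made-up HTML string in place of response.text and a temporary file in place of ./test.html. It swaps in Python's built-in 'html.parser' for the 'lxml' parser used in this article, only so that it runs without lxml installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page source, standing in for response.text
page_text = '<html><head><title>Demo page</title></head><body><p>hello</p></body></html>'

# 1) Load from a local file (a temporary file stands in for ./test.html)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.html')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    with open(path, 'r', encoding='utf-8') as fp:
        soup_from_file = BeautifulSoup(fp, 'html.parser')

# 2) Load from a string, e.g. page source fetched with requests
soup_from_text = BeautifulSoup(page_text, 'html.parser')

print(soup_from_file.title.string)  # Demo page
print(soup_from_text.p.string)      # hello
```

BeautifulSoup reads the whole file when the object is constructed, so the file handle can be closed right afterwards.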

Methods and attributes provided for data parsing
[Image from the original post not available]
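
Since the figure is not available here, the following sketch walks through commonly used BeautifulSoup calls for locating tags and extracting data. It is not necessarily the exact list the figure showed. The HTML snippet is made up, and 'html.parser' is used instead of 'lxml' only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '''
<div class="song">
  <p>first</p>
  <p>second</p>
  <a href="https://example.com/a" title="link A">text <span>inner</span></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Locating tags
first_p = soup.p                            # soup.tagName: first matching tag
same_p = soup.find('p')                     # find(): also the first match
song_div = soup.find('div', class_='song')  # find() filtered by attribute
all_p = soup.find_all('p')                  # find_all(): list of every match
links = soup.select('.song > a')            # select(): CSS selector, returns a list

# Extracting data
a = links[0]
print(first_p.string)  # first
print(len(all_p))      # 2
print(a['href'])       # https://example.com/a
print(a.string)        # None, because <a> has more than one child node
print(a.get_text())    # text inner
```

The `.string` attribute only returns text when the tag has a single child, while `.text` and `get_text()` concatenate all text nested inside the tag.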

bs4 example:

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text
    # Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    # Each <li> in the table of contents holds one chapter link
    li_list = soup.select('.book-mulu > ul > li')
    # The with block closes the output file even if a request fails
    with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.string
            detail_url = 'https://www.shicimingju.com' + li.a['href']
            # Fetch the chapter page and parse out the chapter text
            content_page = requests.get(url=detail_url, headers=headers).text
            detail_soup = BeautifulSoup(content_page, 'lxml')
            content_data = detail_soup.find('div', class_='chapter_content')
            content = content_data.text
            fp.write(title + ':' + content + '\n')
            print(title + ' scraped successfully')

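XPath parsing

How XPath parsing works

1) Instantiate an etree object and load the page source to be parsed into it.
2) Call the etree object's xpath method with an XPath expression to locate tags and capture content.
Environment setup

pip install lxml
How to instantiate an etree object: from lxml import etree

1) Load the source of a local HTML file into an etree object
etree.parse(file_path)
(etree.parse uses the XML parser by default; pass etree.HTMLParser() as the second argument for HTML that is not well-formed XML)
2) Load source fetched from the internet into the object
etree.HTML(page_text)
3) xpath('xpath expression')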
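
Both ways of building the tree can be checked offline. This is a minimal sketch, assuming a made-up HTML string and a temporary file standing in for a local HTML document; the file is parsed with etree.HTMLParser() because typical HTML is not valid XML.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page source; <br> has no closing tag, as is common in HTML
page_text = '<html><head><title>Demo page</title></head><body><p>hello<br>world</p></body></html>'

# From a string, e.g. page source fetched with requests
tree_from_text = etree.HTML(page_text)

# From a local file (a temporary file stands in for a saved page)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.html')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    # The default XML parser would reject the unclosed <br>
    tree_from_file = etree.parse(path, etree.HTMLParser())

print(tree_from_text.xpath('/html/head/title/text()'))  # ['Demo page']
print(tree_from_file.xpath('//p/text()'))               # ['hello', 'world']
```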
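XPath expressions

1) /: at the start of an expression it locates from the root node; elsewhere each / steps down one level.
a = tree.xpath('/html/head/title')
b = tree.xpath('/html/body/div')
2) //: spans any number of levels; at the start of an expression it locates from anywhere in the document.
c = tree.xpath('/html//div')
d = tree.xpath('//div')
3) Attribute locating: tag[@attribute_name="attribute_value"]
e = tree.xpath('//div[@class="song"]')
4) Index locating: indexes start at 1
f = tree.xpath('//div[@class="song"]/p[3]')
5) Getting text
/text() returns only the text directly inside the tag
//text() returns all text inside the tag, including text in nested tags
g = tree.xpath('//div[@class="tang"]/ul/li[5]/a/text()')
or
h = tree.xpath('//div[@class="tang"]//li[5]/a/text()')
6) Getting an attribute
i = tree.xpath('//div[@class="song"]/img/@src')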
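
The expressions above can be run against a small document to see what each one returns. This is a minimal sketch, assuming made-up markup that reuses the "song" and "tang" class names from the list; the contents are hypothetical.

```python
from lxml import etree

# Hypothetical markup for illustration
html = '''
<html><head><title>Demo</title></head>
<body>
  <div class="song">
    <p>one</p><p>two</p><p>three</p>
    <img src="/img/cover.jpg" alt="cover">
  </div>
  <div class="tang">
    <ul>
      <li><a href="/a">first <b>bold</b></a></li>
      <li><a href="/b">second</a></li>
    </ul>
  </div>
</body></html>
'''
tree = etree.HTML(html)

# Locate from the root, one level per /
title = tree.xpath('/html/head/title/text()')
# Locate from anywhere in the document
divs = tree.xpath('//div')
# Attribute locating plus index locating (indexes start at 1)
third_p = tree.xpath('//div[@class="song"]/p[3]/text()')
# /text(): only the text directly inside <a>
direct = tree.xpath('//div[@class="tang"]//li[1]/a/text()')
# //text(): all text inside <a>, including the nested <b>
all_text = tree.xpath('//div[@class="tang"]//li[1]/a//text()')
# Attribute value
src = tree.xpath('//div[@class="song"]/img/@src')

print(title)      # ['Demo']
print(len(divs))  # 2
print(third_p)    # ['three']
print(direct)     # ['first ']
print(all_text)   # ['first ', 'bold']
print(src)        # ['/img/cover.jpg']
```

xpath() always returns a list, even for a single match, which is why the examples in this article index the result with [0] when one value is expected.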

XPath examples
Example 1: second-hand housing listings on 58.com

import requests
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'
    }
    url = 'https://bj.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    # Parse the page and extract the listing titles
    tree = etree.HTML(page_text)
    title_list = tree.xpath('//div[@class="property-content-title"]/h3/text()')
    for title in title_list:
        print(title)
Example 2: scraping 4K images

import os

import requests
from lxml import etree

if __name__ == '__main__':
    # Create a folder for the downloaded images
    os.makedirs('./4kfengjing_imgs', exist_ok=True)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'
    }
    url = 'http://pic.netbian.com/4kfengjing/'
    page_text = requests.get(url=url, headers=headers).text

    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        # encode('iso-8859-1').decode('gbk') is a general fix for garbled Chinese:
        # it undoes the wrong ISO-8859-1 decode, then decodes the bytes as GBK
        img_name = (li.xpath('./a/img/@alt')[0] + '.jpg').encode('iso-8859-1').decode('gbk')
        # Images are binary data, so use .content
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = './4kfengjing_imgs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name + ' saved successfully!')
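
The encode('iso-8859-1').decode('gbk') step in the example above can be reproduced without any network access. This is a minimal sketch, assuming a hypothetical image name "风景" that a server sent as GBK bytes and a client decoded with the wrong charset.

```python
# Bytes a GBK-encoded page would send for the name "风景" (hypothetical sample)
raw_bytes = '风景'.encode('gbk')

# Decoding with the wrong charset yields garbled text but loses no bytes,
# because ISO-8859-1 maps every byte value to exactly one character
garbled = raw_bytes.decode('iso-8859-1')

# Reverse the wrong decode, then decode with the correct charset
fixed = garbled.encode('iso-8859-1').decode('gbk')

print(fixed)             # 风景
print(garbled != fixed)  # True
```

An alternative that may be simpler when the whole page uses one charset is to set the encoding attribute of the requests response to 'gbk' before reading .text, so that every extracted string is already decoded correctly.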
