Extracting data with regular expressions, XPath, and bs4

Latest recommended article published on 2022-11-29 09:05:45

M996246017


Article link: https://blog.csdn.net/M996246017/article/details/81280390


import requests
import re
from lxml import etree
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random  # random User-Agent string supplied by fake_useragent
}



response1 = requests.get('https://www.meishij.net/', headers=headers).text  # 1. Fetch the page source as text
print(type(response1))
# Output: <class 'str'>

# (1) Regular expressions work directly on a string, and a single match can yield several values (one per capture group)
a = re.compile(r'<div class="listtyle1">.*?<a.*?>.*?<img class="img".*?alt="(.*?)".*?src=".*?">.*?</a>', re.S)
re_list = re.findall(a, response1)
print(re_list)  # With one capture group, findall returns a list of strings
# Example output: ['番茄丝瓜蛋汤', '豆豉煎辣椒', '傣味酸笋炒木耳']
a = re.compile(r'<div class="listtyle1">.*?<a.*?>.*?<img class="img".*?alt="(.*?)".*?src="(.*?)">.*?</a>', re.S)
re_list = re.findall(a, response1)
print(re_list)  # With several capture groups, findall returns a list of tuples
# Example output: [('番茄丝瓜蛋汤', 'https://s1.st.meishij.net/r/216/197/6174466/a6174466_153248311593816.jpg'), ('豆豉煎辣椒', 'https://s1.st.meishij.net/r/216/197/6174466/a6174466_153257066516315.jpg')]

for value in re_list:
    name = value[0]  # first capture group: the alt text
    href = value[1]  # second capture group: the image URL
    print(name)
    # Example output (one name per iteration):
    #   番茄丝瓜蛋汤
    #   豆豉煎辣椒
    #   傣味酸笋炒木耳
    print(href)
    # Example output (one URL per iteration):
    #   https://s1.st.meishij.net/r/216/197/6174466/a6174466_153248311593816.jpg
    #   https://s1.st.meishij.net/r/216/197/6174466/a6174466_153257066516315.jpg
    #   https://s1.st.meishij.net/r/72/43/4073322/a4073322_153233753656428.jpg
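
The difference between one and several capture groups can be checked without touching the network. The following is a minimal sketch, assuming a made-up HTML fragment (`SAMPLE_HTML`, with invented names and example.com URLs) that only imitates the structure of the real page; the patterns are shortened versions of the ones above, not the original ones.

```python
import re

# Hypothetical markup imitating the page structure; not taken from the real site.
SAMPLE_HTML = """
<div class="listtyle1">
  <a href="/recipe/1"><img class="img" alt="Tomato soup" src="https://example.com/a.jpg"></a>
</div>
<div class="listtyle1">
  <a href="/recipe/2"><img class="img" alt="Fried peppers" src="https://example.com/b.jpg"></a>
</div>
"""

# One capture group: findall returns a flat list of strings.
ONE_GROUP = re.compile(r'<img class="img".*?alt="(.*?)"', re.S)
# Two capture groups: findall returns a list of (alt, src) tuples.
TWO_GROUPS = re.compile(r'<img class="img".*?alt="(.*?)".*?src="(.*?)"', re.S)

names = ONE_GROUP.findall(SAMPLE_HTML)
pairs = TWO_GROUPS.findall(SAMPLE_HTML)

print(names)
for name, src in pairs:  # tuple unpacking replaces value[0] / value[1]
    print(name, src)
```

Calling `findall` on the compiled pattern is equivalent to `re.findall(pattern, text)`; unpacking each tuple in the `for` statement simply avoids indexing by position.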

# (2) XPath: before calling xpath(), parse the string with etree
result = etree.HTML(response1)
print(type(result))  # etree.HTML turns the string into an element tree
# Output: <class 'lxml.etree._Element'>
Xpath_list = result.xpath('//img[@class="img"]')  # select every <img> tag whose class is "img"
print(Xpath_list)  # The result is a list, and its items are Element objects
# Example output: [<Element img at 0x340a508>, <Element img at 0x340a4e0>, <Element img at 0x340a9e0>]
for x in Xpath_list:  # iterate over the tags and pull out the wanted values
    alt = x.xpath('@alt')[0]  # xpath('@name') returns a list of attribute values; [0] takes the first
    src = x.xpath('@src')[0]
    print(alt)
    # Example output (one name per iteration):
    #   番茄丝瓜蛋汤
    #   豆豉煎辣椒
    #   傣味酸笋炒木耳
    print(src)
    # Example output (one URL per iteration):
    #   https://s1.st.meishij.net/r/216/197/6174466/a6174466_153248311593816.jpg
    #   https://s1.st.meishij.net/r/216/197/6174466/a6174466_153257066516315.jpg
    #   https://s1.st.meishij.net/r/72/43/4073322/a4073322_153233753656428.jpg
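
The XPath step can also return attribute values directly instead of elements. Below is a small sketch, assuming the same kind of made-up fragment (`SAMPLE_HTML` and its example.com URLs are hypothetical); it requires `lxml` to be installed.

```python
from lxml import etree

# Hypothetical markup imitating the page structure; not taken from the real site.
SAMPLE_HTML = """
<div class="listtyle1">
  <a href="/recipe/1"><img class="img" alt="Tomato soup" src="https://example.com/a.jpg"></a>
</div>
<div class="listtyle1">
  <a href="/recipe/2"><img class="img" alt="Fried peppers" src="https://example.com/b.jpg"></a>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# Ending the expression with /@alt selects the attribute values themselves,
# so the result is a list of strings rather than a list of elements.
alts = tree.xpath('//img[@class="img"]/@alt')

# Element.get() returns None when the attribute is missing,
# whereas x.xpath('@src')[0] would raise IndexError on an empty list.
records = [(img.get('alt'), img.get('src')) for img in tree.xpath('//img[@class="img"]')]

print(alts)
print(records)
```

Collecting both attributes from the same element keeps each name paired with its own URL even if some tag lacks one of the two attributes.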

# (3) bs4: before calling bs4 methods, parse the string with BeautifulSoup
soup = BeautifulSoup(response1, 'lxml')
print(type(soup))  # the string is parsed into a BeautifulSoup object
# Output: <class 'bs4.BeautifulSoup'>
src_list = soup.find_all(class_='img')
print(src_list)  # The result is a list-like ResultSet of Tag objects
# Example output:
# [<img alt="番茄丝瓜蛋汤" class="img" src="https://s1.st.meishij.net/r/216/197/6174466/a6174466_153248311593816.jpg"/>, <img alt="豆豉煎辣椒" class="img" src="https://s1.st.meishij.net/r/216/197/6174466/a6174466_153257066516315.jpg"/>, <img alt="傣味酸笋炒木耳" class="img" src="https://s1.st.meishij.net/r/72/43/4073322/a4073322_153233753656428.jpg"/>]
for value in src_list:  # iterate over every Tag in the list
    alt = value.get('alt')  # bs4 reads an attribute value with Tag.get()
    src = value.get('src')
    print(alt)
    # Example output (one name per iteration):
    #   番茄丝瓜蛋汤
    #   豆豉煎辣椒
    #   傣味酸笋炒木耳
    print(src)
    # Example output (one URL per iteration):
    #   https://s1.st.meishij.net/r/216/197/6174466/a6174466_153248311593816.jpg
    #   https://s1.st.meishij.net/r/216/197/6174466/a6174466_153257066516315.jpg
    #   https://s1.st.meishij.net/r/72/43/4073322/a4073322_153233753656428.jpg
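
For comparison, the same extraction can be written with a CSS selector. This is only a sketch under assumptions: `SAMPLE_HTML` is a made-up fragment with invented names and URLs, and it uses the standard-library `html.parser` backend instead of `lxml` so that it runs with `bs4` alone.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page structure; the last <img> deliberately has no src.
SAMPLE_HTML = """
<div class="listtyle1">
  <a href="/recipe/1"><img class="img" alt="Tomato soup" src="https://example.com/a.jpg"></a>
</div>
<div class="listtyle1">
  <a href="/recipe/2"><img class="img" alt="Fried peppers" src="https://example.com/b.jpg"></a>
</div>
<div class="listtyle1">
  <a href="/recipe/3"><img class="img" alt="No source"></a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# select() takes a CSS selector: <img class="img"> tags inside <div class="listtyle1">.
imgs = soup.select('div.listtyle1 img.img')

# Tag.get() returns None for a missing attribute; tag['src'] would raise KeyError instead.
records = [(img.get('alt'), img.get('src')) for img in imgs]

for alt, src in records:
    print(alt, src)
```

Unlike `find_all(class_='img')`, which matches any tag carrying that class, the selector here restricts the match to `<img>` tags inside the listing blocks.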
