03. Web Scraping Data Parsing: bs4 and XPath

奔向sj

Last modified: 2024-07-28 09:54:29

First published: 2024-07-27 20:03:20

Article link: https://blog.csdn.net/benxiangsj/article/details/140739069

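I. Parsing with bs4

Parsing with bs4 requires some knowledge of HTML, in particular familiarity with its common tags.

Install: pip install bs4 (installs the beautifulsoup4 package)

Import: from bs4 import BeautifulSoup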

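1. Usage

1. Pass the page source to BeautifulSoup, which produces a bs object

2. Look up data in the bs object

(1) find(tag, attribute=value): returns the first match

(2) find_all(tag, attribute=value): returns all matches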
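As a minimal sketch of the two lookup methods, the snippet below parses a small inline HTML string and compares find with find_all. The HTML and the class names are made up for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
html = """
<ul>
  <li class="veg">cabbage</li>
  <li class="veg">carrot</li>
  <li class="fruit">apple</li>
</ul>
"""

page = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None when nothing matches)
first_veg = page.find("li", attrs={"class": "veg"})

# find_all returns a list of every matching tag
all_veg = page.find_all("li", attrs={"class": "veg"})

print(first_veg.text)
print([li.text for li in all_veg])
```

Because find returns None when nothing matches, it is worth checking the result before chaining another call onto it.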

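2. Hands-on: getting vegetable prices from the Shanghai vegetable price site

1. Approach

(1) Fetch the page source

(2) Parse it with bs4 and extract the data

2. Demo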

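from bs4 import BeautifulSoup
import requests
import csv

# Fetch the page source
url = "http://www.shveg.com/cn/info/list.asp?ID=959"

reps = requests.post(url)
reps.encoding = "gb2312"
# newline="" stops the csv module from writing blank lines between rows on Windows
f = open("菜价.csv", mode="w", encoding="utf-8", newline="")
csvwriter = csv.writer(f)

# Parse the data
# 1. Pass the page source to BeautifulSoup, which produces a bs object.
# 2. Look up data in the bs object.
page = BeautifulSoup(reps.text, "html.parser")  # "html.parser" selects the HTML parser
table = page.find("td", attrs={"class": "intro_font"})
trs = table.find_all("tr")[1:]  # skip the first row
for tr in trs:
    tds = tr.find_all("td")
    name = tds[0].text
    # writerow expects a sequence of fields; a bare string would be split into characters
    csvwriter.writerow([name])
f.close()
print("over")
reps.close()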
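When writing one value per row with csv.writer, keep in mind that writerow expects a sequence of fields. The sketch below shows the difference between passing a bare string and passing a one-element list. The in-memory buffer and the sample value are assumptions made so the demo needs no file.

```python
import csv
import io

# An in-memory buffer stands in for the CSV file (assumption for the demo)
buffer = io.StringIO()
writer = csv.writer(buffer)

writer.writerow("abc")    # a bare string is iterated character by character
writer.writerow(["abc"])  # a one-element list yields a single field

lines = buffer.getvalue().splitlines()
print(lines)
```

The first row comes out as three separate fields, the second as one, which is usually what is wanted for a name column.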

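3. Hands-on: scraping images from Umei (优美图库)

(1) Goal: get the download URLs of the images on Umei

(2) Approach

a. Fetch the main page source and extract the child-page links (the href attribute)

b. Use each href to fetch the child page, then find the image download URL there (the src attribute)

c. Download the image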
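Steps a and b can be tried offline first. The sketch below extracts href values from a main-page fragment, resolves them into absolute URLs with urljoin, and derives a file name from an image src. The base URL, HTML fragments, and class name item_list are hypothetical and exist only for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and main-page fragment, for illustration only
base_url = "https://example.com/"
main_html = """
<div class="item_list">
  <a href="/pics/1.htm">one</a>
  <a href="pics/2.htm">two</a>
</div>
"""

main_page = BeautifulSoup(main_html, "html.parser")
links = main_page.find("div", attrs={"class": "item_list"}).find_all("a")

# Step a: turn each href into an absolute child-page URL
child_urls = [urljoin(base_url, a.get("href")) for a in links]

# Step b: on a child page, the image address is the src attribute of <img>
child_html = '<div class="big-pic"><img src="https://example.com/img/cat.jpg"></div>'
child_page = BeautifulSoup(child_html, "html.parser")
src = child_page.find("div", attrs={"class": "big-pic"}).find("img").get("src")

# The file name is whatever follows the last "/" in the URL
img_name = src.split("/")[-1]

print(child_urls)
print(src, img_name)
```

Using urljoin instead of plain string concatenation avoids a doubled slash when the href already starts with "/".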

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


url = "https://www.umei.cc/bizhitupian/weimeibizhi/"
resp = requests.get(url)
resp.encoding = "utf-8"


# Main page: collect the <a> tags that link to the child pages
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", attrs={"class": "item_list infinite_scroll"}).find_all("a")
for a in alist:
    # urljoin avoids a doubled slash when the href already starts with "/"
    href = urljoin("https://www.umei.cc/", a.get("href"))
    child_page_resp = requests.get(href)
    child_page_resp.encoding = "utf-8"
    child_main_page = BeautifulSoup(child_page_resp.text, "html.parser")
    img = child_main_page.find("div", attrs={"class": "big-pic"}).find("img")
    src = img.get("src")

    # Download the image
    img_resp = requests.get(src)
    # img_resp.content holds the raw bytes
    img_name = src.split("/")[-1]  # everything after the last "/" in the URL
    with open(img_name, mode="wb") as f:
        f.write(img_resp.content)  # write the image bytes to the file
    # Close both responses on every iteration, not only after the loop
    img_resp.close()
    child_page_resp.close()
    print("over")
resp.close()

II. Parsing with XPath

Install: pip install lxml

Import: from lxml import etree

1. Usage

tree = etree.parse(html_file) (for page source held in a string, use etree.HTML(text) instead)

result = tree.xpath("XPath expression")
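To make the usage concrete, here is a minimal sketch that parses an inline HTML string with etree.HTML and runs two XPath queries on it. The HTML and the class name news are made up for illustration.

```python
from lxml import etree

# Hypothetical HTML used only for illustration
html = """
<html><body>
  <div class="news"><p><a href="/a.htm">First title</a></p><b>2024-07-27</b></div>
  <div class="news"><p><a href="/b.htm">Second title</a></p><b>2024-07-28</b></div>
</body></html>
"""

# etree.HTML parses a string; etree.parse reads from a file instead
tree = etree.HTML(html)

# text() selects text nodes, @href selects attribute values
titles = tree.xpath('//div[@class="news"]/p/a/text()')
hrefs = tree.xpath('//div[@class="news"]/p/a/@href')

print(titles)
print(hrefs)
```

Both queries return lists, even when only one node matches.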

2. Hands-on: getting news items from the China Food website (中国食品网)

from lxml import etree
import requests

url = "http://food.china.com.cn/node_8003189.htm"
resp = requests.get(url)

# Parse the page
tree = etree.HTML(resp.text)
# Each matching div holds one news item
divs = tree.xpath('/html/body/div[2]/div[3]/div[1]/div[2]/div[@class="d3_back_list"]')

for div in divs:
    # A leading "./" makes the query relative to the current div; xpath() returns a list
    title = div.xpath("./p/a/text()")
    summary = div.xpath("./span/text()")
    time = div.xpath("./b/text()")
    print(title)
    print(summary)
    print(time)
resp.close()
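Since xpath() returns a list for every query, each printed value above is a list, and it is empty when a node is missing. One possible way to handle this is sketched below. The helper first_text, the HTML, and the class name item are hypothetical and not part of lxml or of the site.

```python
from lxml import etree

# Hypothetical HTML; the second item has no <span> summary
html = """
<div class="item"><p><a>Title A</a></p><span>Summary A</span><b>2024-07-27</b></div>
<div class="item"><p><a>Title B</a></p><b>2024-07-28</b></div>
"""

tree = etree.HTML(html)


def first_text(node, path, default=""):
    """Hypothetical helper: first text result of a relative XPath, or default."""
    result = node.xpath(path)
    return result[0].strip() if result else default


rows = []
for div in tree.xpath('//div[@class="item"]'):
    rows.append({
        "title": first_text(div, "./p/a/text()"),
        "summary": first_text(div, "./span/text()"),
        "time": first_text(div, "./b/text()"),
    })

print(rows)
```

Collecting plain strings this way keeps a missing field from raising an IndexError and makes the rows easy to write out later.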