Web Scraping: Bs4 and XPath

海伦•

Article link: https://blog.csdn.net/qq_42794545/article/details/120455820

Bs4

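1. Fetch the source of the main page and extract the sub-page links (href).

2. Request each sub-page through its href to find the image download address (img -> src).

3. Download the image.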
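The three steps above can be sketched offline as follows. This is a minimal sketch under stated assumptions: the HTML snippets, the `example.com` URLs, the `ImageBody` class, and the helper names `extract_child_links` and `extract_image_src` are hypothetical and stand in for real responses; only the `TypeList` class name comes from the code in this post. In a real run each string would come from `requests.get(...).text`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical listing page: each <a> points at a sub-page.
MAIN_HTML = """
<div class="TypeList">
  <a href="/pic/1.htm"><img src="/thumb/1.jpg"></a>
  <a href="/pic/2.htm"><img src="/thumb/2.jpg"></a>
</div>
"""

# Hypothetical sub-page: holds the full-size image.
CHILD_HTML = '<div class="ImageBody"><img src="https://img.example.com/full/1.jpg"></div>'


def extract_child_links(main_html, base_url):
    """Step 1: collect absolute sub-page URLs from the listing page."""
    soup = BeautifulSoup(main_html, "html.parser")
    container = soup.find("div", class_="TypeList")
    return [urljoin(base_url, a.get("href")) for a in container.find_all("a")]


def extract_image_src(child_html):
    """Step 2: read the image download address from a sub-page."""
    soup = BeautifulSoup(child_html, "html.parser")
    img = soup.find("img")
    return img.get("src") if img is not None else None


links = extract_child_links(MAIN_HTML, "https://www.example.com/list/")
image_url = extract_image_src(CHILD_HTML)
print(links)
print(image_url)
# Step 3 would be requests.get(image_url).content written to a file in "wb" mode.
```

Keeping the parsing in small functions that take HTML strings makes each step easy to check without touching the network.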

Code:

import os
import time

import requests
from bs4 import BeautifulSoup

url = "https://www.umei.cc/bizhitupian/weimeibizhi/"
resp = requests.get(url)
resp.encoding = 'utf-8'
page_source = resp.text  # read the body before releasing the connection
resp.close()
# print(page_source)

# Hand the page source to Bs4
main_page = BeautifulSoup(page_source, "html.parser")
img_list = main_page.find("div", class_="TypeList").find_all("img")
print(img_list)

os.makedirs("img", exist_ok=True)  # the output folder must exist before writing
for img in img_list:
    src = img.get('src')
    print(src)
    # Download the image
    img_resp = requests.get(src)
    # img_resp.content holds the raw bytes
    img_name = src.split("/")[-1]  # everything after the last "/" in the URL
    with open("img/" + img_name, mode="wb") as f:
        f.write(img_resp.content)  # write the image bytes to the file

    print("over!", img_name)
    time.sleep(1)

print("all over!!")

Output:

Images:
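The file name in the code above comes from `src.split("/")[-1]`, which assumes an absolute URL with no query string. A slightly more defensive sketch that uses only the standard library is shown here; the helper name `image_filename` and the sample URLs are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def image_filename(page_url, src):
    """Resolve src against the page URL and return (absolute_url, file_name)."""
    absolute = urljoin(page_url, src)          # handles relative src values
    path = urlsplit(absolute).path             # drops ?query and #fragment
    return absolute, posixpath.basename(path)  # text after the last "/"


page = "https://www.example.com/list/index.htm"
print(image_filename(page, "https://img.example.com/a/b/cat.jpg"))
print(image_filename(page, "/thumb/dog.jpg?v=2"))
```

Resolving against the page URL first means the same helper works whether the site emits absolute or relative `src` values.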

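XPath

XPath is a language for searching for content in XML documents.

HTML is not strictly a subset of XML, but it has a similar tree structure, so XPath can also be used on HTML once a parser such as lxml has turned the page into an element tree.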
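To make path-based searching concrete, here is a small sketch on a well-formed XML snippet. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath, whereas the `lxml` code later in this post supports XPath 1.0. The document content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up XML document used only for illustration.
XML = """
<book>
  <name>Sample Title</name>
  <author>
    <nick id="1">Alice</nick>
    <nick id="2">Bob</nick>
  </author>
  <price>9.90</price>
</book>
"""

root = ET.fromstring(XML)

# Child path relative to the root element.
name = root.find("./name").text
# All <nick> elements under <author>.
nicks = [n.text for n in root.findall("./author/nick")]
# Positional predicate: the first <nick> only (XPath positions start at 1).
first_nick = root.find("./author/nick[1]").text
# Attribute predicate, searched at any depth below the root.
second = root.find(".//nick[@id='2']").text

print(name, nicks, first_nick, second)
```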

1. Fetch the page source.

2. Extract and parse the data.

Code:

import requests
from lxml import etree

url = "https://beijing.zbj.com/search/f/?kw=saas"
resp = requests.get(url)
page_source = resp.text  # read the body before releasing the connection
resp.close()
# print(page_source)

# Parse
html = etree.HTML(page_source)
# Get the div of every service provider
divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div")
for div in divs:
    # xpath() returns a list of matching text nodes
    price = div.xpath("./div/div/a/div[2]/div[1]/span[1]/text()")
    title = div.xpath("./div/div/a[1]/div[2]/div[2]/p/text()")
    print(price, title)

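Output:

Debugging process:

As shown in the figure, use the element picker to locate the block you want:

In step 3, select the outermost div of the block found in step 2. Once that div is identified, it looks like this:

With that div selected, right-click and choose Copy XPath, which gives: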

/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div[1]

Because we want not only this block but every similar block, drop the final index and change the XPath to:

/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div
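The effect of dropping the trailing `[1]` can be checked on a small, made-up HTML fragment. This is a sketch, assuming `lxml` is installed; the markup is hypothetical and much shallower than the real page.

```python
from lxml import etree

# Hypothetical fragment: three sibling <div> blocks inside one container.
HTML = """
<html><body>
  <div id="list">
    <div>provider A</div>
    <div>provider B</div>
    <div>provider C</div>
  </div>
</body></html>
"""

tree = etree.HTML(HTML)

# With an index, only the first matching sibling is selected.
first_only = tree.xpath("/html/body/div/div[1]")
# Without the index, every sibling <div> at that level is selected.
all_blocks = tree.xpath("/html/body/div/div")

print(len(first_only), len(all_blocks))
print([d.text for d in all_blocks])
```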

To reach the price, walk down through the nested tags one level at a time, which gives this relative path:

./div/div/a/div[2]/div[1]/span[1]/text()
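A relative path that starts with `./` is evaluated from the current element, and a `text()` query always returns a list, which may be empty. A minimal sketch on made-up markup follows; the fragment, the `first_text` helper, and the values are hypothetical.

```python
from lxml import etree

# Hypothetical fragment: the second block has no price.
HTML = """
<div id="list">
  <div><span>100</span><p>Service A</p></div>
  <div><p>Service B</p></div>
</div>
"""

tree = etree.HTML(HTML)


def first_text(node, path, default=""):
    """Return the first stripped text match for a relative XPath, or a default."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


rows = []
for block in tree.xpath("//div[@id='list']/div"):
    price = first_text(block, "./span[1]/text()")
    title = first_text(block, "./p/text()")
    rows.append((title, price))

print(rows)
```

Taking the first match with a default avoids an `IndexError` when one block is missing a field.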
