XPath, bs4, and re methods for Python web scraping

Latest recommended article published 2022-10-17 11:08:56

the hornets


Article link: https://blog.csdn.net/xiaojiang0918/article/details/83153443


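1. re regular expressions

import re

# Approach: find the opening and closing tags that bracket the content you want, capture everything between them, then parse each captured block in turn.

pat = r'<div class="post floated-thumb">(.*?)<p class="align-right"><span class="read-more">'

# findall collects every match into a list. HTML is the page source as a string.

divlist = re.findall(pat, HTML, re.S)  # re.S: "." matches every character, including newlines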
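To see what re.S changes, here is a minimal sketch run against a small, made-up page. The sample markup and the second pattern that pulls out links are assumptions for illustration only; they are not from the original site.

```python
import re

# Hypothetical sample markup standing in for a downloaded page.
HTML = """<div class="post floated-thumb">
  <a href="/post/1">First post</a>
<p class="align-right"><span class="read-more">more</span></p>
</div>
<div class="post floated-thumb">
  <a href="/post/2">Second post</a>
<p class="align-right"><span class="read-more">more</span></p>
</div>
"""

pat = r'<div class="post floated-thumb">(.*?)<p class="align-right"><span class="read-more">'

# Without re.S, "." stops at a newline, so a block spread over lines never matches.
no_flag = re.findall(pat, HTML)

# With re.S, each captured group spans the lines between the two marker tags.
divlist = re.findall(pat, HTML, re.S)

# Second pass over each captured block: extract the link target and link text.
links = [re.search(r'<a href="(.*?)">(.*?)</a>', block).groups() for block in divlist]

print(no_flag)
print(links)
```

The non-greedy (.*?) matters as much as the flag: a greedy (.*) with re.S would run from the first opening tag to the last closing marker and return one oversized block.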

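2. XPath (built into Scrapy)

next_url = response.xpath("//li[@class='next']/a/@href").extract()[0]  # href of the next-page link; raises IndexError when nothing matches

extract(): serializes the matched nodes to unicode strings and returns them as a list.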
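The same path can be tried outside a Scrapy project. The sketch below swaps in the standard library's xml.etree.ElementTree, not Scrapy's selector, so it only supports a limited XPath subset and needs well-formed markup. The sample page is made up. It also shows one way to handle the case where no match exists, which a bare [0] does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed snippet standing in for a page's pagination block.
PAGE = """<html><body>
  <ul class="pager">
    <li class="previous"><a href="/page/1/">Previous</a></li>
    <li class="next"><a href="/page/3/">Next</a></li>
  </ul>
</body></html>"""

root = ET.fromstring(PAGE)

# ElementTree's XPath subset has no "/@href" step, so select the <a>
# elements first and read the attribute from each one.
hrefs = [a.get("href") for a in root.findall(".//li[@class='next']/a")]

# Indexing [0] on an empty list raises IndexError (for example on the last
# page, which has no "next" link); falling back to None makes that explicit.
next_url = hrefs[0] if hrefs else None

print(next_url)
```

In a spider, a None result is a natural signal to stop following pagination.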

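3. bs4

import os
import urllib.request
from bs4 import BeautifulSoup

bsoup = BeautifulSoup(dataopen, "html.parser")  # dataopen: the page HTML (a string or an open file)

datas = bsoup.find_all("div", {"class": "reveal-work-wrap"})  # get every matching tag, then loop over them and parse each one
for x in datas:
    print(x)
    childimg = x.find("img").get("src")  # image URL from the first <img> inside this div
    pathpic1 = childimg.split("/")[-1]  # file name = last segment of the URL
    filepath1 = os.path.join(r"D:\putweb", pathpic1)  # raw string keeps the backslash literal
    urllib.request.urlretrieve(childimg, filepath1)  # download the image to that path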
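The loop above assumes every src is an absolute URL whose last segment is a clean file name. A minimal sketch of a more defensive version of that step follows, using only the standard library. The helper name local_path_for, the base URL, and the downloads directory are hypothetical and chosen for illustration.

```python
import os
from urllib.parse import urljoin, urlsplit


def local_path_for(src, base_url, save_dir):
    """Return (absolute_url, local_file_path) for an <img> src value."""
    # Resolve relative or root-relative src values against the page URL.
    absolute = urljoin(base_url, src)
    # Take the file name from the path component only, so a query string
    # or fragment does not end up in the saved file name.
    filename = os.path.basename(urlsplit(absolute).path)
    return absolute, os.path.join(save_dir, filename)


# Hypothetical inputs: a root-relative src with a query string.
url, path = local_path_for("/img/a.jpg?w=200", "https://example.com/works/", "downloads")
print(url)
print(path)
```

The returned pair can be passed straight to urllib.request.urlretrieve(url, path) in place of childimg and filepath1.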
