Four ways to parse a page fetched with requests (regex, bs4, XPath, parsel)

努力成为大牛吧

Category: python爬虫 (Python web scraping). Tags: xpath, regular expressions, css, python

Article link: https://blog.csdn.net/qq_40741909/article/details/106838106

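Parts of this post are adapted from: https://blog.csdn.net/qiushuidongshi/article/details/81252838

#0x00 Fetching a page with requests

import requests

r = requests.get('https://book.douban.com/')
content = r.text

#0x01 Parsing method 1: regular expressions

import re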
# Capture title, author, year and publisher for each book.
# re.S lets "." match newlines, so one match can span several lines of HTML.
pattern = re.compile(r'<h4.*?>(.*?)</h4>.*?<p>.*?author">.*?(.*?)</span>.*?year">.*?(.*?)</span>.*?publisher">.*?(.*?)</span>.*?</p>', re.S)

results = pattern.findall(content)
print(results)
for result in results:
    name, author, time, chuban = result  # chuban = publisher
    # Remove every whitespace character (spaces, tabs, newlines) from each field
    name = re.sub(r"\s", '', name)
    author = re.sub(r"\s", '', author)
    time = re.sub(r"\s", '', time)
    chuban = re.sub(r"\s", '', chuban)
    print(name, author, time, chuban)
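
The loop above deletes every whitespace character, which also removes the spaces inside multi-word titles. A minimal sketch of an alternative follows, assuming a hypothetical HTML fragment (`SAMPLE_HTML`) shaped like what the pattern expects; the real Douban markup may differ. The helper name `parse_books` is made up for illustration, and the pattern is a slightly simplified variant of the one above.

```python
import re

# Hypothetical fragment for illustration only; the live page may look different.
SAMPLE_HTML = """
<ul>
  <li>
    <h4 class="title">
      Book A
    </h4>
    <p>
      <span class="author">
        Author A
      </span>
      <span class="year">2019</span>
      <span class="publisher">Press A</span>
    </p>
  </li>
  <li>
    <h4 class="title">Book B</h4>
    <p>
      <span class="author">Author B</span>
      <span class="year">2020</span>
      <span class="publisher">Press B</span>
    </p>
  </li>
</ul>
"""

# One capture group per field; re.S lets "." cross line breaks.
BOOK_PATTERN = re.compile(
    r'<h4.*?>(.*?)</h4>.*?<p>.*?author">(.*?)</span>'
    r'.*?year">(.*?)</span>.*?publisher">(.*?)</span>.*?</p>',
    re.S,
)


def parse_books(html):
    """Return one dict per book, trimming only leading/trailing whitespace."""
    books = []
    for name, author, year, publisher in BOOK_PATTERN.findall(html):
        books.append({
            "name": name.strip(),
            "author": author.strip(),
            "year": year.strip(),
            "publisher": publisher.strip(),
        })
    return books


books = parse_books(SAMPLE_HTML)
for book in books:
    print(book)
```

Using `strip()` instead of `re.sub(r"\s", "", ...)` keeps "Book A" as two words; which behavior is preferable depends on how the data will be used.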

#0x02 Parsing method 2: bs4


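import re

from bs4 import BeautifulSoup

html = r.content
soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
print(type(soup))
# find_all() is the current name for findAll(); string= replaces the older text= argument.
# Each call returns a list of Tag objects whose .string matches the regex.
name = soup.find_all(name='h4', class_='title', string=re.compile(".*?"))
author = soup.find_all(name='span', class_='author', string=re.compile(".*?"))
time = soup.find_all(name='span', class_='year', string=re.compile(".*?"))
chuban = soup.find_all(name='span', class_='publisher', string=re.compile(".*?"))  # chuban = publisher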
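
The calls above return lists of `Tag` objects rather than plain strings. A minimal sketch of turning them into rows of text follows, assuming a hypothetical HTML fragment (`SAMPLE_HTML`) instead of the live page, and using the standard-library `html.parser` backend so the sketch does not depend on lxml. The helper name `texts` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only; the live page may look different.
SAMPLE_HTML = """
<ul>
  <li>
    <h4 class="title">Book A</h4>
    <p>
      <span class="author"> Author A </span>
      <span class="year">2019</span>
      <span class="publisher">Press A</span>
    </p>
  </li>
  <li>
    <h4 class="title">Book B</h4>
    <p>
      <span class="author">Author B</span>
      <span class="year">2020</span>
      <span class="publisher">Press B</span>
    </p>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def texts(tags):
    """Convert a list of Tag objects into a list of stripped strings."""
    return [tag.get_text(strip=True) for tag in tags]


names = texts(soup.find_all("h4", class_="title"))
authors = texts(soup.find_all("span", class_="author"))
years = texts(soup.find_all("span", class_="year"))
publishers = texts(soup.find_all("span", class_="publisher"))

# Pair the four parallel lists into one tuple per book.
rows = list(zip(names, authors, years, publishers))
for row in rows:
    print(row)
```

Note that `zip` pairs the lists by position, so this only lines up correctly when every book has all four fields.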

#0x03 Parsing method 3: XPath

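from lxml import etree

html = r.content
tree = etree.HTML(html)  # parse the raw bytes into an element tree
# Each xpath() call returns a list of text nodes for every matching element on the page
name = tree.xpath("//h4/text()")
author = tree.xpath("//span[@class='author']/text()")
time = tree.xpath("//span[@class='year']/text()")
chuban = tree.xpath("//span[@class='publisher']/text()")  # chuban = publisher
print(name, author, time, chuban)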
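
The four page-wide queries above produce four independent lists, which can fall out of step if a book is missing a field. A minimal sketch of one way around that follows: select each book's container first, then run relative XPath queries inside it. It assumes a hypothetical HTML fragment (`SAMPLE_HTML`) where each book sits in its own `<li>`; the real page structure may differ, and `first_text` is a made-up helper.

```python
from lxml import etree

# Hypothetical fragment for illustration only; the second book has no year on purpose.
SAMPLE_HTML = """
<ul>
  <li>
    <h4 class="title">Book A</h4>
    <p>
      <span class="author">Author A</span>
      <span class="year">2019</span>
      <span class="publisher">Press A</span>
    </p>
  </li>
  <li>
    <h4 class="title">Book B</h4>
    <p>
      <span class="author">Author B</span>
      <span class="publisher">Press B</span>
    </p>
  </li>
</ul>
"""

tree = etree.HTML(SAMPLE_HTML)


def first_text(node, path):
    """Return the first text node matched by a relative XPath, stripped, or ''."""
    found = node.xpath(path)
    return found[0].strip() if found else ""


books = []
# "//li[h4]" selects every <li> that has an <h4> child, i.e. one node per book.
for li in tree.xpath("//li[h4]"):
    books.append((
        first_text(li, "./h4/text()"),
        first_text(li, ".//span[@class='author']/text()"),
        first_text(li, ".//span[@class='year']/text()"),
        first_text(li, ".//span[@class='publisher']/text()"),
    ))

for book in books:
    print(book)
```

Because every query starts from the book's own node (the leading `.`), a missing field becomes an empty string for that book instead of shifting later values.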

#0x04 Parsing method 4: parsel

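import requests
import parsel

url = 'https://book.douban.com/'  # the same page fetched in 0x00
response = requests.get(url)
sel = parsel.Selector(response.text)  # note the capital S in Selector

# Regular expressions
# print(sel.re('regex pattern'))

# XPath
# print(sel.xpath('xpath expression').getall())  # getall() returns every match

# CSS selectors
# print(sel.css('css selector ::text').extract_first())  # extract_first() returns the first match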
