Python Web Scraping (2): Basic Use of Parsing Libraries

weixin_39951396

Article link: https://blog.csdn.net/weixin_39951396/article/details/111443242

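After fetching a web page's response content, the next step is to parse and filter it to extract the data we want. Commonly used tools for this are:

Regular expressions (re)

lxml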

Beautiful Soup

pyquery

JsonPath

Sample response content

The examples below use an excerpt of the page at http://quotes.toscrape.com/. Assume the response body has been stored as a string in a variable named content.
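The page markup itself is not reproduced in this article. For experimenting offline, the sketch below defines content as a small hand-written fragment. Its tags and class names are an assumption modelled on the selectors used in the later examples (a small element with class author); it is not a copy of the live response.

```python
# Hypothetical, abbreviated stand-in for the response body of
# http://quotes.toscrape.com/ (assumed structure, not the real page).
content = """
<div class="quote">
    <span class="text">"Quote one."</span>
    <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
    <span class="text">"Quote two."</span>
    <span>by <small class="author">J.K. Rowling</small></span>
</div>
"""

# Quick sanity check: each quote block carries exactly one author tag.
author_tag_count = content.count('<small class="author">')
print(author_tag_count)  # 2
```

With this stand-in, the later examples would print only the two authors above rather than the full list from the real page.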

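1. Regular expressions (re)

Use re, the regular expression module built into Python.

For example, to extract the names of all the authors on the page: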

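import re

# Capture the text inside each <small class="author" ...> tag.
# The non-greedy (.*?) stops at the first closing </small>.
pat = re.compile(r'<small class="author"[^>]*>(.*?)</small>')
print(pat.findall(content))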

Output: ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
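One detail of (.*?) patterns is worth a quick illustration: by default "." does not match a newline, so text that wraps across lines inside a tag is silently skipped unless re.S is passed. The sketch below shows this on a made-up fragment; sample_html and its markup are hypothetical and not taken from the real page.

```python
import re

# Hypothetical fragment: the second author's name wraps over two lines.
sample_html = (
    '<small class="author">Albert Einstein</small>\n'
    '<small class="author">Jane\nAusten</small>\n'
)

pattern = r'<small class="author"[^>]*>(.*?)</small>'

# Without re.S, "." stops at newlines, so the wrapped name is missed.
single_line = re.findall(pattern, sample_html)

# With re.S, "." also matches newlines; collapse the whitespace afterwards.
multi_line = [
    " ".join(name.split())
    for name in re.findall(pattern, sample_html, flags=re.S)
]

print(single_line)  # ['Albert Einstein']
print(multi_line)   # ['Albert Einstein', 'Jane Austen']
```

This fragility is one reason the structure-aware libraries in the following sections are usually preferable for anything beyond simple extractions.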

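2. lxml

lxml supports parsing with XPath. So what is XPath?

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path or a sequence of steps.

Using the same quotes page as the example, first install lxml (pip install lxml):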

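from lxml import etree

# Parse the HTML string into an element tree
html = etree.HTML(content)
# Select the text nodes of every <small class="author"> element
authors = html.xpath("//small[@class='author']//text()")
print(authors)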
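To make the idea of paths and steps concrete without any third-party dependency, the sketch below uses the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. It is not lxml, and it needs well-formed XML rather than real-world HTML; xml_doc is a hypothetical fragment written for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment (ElementTree requires valid XML).
xml_doc = """
<page>
    <div class="quote">
        <small class="author">Albert Einstein</small>
    </div>
    <div class="quote">
        <small class="author">Jane Austen</small>
        <small class="born">1775</small>
    </div>
</page>
""".strip()

root = ET.fromstring(xml_doc)

# ".//small" visits every descendant <small>; the predicate keeps
# only those whose class attribute equals "author".
authors = [node.text for node in root.findall(".//small[@class='author']")]
print(authors)  # ['Albert Einstein', 'Jane Austen']

# An explicit two-step path: child <div> elements, then their <small> children.
all_small = [node.text for node in root.findall("./div/small")]
print(all_small)  # ['Albert Einstein', 'Jane Austen', '1775']
```

The first query corresponds to the //small[@class='author'] expression used with lxml, while the second shows how each step narrows the selection one level at a time.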

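3. Beautiful Soup

Beautiful Soup is another Python library for parsing HTML and XML. Its main purpose is to extract the data we need from web pages.

First install it with pip install beautifulsoup4. The example below passes "lxml" as the parser, so lxml must be installed as well.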

from bs4 import BeautifulSoup

# Parse with the lxml parser
soup = BeautifulSoup(content, "lxml")
# CSS selector: every <small> element with class "author"
authors = soup.select('small.author')
for author in authors:
    print(author.get_text())

4. pyquery

pyquery's syntax is almost identical to that of jQuery on the front end.

from pyquery import PyQuery as pq

doc = pq(content)
# jQuery-style CSS selector
authors = doc('small.author')
# .items() yields each match wrapped as a PyQuery object
for author in authors.items():
    print(author.text())

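5. JSONPath

JSONPath is typically used when the response body is JSON data.

Syntax (XPath expressions alongside their JSONPath equivalents):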

XPath                 JSONPath                               Result
--------------------  -------------------------------------  ------------------------------------------------------------
/store/book/author    $.store.book[*].author                 the authors of all books in the store
//author              $..author                              all authors
/store/*              $.store.*                              all things in store, which are some books and a red bicycle
/store//price         $.store..price                         the price of everything in the store
//book[3]             $..book[2]                             the third book
//book[last()]        $..book[(@.length-1)] or $..book[-1:]  the last book in order
//book[position()<3]  $..book[0,1] or $..book[:2]            the first two books
//book[isbn]          $..book[?(@.isbn)]                     filter all books with an isbn number
//book[price<10]      $..book[?(@.price<10)]                 filter all books cheaper than 10
//*                   $..*                                   all elements in the XML document; all members of the JSON structure

Here we use a sample JSON document.

Let's retrieve all the authors and all the prices:

import json
import jsonpath  # third-party package: pip install jsonpath

json_str = '''
{ "store": {
    "book": [
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}
'''

# Convert the JSON text into Python dicts and lists
jc = json.loads(json_str)

# Every "author" value, at any depth
jp = jsonpath.jsonpath(jc, '$..author')
print(jp)

# Every "price" value anywhere under "store"
jp = jsonpath.jsonpath(jc, '$.store..price')
print(jp)

Output:

['Nigel Rees', 'Evelyn Waugh', 'Herman Melville', 'J. R. R. Tolkien']

[8.95, 12.99, 8.99, 22.99, 19.95]
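To show what the ".." (recursive descent) operator does, here is a minimal pure-Python sketch that walks nested dicts and lists and collects every value stored under a given key. The helper find_all and the shortened data document are hypothetical names introduced for this illustration; this is not how the jsonpath package is implemented, only an approximation of the behaviour of $..key.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth (like $..key)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: matches may also sit deeper inside v.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical miniature of the store document.
data = {
    "store": {
        "book": [
            {"author": "Nigel Rees", "price": 8.95},
            {"author": "Evelyn Waugh", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

authors = find_all(data, "author")          # like $..author
prices = find_all(data["store"], "price")   # like $.store..price
print(authors)  # ['Nigel Rees', 'Evelyn Waugh']
print(prices)   # [8.95, 12.99, 19.95]
```

The bicycle price appears last because the walk visits keys in document order, which matches the ordering of the prices in the output above.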
