Python Web Scraping Basics: Regular Expressions, XPath, and bs4 (Detailed)

Latest recommended article published on 2024-06-22 09:01:59

努力生活的黄先生


Categories: Python, Web Scraping. Tags: python, regular expressions, xpath, strings

Article link: https://blog.csdn.net/Java_KW/article/details/116378515

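Scraping Steps

A scraper has four main steps:

1. Define the target: decide which website you need to scrape.

2. Fetch: download the content of the target pages.

3. Extract: pull out the data you actually need.

4. Process: store and use the data in the format you want.

Start with fetching. Pages are usually requested with GET. POST is used to send data to the server, for example submitting a username and password for verification.

The most commonly used request library for scraping is requests. It sends the HTTP request and gives you back the content the site returns.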

import requests

url = 'http://www.baidu.com'

response = requests.get(url)

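requests.get(url, headers=headers): url and headers are the two most commonly used arguments. The headers dict can carry Cookie and User-Agent values, which helps get past basic anti-scraping checks. There are other ways to counter anti-scraping measures as well, such as using proxy IPs.

response.text
# The response body decoded to a str, ready for the extraction step.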
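As a minimal sketch of how url, headers and query parameters fit together, the snippet below builds a request with requests but does not send it, so it runs without network access. The header value and the params dict are illustrative assumptions; a real site may expect different ones.

```python
import requests

# Illustrative header value; a real site may need others (e.g. a Cookie).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) "
                  "Gecko/20100101 Firefox/88.0",
}

# Passing query parameters separately lets requests percent-encode them.
params = {"cityKw": "郑州", "jobKw": "数据"}

# Build the request without sending it, so nothing touches the network.
prepared = requests.Request(
    "GET", "https://www.jobui.com/jobs", headers=headers, params=params
).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])

# To actually send it:
# response = requests.get("https://www.jobui.com/jobs",
#                         headers=headers, params=params, timeout=10)
```

Letting requests encode the parameters avoids hand-writing percent-encoded URLs, and a timeout keeps a scraper from hanging on a slow server.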

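There are three ways to extract data: regular expressions, XPath, and BeautifulSoup4. Each is practiced below. All the exercises use the same site, 职友集 (jobui.com), which is relatively easy to scrape and has few anti-scraping measures.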

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Set the package index to a mirror in mainland China
pip list
# List the packages installed with pip

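Regular Expressions

What is a regular expression?

Regular expressions are generally used to search for, or replace, text that matches a given rule.

More formally: a regular expression is a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic over strings.

With regular expressions we can:

1. Match: check whether a given string satisfies the filtering logic of the regular expression.

2. Filter: extract the specific parts we need from a text according to the regular expression.

Regular expression rules

Syntax	Description	Example expression	Fully matched string

Ordinary characters	Match themselves	abc	abc
.	Matches any character except the newline \n (with the re.S flag it matches \n too)	a.c	abc
\	Escape character: makes the following special character match literally, e.g. \*	a\.c	a.c
[...]	Character set: matches any one character in the set. Characters can be listed one by one or given as a range, e.g. [abc] and [a-z]. If the first character is ^, the set is negated, e.g. [^abc] matches any character other than a, b and c. To use ], - or ^ literally inside a set, escape them with a backslash, or put ] and - first and put ^ anywhere but first.	a[bsd]c	abc, asc, adc

\d	Digit: [0-9]
\D	Non-digit: [^\d]
\s	Whitespace: [<space>\t\r\n\f\v]
\S	Non-whitespace: [^\s]
\w	Word character: [A-Za-z0-9_] (for str patterns in Python 3 it also matches other Unicode word characters, such as Chinese characters, unless re.ASCII is set)
\W	Non-word character: [^\w]

*	Matches the preceding character 0 or more times
+	Matches the preceding character 1 or more times
?	Matches the preceding character 0 or 1 times
{m}	Matches the preceding character exactly m times
{m,n}	Matches the preceding character m to n times; either m or n may be omitted
*? +? ?? {m,n}?	Make *, +, ? and {m,n} non-greedy

^	Matches the start of the string; in multiline mode (re.M) it matches the start of every line
$	Matches the end of the string; in multiline mode it matches the end of every line
\A	Matches only at the start of the string
\Z	Matches only at the end of the string
\b	Matches a word boundary: the position between a \w and a \W character (or between a \w character and the start or end of the string)
\B	Matches any position that is not a word boundary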
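The two flags that are easiest to mix up are re.M (MULTILINE), which changes ^ and $, and re.S (DOTALL), which changes the dot. The short sketch below, on a made-up two-line string, shows the difference.

```python
import re

text = "first line\nsecond line"

# Without flags, ^ only matches at the very start of the string.
default_starts = re.findall(r"^\w+", text)

# With re.M (MULTILINE), ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.M)

# Without flags, "." does not match "\n", so this search finds nothing.
no_dotall = re.search(r"line.second", text)

# With re.S (DOTALL), "." matches "\n" as well.
dotall = re.search(r"line.second", text, re.S)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second']
print(no_dotall)         # None
print(repr(dotall.group()))  # 'line\nsecond'
```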


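Python's re module

Python ships with the built-in re module for working with regular expressions.

Using a regular expression generally takes three steps:

1. Use compile() to compile the regular expression string into a Pattern object.

2. Use the methods of the Pattern object to search the text and get the result, a Match object.

3. Use the attributes and methods of the Match object to get the information you need.

The compile function

import re

p = re.compile(r"\d+")

The regular expression is now compiled into a Pattern object.

Commonly used Pattern methods:

match: tries at the start position only, one match

search: scans from any position, one match

findall: scans from any position, all matches, returns a list

finditer: scans from any position, all matches, returns an iterator

split: splits the string, returns a list

sub: replaces matches, returns a string

The match method

match looks for a match at the beginning of the string (or at a position you specify). It matches once and returns as soon as that match succeeds. The general form is:

match(string[, pos[, endpos]])

Here string is the string to match, and pos and endpos are the start and end positions. They default to 0 and len(string).

On success match returns a Match object; if nothing matches it returns None.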

import re

p = re.compile(r'\d+')

s = '123as'

m = p.match(s)
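# m is a Match object

m.group(0)
# '123'

m.start(0)
# 0, the start position

m.end(0)
# 3, the index just past the last matched character

m.span(0)
# (0, 3), the start and end positions of the match

When the match succeeds, a Match object is returned. On it:

group([group1, ...]) returns the string matched by one or more groups. To get the whole matched substring, use group() or group(0).
start([group]) returns the start position of the substring matched by the group within the whole string (the index of its first character). The argument defaults to 0.
end([group]) returns the end position of the substring matched by the group within the whole string (the index of its last character + 1). The argument defaults to 0.
span([group]) returns (start(group), end(group)).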

For example:

import re
p = re.compile(r'([a-z]+) ([a-z]+)', re.I)
# re.I makes the match case-insensitive

m = p.match('Hello World ni hao')

m.group(0)
# 'Hello World', the whole matched string

m.group(1)
# 'Hello', the first group

m.group(2)
# 'World', the second group

m.groups()
# ('Hello', 'World'), equivalent to (group(1), group(2), ...)

m.span(1)
# (0, 5), the position of the first group
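The pos and endpos arguments, and the None return value, are easy to overlook. The sketch below uses a made-up string to show how they behave and why the result should be checked before calling group().

```python
import re

p = re.compile(r"\d+")
s = "abc123def"

# match() only tries at the start position, and s starts with "a".
at_start = p.match(s)

# With pos=3 the attempt starts at index 3, where the digits begin.
at_pos = p.match(s, 3)

# endpos=5 makes the pattern see only s[:5], so the match stops at "12".
limited = p.match(s, 3, 5)

# Check for None first: calling .group() on None raises AttributeError.
for m in (at_start, at_pos, limited):
    print(m.group() if m else "no match")
# no match
# 123
# 12
```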

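The search method

search also matches once, but it scans the string and can match at any position. The general form is:

search(string[, pos[, endpos]])

string is the string to search; pos and endpos are the start and end positions.

Likewise, a successful match returns a Match object; otherwise None is returned.

import re
p = re.compile(r'\d+')

s = 'asd123asd'

m = p.search(s)
# m is a Match object

m.group()
# '123'

m.span()
# (3, 6)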

The findall method

match and search both match once. findall finds every substring that matches the rule and returns all the results as a list.

The general form is:

findall(string[, pos[, endpos]])

string is the string to search; pos and endpos are optional and give the start and end positions, defaulting to 0 and len(string).

findall returns all matching substrings as a list. If nothing matches, it returns an empty list.

Example:

import re

p = re.compile(r'\d+')
s = 'sd123fg456fg'

p.findall(s)
# ['123', '456']


p1 = re.compile(r'\d+\.\d*')
s1 = "3.14, 3.1415926, 3"

p1.findall(s1)
# ['3.14', '3.1415926']
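What findall returns depends on the capture groups in the pattern, which matters for the scraping script later in this article, where it indexes into tuples. A small sketch on a hypothetical HTML fragment:

```python
import re

# Hypothetical fragment, only for illustration.
html = "<li><span>Salary:</span>10k</li><li><span>City:</span>Zhengzhou</li>"

# No group: each item is the whole matched text.
whole = re.findall(r"<span>.*?</span>", html)

# One group: each item is just that group's text.
labels = re.findall(r"<span>(.*?)</span>", html)

# Two or more groups: each item is a tuple with one entry per group.
pairs = re.findall(r"<span>(.*?)</span>(.*?)</li>", html)

print(whole)   # ['<span>Salary:</span>', '<span>City:</span>']
print(labels)  # ['Salary:', 'City:']
print(pairs)   # [('Salary:', '10k'), ('City:', 'Zhengzhou')]
```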

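The finditer method

finditer is similar to findall: both find every match. The difference is that findall returns a list of strings, while finditer returns an iterator that yields Match objects.

import re

p = re.compile(r'\d+')
s = 'as123as456'

m = p.finditer(s)

for i in m:
    print(i)  # prints the Match object itself, e.g. <re.Match object; span=(2, 5), match='123'>
    print(i.group())  # prints the matched text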

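The split method

split cuts the string at every substring that matches the pattern and returns the pieces as a list.

split(string[, maxsplit])

maxsplit sets the maximum number of splits. If it is not given, the string is split at every match.

import re
p = re.compile(r'[,;\s]+')
s = 'a,b;;c  d'

p.split(s)
# ['a', 'b', 'c', 'd']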

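The sub method

sub replaces the substrings that match the rule.

sub(repl, string[, count])

repl can be a string or a function:

If repl is a string, every matched substring is replaced with repl and the resulting string is returned. repl can also refer to groups with \id or \g<id> (or \g<name> for named groups). The number 0 cannot be used in the \id form; use \g<0> for the whole match.
If repl is a function, it should accept exactly one argument (a Match object) and return the replacement string (group references inside the returned string are not expanded).
count sets the maximum number of replacements. If it is not given, every match is replaced.

import re
p = re.compile(r'(\w+) (\w+)')  # \w = [A-Za-z0-9_]
s = 'hello 123,world 456'

p.sub('hello k',s)
# 'hello k,hello k'

p.sub(r'\2 \1',s)
# '123 hello,456 world'

def func(m):
    return 'hi ' + m.group(2)

p.sub(func,s)
# 'hi 123,hi 456'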
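A few more sub features in one place: named group references, \g<0>, the count argument, and a replacement function that computes its result. The date strings and the helper name next_year are made up for this sketch.

```python
import re

s = "2021-05-01, 2024-06-22"
p = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups are referenced with \g<name> in the replacement string.
swapped = p.sub(r"\g<day>/\g<month>/\g<year>", s)

# \g<0> stands for the whole match.
bracketed = p.sub(r"[\g<0>]", s)

# count=1 replaces only the first match.
first_only = p.sub("DATE", s, count=1)

def next_year(m):
    """Return the matched date with its year increased by one."""
    return str(int(m.group("year")) + 1) + m.group(0)[4:]

bumped = p.sub(next_year, s)

print(swapped)     # 01/05/2021, 22/06/2024
print(bracketed)   # [2021-05-01], [2024-06-22]
print(first_only)  # DATE, 2024-06-22
print(bumped)      # 2022-05-01, 2025-06-22
```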

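Matching Chinese

Sometimes we want to match the Chinese characters in a text. Note that Chinese characters mainly fall in the Unicode range [\u4e00-\u9fa5]. "Mainly" because the range is not complete: for example, it does not include full-width (Chinese) punctuation. In most cases it is enough.

Suppose we want to extract the Chinese from the string t = '你好,hello 世界'. We can do this:

import re
t = '你好,hello 世界'

p = re.compile('[\u4e00-\u9fa5]+')
p.findall(t)
# ['你好', '世界']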
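The following sketch checks the two points made above on a string that uses full-width commas: the range picks up only the Chinese characters, and the full-width punctuation falls outside it. It also shows, as a side note, that \w in a str pattern is Unicode-aware.

```python
import re

title = "你好，hello，世界"  # the commas are full-width (U+FF0C)

p = re.compile(r"[\u4e00-\u9fa5]+")

# Only the runs of Chinese characters are returned.
chinese = p.findall(title)

# The full-width comma is not inside the range, so it never matches.
comma_matches = p.search("，")

# \w matches Chinese and Latin word characters alike in str patterns.
words = re.findall(r"\w+", title)

print(chinese)        # ['你好', '世界']
print(comma_matches)  # None
print(words)          # ['你好', 'hello', '世界']
```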

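Note: greedy and non-greedy mode

Greedy mode: match as much as possible while still letting the whole expression succeed (for example *).
Non-greedy mode: match as little as possible while still letting the whole expression succeed (for example *?).
Quantifiers in Python are greedy by default.

Example 1. Source string: abbbc

With the greedy quantifier in ab*, the match is abbb.

* matches as many b as possible, so every b after the a is included.
With the non-greedy quantifier in ab*?, the match is a.

Although * is present, the ? makes it match as few b as possible, so no b is included.

Example 2. Source string: aa<div>test1</div>bb<div>test2</div>cc

Greedy regular expression: <div>.*</div>
Match: <div>test1</div>bb<div>test2</div>

This is greedy mode. The whole expression could already succeed at the first "</div>", but because the quantifier is greedy the engine keeps trying to the right to see whether a longer substring can match. After the second "</div>" there is no longer substring that matches, so matching ends and the result is "<div>test1</div>bb<div>test2</div>".

Non-greedy regular expression: <div>.*?</div>
Match: <div>test1</div>

The second expression is non-greedy. The whole expression succeeds at the first "</div>", and because the quantifier is non-greedy the engine stops there and does not try further to the right. The result is "<div>test1</div>".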
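Both examples can be checked directly with re. The variable names in this sketch are arbitrary; the last line adds a capture group, which is the usual way to pull out just the text between two tags.

```python
import re

s1 = "abbbc"
greedy_b = re.search(r"ab*", s1).group()
lazy_b = re.search(r"ab*?", s1).group()

s2 = "aa<div>test1</div>bb<div>test2</div>cc"
greedy_div = re.findall(r"<div>.*</div>", s2)
lazy_div = re.findall(r"<div>.*?</div>", s2)

# A non-greedy group returns only the text between each pair of tags.
inner = re.findall(r"<div>(.*?)</div>", s2)

print(greedy_b)    # abbb
print(lazy_b)      # a
print(greedy_div)  # ['<div>test1</div>bb<div>test2</div>']
print(lazy_div)    # ['<div>test1</div>', '<div>test2</div>']
print(inner)       # ['test1', 'test2']
```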

Regex testing site

Scraping job listings with regular expressions

import requests
import re
import time

url = 'https://www.jobui.com/jobs?cityKw=%E9%83%91%E5%B7%9E&jobKw=%E6%95%B0%E6%8D%AE'
header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'
          ,'Cookie' : 'AXUDE_jobuiinfo=xPq3I7eEuU; jobui_p=1603107811866_15411384; TN_VisitCookie=95; jobui_user_passport=yk160310782641343; isloginStatus=hQDDe3yhKHw%3D; jobui_user_searchURL=http%3A%2F%2Fm.jobui.com%2Fjobs%3FjobKw%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%25E5%25B8%2588%26cityKw%3D%25E5%258C%2597%25E4%25BA%25AC%26experienceType%3D1-3%25E5%25B9%25B4%26sortField%3Dlast; jobui_area=%25E9%2583%2591%25E5%25B7%259E; Hm_lvt_8b3e2b14eff57d444737b5e71d065e72=1619854217; Hm_lpvt_8b3e2b14eff57d444737b5e71d065e72=1619854229; TN_VisitNum=2; PHPSESSID=ov59buf4dfqo8bmpbgjk06s0f0'}


response = requests.get(url,headers=header)

res = response.text
p = re.compile(r'class="job-name" href="(.*?)" target="_blank"')

l = p.findall(res)

urls = 'https://www.jobui.com'

j=0

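for i in l:
    # Build the absolute URL of the job detail page and fetch it
    url0 = urls+i
    response = requests.get(url0,headers=header)
    r0=response.text

    # Job title, (label, value) requirement pairs, and the job description
    title = re.findall(r'<h1 title=.*?style="margin-bottom: 20px;">(.*?)</h1>',r0)
    yaoqiu = re.findall(r'<span>(.*?)</span>[a-z<\d\s="]*>?(.*?)[</span>]*</li>',r0,re.I)
    gangwei = re.findall(r'style="word-break: break-all;">\r\n\t[\s]+(.*?)\r\n\t[\s]+</div>',r0,re.S)[0].replace('<br />','')

    # Join all requirement pairs into one block (+= keeps every pair, not only the last one)
    y=''
    for pair in yaoqiu:
        y += pair[0]+pair[1]+'\n'

    # Append this job to the output file
    with open('./工作.txt','a',encoding='utf-8') as f:
        f.write(title[0]+'\n')
        f.write(y+'\n')
        f.write(gangwei+'\n')
    time.sleep(1)  # pause between requests
    print('Saved '+str(j+1)+' job(s)')
    j+=1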

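XPath

When some data is awkward to extract with regular expressions, the HTML can be parsed into an XML-style element tree and queried with XPath syntax instead.

What is XML?

Official description:

XML stands for EXtensible Markup Language.
XML is a markup language, very similar to HTML.
XML is designed to carry data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

Sample XML document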

<?xml version="1.0" encoding="utf-8"?>

<bookstore> 

  <book category="cooking"> 
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price> 
  </book>  

  <book category="children"> 
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price> 
  </book>  

  <book category="web"> 
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price> 
  </book> 

  <book category="web" cover="paperback"> 
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price> 
  </book> 

</bookstore>

XML node relationships

Nodes are related to each other as parent, child, sibling, ancestor and descendant.

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
    
</bookstore>

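Taking the XML above as an example:

book is the parent of title, author, year and price.

title and year are children of book.

title and year are siblings.

bookstore is an ancestor of year and price.

year is a descendant of bookstore.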
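These relationships can be inspected from Python. The sketch below assumes lxml is installed (it is introduced later in this article) and uses its element methods getparent, itersiblings, iterancestors and iterdescendants on the same small document.

```python
from lxml import etree

xml = b"""<bookstore>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)
year = root.find("book/year")

# Parent of <year>.
parent = year.getparent().tag

# Children of <book>: iterating an element yields its child elements.
children = [child.tag for child in year.getparent()]

# Siblings that follow <year> in the document.
following = [sib.tag for sib in year.itersiblings()]

# Ancestors, nearest first.
ancestors = [anc.tag for anc in year.iterancestors()]

# Every descendant of the root, in document order.
descendants = [d.tag for d in root.iterdescendants()]

print(parent)       # book
print(children)     # ['title', 'author', 'year', 'price']
print(following)    # ['price']
print(ancestors)    # ['book', 'bookstore']
print(descendants)  # ['book', 'title', 'author', 'year', 'price']
```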

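What is XPath?

XPath is a language for querying XML data. It is used to navigate through the elements and attributes of an XML document.

XPath tools

Open-source XPath expression editor: XMLQuire (works with XML files)
Chrome extension: XPath Helper
Firefox extension: XPath Checker

XPath syntax: selecting nodes

XPath uses path expressions to select nodes or node-sets in an XML document, much like paths in an operating system's file system.

The most commonly used path expressions:

Expression	Description
nodename	Selects the child elements of the current node that are named nodename
/	Selects from the root node
//	Selects matching nodes anywhere in the document below the current node, regardless of their position
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes

Examples:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, no matter where they are under bookstore.
//@lang	Selects all attributes named lang.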

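XPath syntax: predicates

A predicate finds a specific node, or a node that contains a specific value. Predicates are written in square brackets [].

Examples:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have a lang attribute.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.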
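The predicate expressions can be tried out with lxml (assumed installed) on a trimmed copy of the bookstore document. The helper name titles is made up for this sketch; it appends /title/text() to each expression so the results are easy to read.

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

def titles(book_expr):
    """Return the title texts of the books selected by book_expr."""
    return root.xpath(book_expr + "/title/text()")

first = titles("/bookstore/book[1]")
last = titles("/bookstore/book[last()]")
second_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
expensive = titles("/bookstore/book[price>35.00]")
web_books = titles("/bookstore/book[@category='web']")

print(first)        # ['Everyday Italian']
print(last)         # ['Learning XML']
print(second_last)  # ['XQuery Kick Start']
print(first_two)    # ['Everyday Italian', 'Harry Potter']
print(expensive)    # ['XQuery Kick Start', 'Learning XML']
print(web_books)    # ['XQuery Kick Start', 'Learning XML']
```

Note that XPath positions start at 1, not 0.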

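XPath syntax: selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of bookstore.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.

XPath syntax: selecting several paths

Using the "|" operator in a path expression selects several paths at once.

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, and all price elements in the document.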
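A short sketch of wildcards and the union operator, again assuming lxml is installed and using a two-book excerpt of the bookstore document:

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# * matches any element: here, the direct children of bookstore.
child_tags = [el.tag for el in root.xpath("/bookstore/*")]

# //* selects every element in the document.
element_count = len(root.xpath("//*"))

# @* matches any attribute of the first book.
first_book_attrs = root.xpath("/bookstore/book[1]/@*")

# | merges the results of two paths into one node-set.
titles_and_prices = root.xpath("//book/title/text() | //book/price/text()")

print(child_tags)                # ['book', 'book']
print(element_count)             # 7
print(sorted(first_book_attrs))  # ['paperback', 'web']
print(len(titles_and_prices))    # 4
```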

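XPath operators

The operators that can be used in XPath expressions are listed below:

Operator	Description	Example
|	Union of two node-sets	//title | //price
+	Addition	6 + 4
-	Subtraction	6 - 4
*	Multiplication	6 * 4
div	Division	8 div 4
=	Equal to	price=9.80
!=	Not equal to	price!=9.80
<	Less than	price<9.80
<=	Less than or equal to	price<=9.80
>	Greater than	price>9.80
>=	Greater than or equal to	price>=9.80
or	Logical or	price=9.80 or price=9.70
and	Logical and	price>9.00 and price<9.90
mod	Remainder of a division	5 mod 2

That covers the XPath syntax. To use it for scraping in Python, the HTML first has to be parsed into an element tree.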

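The lxml library

Official description:

lxml is an HTML/XML parser. Its main job is parsing and extracting data from HTML/XML.

Like the regex engine, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and with the XPath syntax covered above we can quickly locate specific elements and node information.

lxml Python official documentation: http://lxml.de/index.html

It depends on C libraries, but it can be installed with pip: pip install lxml

We can use it to parse HTML documents.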

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
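         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
# Note: the last li above is deliberately left without its closing tag

html = etree.HTML(text)

print(etree.tostring(html).decode())

lxml repairs the HTML: in this example it adds the missing closing li tag and also wraps the fragment in html and body tags.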

XPath examples

Create a file named hello.html with the following content:

<!-- hello.html -->

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

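Read the file with etree.parse('hello.html'). The default parser is an XML parser, which works here because hello.html is well-formed. For real-world HTML, pass etree.HTMLParser() as the second argument.

from lxml import etree

html = etree.parse('hello.html')

r0 = html.xpath('//li')
# All li elements

r1 = html.xpath('//li/@class')
# The class attribute value of every li element
# ['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

r2 = html.xpath('//li/a[@href="link4.html"]/text()')
# The text of the a element that is a child of an li and whose href is link4.html
# ['fourth item']

r3 = html.xpath('//li//span')
# span elements under li. span is a descendant of li, not a child, so the double slash // is needed

r4 = html.xpath('//li/a//@class')
# All class attribute values found under the a children of li
# ['bold']

r5 = html.xpath('//li[last()]/a/@href')
# The href of the a element under the last li
# ['link5.html']

r6 = html.xpath('//li[last()-1]/a/text()')
# The text of the a element under the second-to-last li
# ['fourth item']

r7 = html.xpath('//*[@class="bold"]')
r7[0].tag
# The tag name of the element whose class is bold
# 'span'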
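In a scraper it is common to select a list of elements first and then run a relative XPath on each one, so that the fields of one record stay together. A minimal sketch, assuming lxml is installed and using a shortened copy of the list parsed from a string instead of a file:

```python
from lxml import etree

text = """
<div><ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul></div>
"""
html = etree.HTML(text)

rows = []
for li in html.xpath("//li"):
    # A leading "./" makes the expression relative to this li element.
    href = li.xpath("./a/@href")[0]
    # string() concatenates all descendant text, so the nested span is included.
    label = li.xpath("string(./a)")
    rows.append((li.get("class"), href, label))

for row in rows:
    print(row)
# ('item-0', 'link1.html', 'first item')
# ('item-1', 'link2.html', 'second item')
# ('item-inactive', 'link3.html', 'third item')
```

Using './a/text()' instead of string(./a) would return an empty list for the third item, because its text sits inside the span rather than directly in the a element.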

Scraping job listings with XPath

import requests
from lxml import etree
import random
import time

city = input('Enter the city to search: ')
job = input('Enter the job to search: ')
t = 0
while True:
    try :
        page = input('Enter the number of pages you want: ')
        page = int(page)
    except ValueError as e:
        print('The page count is not a valid integer, please try again.\n'+str(e)+'\n')
    else :
        print('Input accepted!')
        break


url2 = 'https://www.jobui.com'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'
          ,'Cookie' : 'AXUDE_jobuiinfo=xPq3I7eEuU; jobui_p=1603107811866_15411384; TN_VisitCookie=95; jobui_user_passport=yk160310782641343; isloginStatus=hQDDe3yhKHw%3D; jobui_user_searchURL=http%3A%2F%2Fm.jobui.com%2Fjobs%3FjobKw%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%25E5%25B8%2588%26cityKw%3D%25E5%258C%2597%25E4%25BA%25AC%26experienceType%3D1-3%25E5%25B9%25B4%26sortField%3Dlast; jobui_area=%25E9%2583%2591%25E5%25B7%259E; Hm_lvt_8b3e2b14eff57d444737b5e71d065e72=1619854217; Hm_lpvt_8b3e2b14eff57d444737b5e71d065e72=1619854229; TN_VisitNum=2; PHPSESSID=ov59buf4dfqo8bmpbgjk06s0f0'}


for n in range(page):
    url = "https://www.jobui.com/jobs?jobKw="+job+"&cityKw="+city+"&n="+str(n+1)
    res = requests.get(url,headers=headers)

    html = etree.HTML(res.text)
    a = html.xpath('//a[@class="job-name"]/@href')

    for i in a:
        # Fetch each job detail page
        url0=url2+i
        resp = requests.get(url0,headers=headers)
        ht = etree.HTML(resp.text)

        # Skip pages where the title cannot be found
        title = ht.xpath('//h1[@style="margin-bottom: 20px;"]/text()')
        if title==[]:
            continue
        else:
            title = title[0]
            t += 1

        # The requirement text nodes alternate label, value, so pair them up into a dict
        yao = ht.xpath('//ul[@class="laver cfix fs16"]//li//text()')
        b = dict(zip(yao[::2], yao[1::2]))
        gw= ht.xpath('//div[@class="hasVist cfix sbox fs16"]//text()')

        with open(city+job+'.txt','a',encoding='utf-8') as f:
            f.write(str(title) + '\n')
            for k,v in b.items():
                f.write(k + v + '\n')

            for j in gw:
                f.write(j.strip()+'\n')
            f.write('\n\n')

        print('Saved '+str(t)+' job(s)')
    # Random pause of up to 3 seconds between result pages
    time.sleep(random.random()*3)

print('Scraping finished')
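Two small techniques from the script can be looked at in isolation with the standard library only: pairing alternating labels and values with zip, and building URLs safely. The sample list and the detail-page path below are made up for illustration.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical text nodes in the order label, value, label, value, ...
yao = ["Salary:", "10k-15k", "Experience:", "1-3 years", "Education:", "Bachelor"]

# yao[::2] holds the labels and yao[1::2] the values; zip pairs them up.
requirements = dict(zip(yao[::2], yao[1::2]))

# urlencode percent-encodes user input (including Chinese) for the query string.
query = urlencode({"jobKw": "数据", "cityKw": "郑州", "n": 1})
list_url = "https://www.jobui.com/jobs?" + query

# urljoin resolves a relative href against the site root (the path is invented).
detail_url = urljoin("https://www.jobui.com", "/job/123456/")

print(requirements)
# {'Salary:': '10k-15k', 'Experience:': '1-3 years', 'Education:': 'Bachelor'}
print(list_url)
# https://www.jobui.com/jobs?jobKw=%E6%95%B0%E6%8D%AE&cityKw=%E9%83%91%E5%B7%9E&n=1
print(detail_url)
# https://www.jobui.com/job/123456/
```

The zip pairing only works while the text nodes strictly alternate. If a label has no value, or whitespace-only nodes are mixed in, every later pair shifts, so filtering out blank strings first is a sensible precaution.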

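BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser whose main job is parsing and extracting data from HTML/XML.

lxml only traverses locally, while Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory costs are much higher and its performance is lower than lxml's.

BeautifulSoup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors, the HTML parser in the Python standard library, and the lxml parsers, including lxml's XML parser.

Beautiful Soup 3 is no longer developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Tool	Speed	Difficulty of use	Difficulty of installation
Regex	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate

Example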

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


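soup = BeautifulSoup(html, 'lxml')
# Name the parser explicitly ('lxml' here, or the built-in 'html.parser'); leaving it out triggers a warning

print(soup.prettify())
# Print the content of soup, nicely formatted

The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

Tag

A Tag corresponds to a tag in the HTML. For example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

In the snippet above, head, title, a, p and b are each a Tag.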

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

soup.title
# <title>The Dormouse's story</title>

soup.head
# <head><title>The Dormouse's story</title></head>

type(soup.title)
# bs4.element.Tag

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

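With soup plus a tag name we can easily get a tag and its content. These objects are of type bs4.element.Tag. This only finds the first tag that matches, though. To find all matching tags, use find_all or select.

A Tag has two important attributes, name and attrs.

soup.name
# The soup object itself is special: its name is '[document]'

soup.title.name
# 'title', the name of this node

soup.p.attrs
# The attributes and attribute values of the first p node
# {'class': ['title'], 'name': 'dromouse'}

soup.p['class']
# The class attribute value of the p node
# ['title']

soup.p['class'] = "newClass"
soup.p['class']
# Changes the class attribute of the p node to newClass
# 'newClass'

del soup.p['class']
soup.p
# Deletes the class attribute of the p node
# <p name="dromouse"><b>The Dormouse's story</b></p>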

NavigableString

To get the value of a tag (the text inside it), just add .string:

soup.p.string
# "The Dormouse's story"

type(soup.p.string)
# bs4.element.NavigableString

BeautifulSoup

A BeautifulSoup object represents the content of a whole document. It can be treated as a special kind of Tag.

type(soup.name)
# str

soup.name
# '[document]'

soup.attrs
# {}

Comment

A Comment object is a special kind of NavigableString. Its output does not include the comment markers.

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

soup.a.string
# ' Elsie '

type(soup.a.string)
# bs4.element.Comment

Traversing the document tree

Direct children: contents and children

.contents

A tag's contents attribute returns the tag's child nodes as a list.

soup.head.contents
# [<title>The Dormouse's story</title>]

.children

This does not return a list but an iterator.

soup.head.children
# <list_iterator at 0x114c2c8>

for i in soup.body.children:
    print(i)
# <p name="dromouse"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>

All descendants: descendants

.contents and .children only contain a tag's direct children. .descendants iterates recursively over all of a tag's descendants. As with children, we need to loop over it to get the content.

for i in soup.descendants:
    print(i)

Node content: string

If a tag has only one child and that child is a NavigableString, .string returns that child. If a tag has only one child tag, .string returns the same result as that child's .string. If a tag has more than one child, .string returns None.

Put simply: if a tag contains no child tags, .string returns the text inside it. If it contains exactly one child tag, .string returns the innermost text.

soup.head.string
# "The Dormouse's story"

soup.title.string
# "The Dormouse's story"
Searching the document tree

find_all(name, attrs, recursive, text, **kwargs)

Returns a list of everything that matches. In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.

The name argument

The name argument finds all tags whose name is name. String objects (text nodes) are ignored automatically.

Passing a string

The simplest filter is a string: Beautiful Soup finds every tag whose name equals that string.

soup.find_all('b')
# [<b>The Dormouse's story</b>]

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing a regular expression

If a compiled regular expression is passed, Beautiful Soup filters tag names with the pattern's search() method.

import re

for i in soup.find_all(re.compile(r'^b')):
    print(i.name)

# body
# b

Passing a list

If a list is passed, Beautiful Soup returns everything that matches any element of the list.

soup.find_all(['a','b'])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Keyword arguments

Nodes can be filtered by attribute name and attribute value.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The text argument

The text argument searches the text content of the document. Like name, it accepts a string, a regular expression, or a list.

soup.find_all(text=' Elsie ')
# [' Elsie ']

soup.find_all(text=["Tillie", " Elsie ", "Lacie"])
# [' Elsie ', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
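A few more find_all options that come up often in scrapers: class_ (because class is a Python keyword), an attrs dict, limit, and find for a single result. The sketch assumes bs4 is installed and uses a shortened version of the sample document with the built-in html.parser.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so the keyword is spelled class_.
sisters = soup.find_all("a", class_="sister")

# attrs takes a dict, useful for attribute names that are not valid keywords.
by_attrs = soup.find_all("a", attrs={"id": "link2"})

# limit stops the search after the given number of results.
first_two = soup.find_all("a", limit=2)

# find() returns the first match, or None when nothing matches.
first = soup.find("a")
missing = soup.find("table")

# string= is the newer name of the text= argument.
starts_with_l = soup.find_all(string=re.compile("^L"))

print(len(sisters))      # 3
print(by_attrs[0]["href"])  # http://example.com/lacie
print([a["id"] for a in first_two])  # ['link1', 'link2']
print(first["id"], missing)  # link1 None
print(starts_with_l)     # ['Lacie']
```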

CSS selectors

CSS selectors work much like find_all.

In CSS, a tag name is written as-is, a class name is prefixed with ., and an id is prefixed with #.
The same notation can be used to filter elements here. The method is soup.select(), and it returns a list.

Finding by tag name

soup.select('title')
# [<title>The Dormouse's story</title>]

soup.select('b')
# [<b>The Dormouse's story</b>]

Finding by class name

soup.select('.sister')
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding by id

soup.select('#link3')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Combined lookups

Combined lookups follow the same rules as combining tag names, class names and ids in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two with a space:

soup.select('p #link1')
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Child and descendant lookups: > selects direct children only, while a space selects descendants at any depth

soup.select('head > title')
# [<title>The Dormouse's story</title>]

soup.select('head title')
# [<title>The Dormouse's story</title>]

Finding by attribute

An attribute condition can be added in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them, otherwise nothing will match. This is similar to the attribute syntax in XPath.

soup.select('a[class="sister"]')
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Attribute conditions can be combined with the lookups above: parts that refer to different nodes are separated by a space, parts that refer to the same node are not.

soup.select('p a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Getting the content

select always returns a list of Tag objects. Use get_text() to get their content.

type(soup.select('title')[0])
# bs4.element.Tag

soup.select('a[id="link2"]')[0].get_text()
# 'Lacie'
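When only one element is expected, select_one saves the [0] indexing, and attribute values can be read with tag['name'] or tag.get('name'). A minimal sketch, assuming a reasonably recent bs4 release and the built-in html.parser, on a shortened copy of the sample document:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
link = soup.select_one("a#link2")
text = link.get_text()

# Indexing raises KeyError if the attribute is missing; .get() returns None.
href = link["href"]
title_attr = link.get("title")

# Collect one attribute from every match.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# Attribute selectors also support suffix matching with $=.
tillie = soup.select('a[href$="tillie"]')

print(text)        # Lacie
print(href)        # http://example.com/lacie
print(title_attr)  # None
print(hrefs)       # ['http://example.com/lacie', 'http://example.com/tillie']
print(tillie[0]["id"])  # link3
```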

Scraping job listings with BeautifulSoup4

from bs4 import BeautifulSoup
import requests
import time

header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'
          ,'Cookie' : 'AXUDE_jobuiinfo=xPq3I7eEuU; jobui_p=1603107811866_15411384; TN_VisitCookie=95; jobui_user_passport=yk160310782641343; isloginStatus=hQDDe3yhKHw%3D; jobui_user_searchURL=http%3A%2F%2Fm.jobui.com%2Fjobs%3FjobKw%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%25E5%25B8%2588%26cityKw%3D%25E5%258C%2597%25E4%25BA%25AC%26experienceType%3D1-3%25E5%25B9%25B4%26sortField%3Dlast; jobui_area=%25E9%2583%2591%25E5%25B7%259E; Hm_lvt_8b3e2b14eff57d444737b5e71d065e72=1619854217; Hm_lpvt_8b3e2b14eff57d444737b5e71d065e72=1619854229; TN_VisitNum=2; PHPSESSID=ov59buf4dfqo8bmpbgjk06s0f0'}
url = 'https://www.jobui.com/jobs?cityKw=%E9%83%91%E5%B7%9E&jobKw=%E6%95%B0%E6%8D%AE'
resp = requests.get(url,headers=header)
soup = BeautifulSoup(resp.text,'lxml')

link = soup.select('a[class="job-name"]')

k = 1

for i in link:
    url0 = 'https://www.jobui.com'+i['href']
    res0 = requests.get(url0,headers=header)
    soup0 = BeautifulSoup(res0.text, 'lxml')
    try:
        title = soup0.select('div[class="jk-box jk-matter j-job-detail"] > h1')[0].get_text()

        yaoqiu = soup0.select('ul[class="laver cfix fs16"] > li')
        yq = ''
        for j in yaoqiu:
            yq += (j.get_text()+"\n")

        zhize = soup0.select('div[class="hasVist cfix sbox fs16"]')[0].get_text().replace('\r\n\t                    ',"")
    except IndexError:
        continue

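    # Append this job's title, requirements and responsibilities to job.txt
    with open('job.txt','a',encoding='utf-8') as f:
        f.write(title+"\n\nRequirements: "+yq+"\nResponsibilities: "+zhize+'\n-----------------------\n')

    print('Finished '+str(k)+" job(s)")
    time.sleep(1)
    k += 1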
