Using the Basic Libraries for Web Crawling

狒狒fei狒

Published 2021-12-15 20:01:36

Article link: https://blog.csdn.net/m0_61597961/article/details/121940735

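Using the basic libraries

The first step in writing a crawler is to imitate a browser sending requests, that is, GET and POST. Python offers full-featured libraries for this; the most basic are urllib and requests. Only requests is covered here.

Using requests

GET requests

Sending a GET request with requests is simple. Here we fetch the Baidu home page.

Code: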

import requests
url="https://www.baidu.com/?tn=02003390_63_hao_pg"
resp=requests.get(url)
resp.encoding="utf-8"
print(resp.text)

Output:

<!DOCTYPE html>

<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');

                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

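The argument to get() is the URL. Setting resp.encoding (an attribute, not a method) changes the encoding used to decode resp.text. requests guesses the encoding from the response headers, and when that guess differs from the encoding the page was actually written in (UTF-8 for Baidu), the text comes out garbled. The garbling is simply a mismatch between the encoding and the decoding.

The URL above ends with a query parameter. We can pass it separately through the params argument instead of writing it into the URL by hand.

Code: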

import requests
url="https://www.baidu.com"
data={
	"tn":"02003390_63_hao_pg"
}
resp=requests.get(url,params=data)
resp.encoding="utf-8"
print(resp.text)

With several parameters, the URL contains ? and & characters.

For example:

https://cn.bing.com/search?q=CSDN&form=ANNTH1&refig=528bda3661d64c98802b7bd0d54aabb0

requests inserts these ? and & characters automatically.
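To see the query string that requests builds without sending anything over the network, the request can be prepared and its final URL inspected. This is a minimal sketch, assuming the Bing search URL and two of the parameters from the example above:

```python
import requests

# Parameters are given as a plain dict; requests adds "?" and "&" itself.
params = {"q": "CSDN", "form": "ANNTH1"}

# Build the request without sending it, then look at the final URL.
prepared = requests.Request("GET", "https://cn.bing.com/search", params=params).prepare()
print(prepared.url)  # https://cn.bing.com/search?q=CSDN&form=ANNTH1
```

Passing the same dict to requests.get(url, params=params) produces the same URL and sends the request.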

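Next we crawl Zhihu. Unlike Baidu, Zhihu requires headers that include a User-Agent; without it the request is refused, which is one kind of anti-crawling measure. The User-Agent can be found in the request headers.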
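A minimal sketch of sending a User-Agent header might look like this. The User-Agent string is the one used in the later examples of this article; fetch_page is a hypothetical helper name, and whether Zhihu accepts the request may depend on the site's current rules.

```python
import requests

# A browser-like User-Agent copied from the browser's request headers.
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}

def fetch_page(url):
    """Send a GET request with the custom headers and return the page text."""
    resp = requests.get(url, headers=headers)
    return resp.text

# Preparing the request (without sending it) shows the header that would go out.
prepared = requests.Request("GET", "https://www.zhihu.com/explore", headers=headers).prepare()
print(prepared.headers["user-agent"])
```

Calling fetch_page("https://www.zhihu.com/explore") would perform the actual request.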

Images, audio and video files are all binary data, so to download them we need their raw bytes.

We use the Baidu icon as an example:

Code:

import requests
url="https://www.baidu.com/favicon.ico"
#headers={
#	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
#}
resp=requests.get(url)
print(resp.text)

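Output: unreadable garbled characters, because the icon is binary data and resp.text tries to decode it as text.

Printing resp.content, which holds the raw bytes, shows the binary data instead.

Code: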

import requests
url="https://www.baidu.com/favicon.ico"
#headers={
#	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
#}
resp=requests.get(url)
print(resp.content)

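Output: a bytes object (printed with a b'...' prefix).

Next we save the image. After running the code, the file appears in the working directory and contains the icon.

Code: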

import requests
url="https://www.baidu.com/favicon.ico"
#headers={
#	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
#}
resp=requests.get(url)
with open ('baidu.ico','wb') as f:
	f.write(resp.content)

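This uses the open function:

The open function

file_object = open('file_name', 'mode')

file_name: the name of the file to open. Include the file extension, which indicates the file type.

mode: optional; the default is 'r'.

'w': write mode, used to write new content to a file. (Any existing file with the same name is emptied first, and the new content is then written.)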

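Creating a text file

file = open('feifei.txt', 'w')
file.write('Hello World\n')
file.write('This is our new text file\n')
file.write('and this is another line. \n')
file.write('Why? Because we can. \n')
file.close()

A text file named feifei.txt then appears in the working directory, containing:

Hello World
This is our new text file
and this is another line.
Why? Because we can.

Closing the file

When you are done, call file.close() to finish the operation and release the resources held by the file.

Reading: reading a txt file in Python

'r': read-only mode, used when the file only needs to be read.

To print the entire contents of a txt file, read it first and then print:

file = open('feifei.txt', 'r')
print(file.read())

This displays all the text in the file.

Another way to read a file is to read a given number of characters.

For example, the code below reads the first 5 characters stored in the text file:

file = open('feifei.txt', 'r')
print(file.read(5))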

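with open('file_name', 'w') as f:
    ...

is equivalent to

f = open('file_name', 'w')
...
f.close()

except that the with form closes the file automatically, even if an error occurs inside the block.

'wb' means writing binary data.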
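The equivalence between the two forms can be checked with a short sketch. The file name demo.txt and the temporary directory are assumptions made only so the example leaves no files behind.

```python
import os
import tempfile

# Write to a temporary directory so the demo leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Form 1: the with statement closes the file when the block ends.
with open(path, "w") as f:
    f.write("Hello World\n")
print(f.closed)  # True

# Form 2: roughly what the with statement does behind the scenes.
f = open(path, "a")
try:
    f.write("second line\n")
finally:
    f.close()  # runs even if write() raises

with open(path, "r") as f:
    lines = f.read().splitlines()
print(lines)  # ['Hello World', 'second line']
```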

Cookies

Getting cookies

We can read the cookies attribute of the response directly to get the cookies.

Code:

import requests
url="https://www.zhihu.com/explore"
headers={
	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
resp=requests.get(url,headers=headers)
print(resp.cookies)
print("\n")
for key,value in resp.cookies.items():
	print(key + '=' + value)

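The cookies can also be taken directly from the request headers.

Session persistence

Consider this: after you log in to a website in a browser, you can click again and see your own account information. That takes at least two requests (in fact more, since many requests happen out of sight), and the cookies stored on your computer tie all of them to the same session on the server, so the server knows you are logged in and gives you what you asked for. When a crawler sends GET or POST requests, however, each request is a completely unrelated session, so we have to set cookies to keep the same session. The clumsy way is to set the same cookies on both requests by hand. That works, but there is a simpler way: the Session object.

My own knowledge is limited and I do not understand Session well enough yet, so I will stop here (QAQ) and add more once I have learned it.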
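For readers who want a starting point, here is a minimal sketch of a requests Session. A Session keeps cookies and default headers between requests, so values set once are reused. The cookie name and value below are made up for illustration, and the example only prepares a request without sending it.

```python
import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it.
session.headers.update({"user-agent": "Mozilla/5.0"})

# Cookies the server sets are stored in session.cookies automatically;
# here a made-up cookie is added by hand to stand in for a login cookie.
session.cookies.set("sessionid", "abc123")

# Prepare (but do not send) a request to see what the session attaches to it.
request = requests.Request("GET", "https://www.zhihu.com/explore")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))
print(prepared.headers.get("user-agent"))
```

In real use, session.get(...) and session.post(...) take the place of requests.get(...) and requests.post(...), and cookies received in one response are sent along with later requests.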

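SSL certificate verification

When we send a request, the site's SSL certificate is checked. The verify parameter controls whether the check is performed; it defaults to True.

Code: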

import requests
url="https://www.zhihu.com/explore"
headers={
	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
resp=requests.get(url,headers=headers,verify=False)
print(resp.text)

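If the site's certificate is invalid, the request fails with an error; setting verify to False lets it go through. A warning (InsecureRequestWarning) is then printed advising that certificate verification be enabled; for a quick experiment it can be ignored.

Proxy settings

A proxy only needs the proxies parameter. Note that the keys of proxies are protocols (http, https): requests picks the proxy whose key matches the protocol of the URL being requested.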

import requests
url="https://www.zhihu.com/explore"
headers={
	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
proxies={
	"http":"http://10.11.1.10:3255",
	"https":"http://10.11.1.10:3255"
}
resp=requests.get(url,headers=headers,proxies=proxies)
print(resp.text)

Timeout settings

The timeout parameter limits how long, in seconds, we wait for the server to respond after sending the request, as follows:

import requests
url="https://www.zhihu.com/explore"
headers={
	"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
proxies={
	"http":"http://10.11.1.10:3255",
	"https":"http://10.11.1.10:3255"
}
resp=requests.get(url,headers=headers,proxies=proxies,timeout=1)
print(resp.text)

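If no response arrives within one second, an exception is raised.

timeout defaults to None. If it is left unset, the request waits with no time limit, so it can hang indefinitely when the server never responds.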
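The timeout exception can be caught so that the crawler keeps running. The sketch below is one way to try this without depending on a real website: it opens a local socket that accepts connections but never answers, which is purely a demo device. The helper name fetch and the (connect, read) timeout values are assumptions chosen for illustration.

```python
import socket
import requests

# A local socket that accepts connections but never replies (demo only).
silent = socket.socket()
silent.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
silent.listen(1)
url = f"http://127.0.0.1:{silent.getsockname()[1]}/"

def fetch(url, timeout):
    """Return the response text, or None if no response arrives in time."""
    session = requests.Session()
    session.trust_env = False   # ignore proxy environment variables for this local demo
    try:
        return session.get(url, timeout=timeout).text
    except requests.exceptions.Timeout:
        return None

# timeout may be a single number or a (connect, read) pair of seconds.
result = fetch(url, timeout=(1, 0.5))
silent.close()
print(result)
```

Because the server never sends a response, the read timeout expires and fetch returns None instead of letting the exception stop the program.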

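Regular expressions

A regular expression extracts specific text according to a set of rules.

Pattern	Matching rule
\w	Matches letters, digits and underscores
\d	Matches digits
\s	Matches whitespace characters
^	Matches the start of the string
.	Matches any character except a newline
*	Matches zero or more repetitions of the preceding expression

Regular expressions are not unique to Python; other programming languages support them as well. Python's re library provides almost all regular-expression functionality. The most common re functions are introduced below.

match()

match tries to match the regular expression starting at the beginning of the string. If it matches, a match object is returned; otherwise None is returned.

For example: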

import re
content="Hello world 121388 fei fei 12 13 88 ni hao "
result=re.match(r'^Hello\s.....\s\d{4}',content)
print(result)
print(result.group())
print(result.span())

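Output:

<re.Match object; span=(0, 16), match='Hello world 1213'>
Hello world 1213
(0, 16)

We use ^ to match the start of the string, \s to match whitespace, and \d to match digits; {4} after \d means exactly four digits, and each . matches one arbitrary character. The first argument of match() is the regular expression and the second is the string to search.

group() returns the matched text and span() returns the matched range; (0, 16) is the position of the match within the original string.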
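Since match() returns None when the pattern does not fit at the start of the string, it is worth checking the result before calling group(). A minimal sketch using the same content string:

```python
import re

content = "Hello world 121388 fei fei 12 13 88 ni hao "

# "world" is in the string, but not at the start, so match() returns None.
result = re.match(r"world", content)
print(result)  # None

# Check for None before calling group(); calling it on None raises AttributeError.
if result is not None:
    print(result.group())
else:
    print("no match at the start of the string")
```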

We can also wrap the substring we want to extract in parentheses and pass the group index to group() to get it.

For example:

import re
content="Hello world 121388 fei fei 12 13 88 ni hao "
result=re.match(r'^Hello(\s.....)\s\d{4}',content)
print(result)
print(result.group(1))
print(result.span())

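Output:

<re.Match object; span=(0, 16), match='Hello world 1213'>
 world
(0, 16)

There is an even simpler expression that matches any run of characters other than newlines: .*

. matches any single character except a newline, and * repeats the preceding element zero or more times, so .* matches any sequence of characters (excluding newlines).

Code: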

import re
content="Hello world 121388 fei fei 12 13 88 ni hao "
result=re.match(r'.*(\s.....)\s(\d{4})',content)
print(result)
print(result.group(1,2))
print(result.span())

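Output:

<re.Match object; span=(0, 16), match='Hello world 1213'>
(' world', '1213')
(0, 16)

Greedy and non-greedy matching

In greedy mode, .* matches as many characters as possible. For example: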

import re
content="Hello world 121388 fei fei 12 13 88 ni hao "
result=re.match(r'Hello.*(\d)',content)
print(result)
print(result.group(1))
print(result.span())

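Output:

<re.Match object; span=(0, 35), match='Hello world 121388 fei fei 12 13 88'>
8
(0, 35)

Replacing .* with .*? makes the match non-greedy, that is, it matches as few characters as possible.

Code: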

import re
content="Hello world 121388 fei fei 12 13 88 ni hao "
result=re.match(r'Hello.*?(\d)',content)
print(result)
print(result.group(1))
print(result.span())

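Output:

<re.Match object; span=(0, 13), match='Hello world 1'>
1
(0, 13)

Modifiers

Only one modifier is covered here: re.S

It makes . match every character, including newlines. Just pass re.S as an extra argument to the function.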
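The effect of re.S is easiest to see on a string that actually contains a newline. A minimal sketch, with a made-up two-line string:

```python
import re

content = "Hello\nworld 121388"

# Without re.S, "." stops at the newline, so the pattern cannot reach the digits.
print(re.match(r"Hello.*?(\d+)", content))  # None

# With re.S, "." also matches "\n".
result = re.match(r"Hello.*?(\d+)", content, re.S)
print(result.group(1))  # 121388
```

This matters when crawling, since HTML source usually spans many lines.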

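. matches any character except a newline, but what if the string we want to extract itself contains a literal .?

We escape it by putting a backslash in front of the .

For example: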

import re
content="Hello world 1.21388 fei fei 12 13 88 ni hao "
result=re.match(r'Hello.*?\.(\d\d)',content,re.S)
print(result)
print(result.group(1))
print(result.span())

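There is one more special case. In an ordinary Python string literal the backslash is an escape character, so sequences such as \d and \s in a pattern may be interpreted by Python before the re module sees them. Prefixing the string with r (a raw string) avoids this.

For example: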

result=re.match(r'Hello.*?\.(\d\d)',content,re.S)
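The difference between an ordinary string and a raw string shows up clearly with \b, which Python reads as a backspace character but which the re module uses as a word boundary. A small sketch:

```python
import re

content = "Hello world 1.21388 fei fei 12 13 88 ni hao "

# In a normal string, "\b" is the backspace character (one character).
print(len("\b"), len(r"\b"))  # 1 2

# So the regex word boundary \b only reaches the re module in a raw string.
print(re.search("\bfei\b", content))           # None
print(re.search(r"\bfei\b", content).group())  # fei
```

Writing every pattern as a raw string avoids having to remember which escapes Python interprets on its own.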

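search()

search differs from match in that match only tries at the start of the string, whereas search scans the whole string for a substring that fits the expression and returns the first successful match.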
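A short sketch comparing the two on the same string, using the content string from the earlier examples:

```python
import re

content = "Hello world 121388 fei fei 12 13 88 ni hao "

# match() only tries at position 0, where "H" is not a digit.
print(re.match(r"\d+", content))  # None

# search() scans forward and returns the first match it finds.
result = re.search(r"\d+", content)
print(result.group())  # 121388
print(result.span())   # (12, 18)
```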

findall()

Unlike search and match, findall searches the whole string and returns everything that matches the regular expression.
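A minimal sketch of findall() on the same content string. It also shows that when the pattern contains groups, the returned list holds the groups and not the whole matches:

```python
import re

content = "Hello world 121388 fei fei 12 13 88 ni hao "

# Without groups, findall() returns a list of the matched strings.
numbers = re.findall(r"\d+", content)
print(numbers)  # ['121388', '12', '13', '88']

# With several groups, each item is a tuple of the groups.
pairs = re.findall(r"(\w+)\s(\d+)", content)
print(pairs)  # [('world', '121388'), ('fei', '12'), ('13', '88')]
```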

sub()

Code:

import re
content="Hello world 1.21388 fei fei 12 13 88 ni hao "
result=re.sub(r'Hello\s','215',content)
print(result)

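Output:

215world 1.21388 fei fei 12 13 88 ni hao

The first argument of sub() is the pattern to match, the second is the replacement text (pass an empty string to delete the matched text), and the third is the original string.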
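As a small sketch of the deletion case, an empty replacement string removes whatever the pattern matches. The second call uses the count argument of re.sub to limit the number of replacements; the replacement text "FEI" is made up for illustration.

```python
import re

content = "Hello world 1.21388 fei fei 12 13 88 ni hao "

# Replacing every digit with an empty string removes the digits.
no_digits = re.sub(r"\d", "", content)
print(repr(no_digits))

# count=1 replaces only the first occurrence.
first_only = re.sub(r"fei", "FEI", content, count=1)
print(first_only)
```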

compile()

compile turns a regular-expression string into a pattern object so that it can be reused in later matches. compile() also accepts modifiers.

For example:

import re
content1="Hello world 1.21388 fei fei 12 13 88 ni hao "
content2="djwian 515651 wda1515wad"
pattern=re.compile(r".*?(\d\d\d\d)",re.S)
resp1=re.findall(pattern,content1)
resp2=re.findall(pattern,content2)
print(resp1)
print(resp2)

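Output:

['2138']
['5156', '1515']

Finally, one more method: finditer()

finditer() is similar to findall(), except that it returns an iterator.

Use a for loop to extract the contents.

For example: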

import re
content="Hello world 1.21388 fei fei 12 13 88 ni hao "
pattern=re.compile(r".*?(?P<num>\d{5})",re.S)
resp=pattern.finditer(content)
for it in resp:
	print(it.group("num"))

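Note: match() and search() return a match object.

You must call .group() to get the matched text, and the same applies to each item produced by finditer().

One more piece of regular-expression syntax: (?P<name>...) gives a group a name.

For example: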

import re
content="Hello world 1.21388 fei fei 12 13 88 ni hao "
pattern=re.compile(r".*?(?P<num>\d{5})",re.S)
resp=pattern.search(content)
print(resp.group('num'))
