Learning Python Web Scraping: the requests Library and Regular Expressions

Latest recommended article published on 2024-08-26 17:00:25

maizeman126

Category column: python统计分析 (Python statistical analysis). Tags: python, web scraping study

Article link: https://blog.csdn.net/maizeman126/article/details/137605785

Reference: 邓维, python网络爬虫技术与应用 (Python Web Crawler Technology and Applications)

I. The requests library

1. GET requests

(1) Basic GET request

# Basic GET request
import requests

r = requests.get("http://httpbin.org/get")
print(r.text)  # response body decoded as text

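(2) GET request with parameters

# GET request with parameters: the two calls below are equivalent
# Option 1: write the query string directly into the URL
r = requests.get("http://httpbin.org/get?name=williams_z&age=21")
# Option 2: pass a dict via params and let requests build the query string
param = {"name": "williams_z", "age": 21}
r = requests.get("http://httpbin.org/get", params=param)
print(r.text)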
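
To see why the two calls above are equivalent, the following minimal sketch builds the query string by hand with the standard library's urllib.parse.urlencode. It is only an illustration of the encoding step: requests performs its own encoding internally, and the helper name build_url is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (illustrative helper)."""
    return base + "?" + urlencode(params)


param = {"name": "williams_z", "age": 21}
url = build_url("http://httpbin.org/get", param)
print(url)  # http://httpbin.org/get?name=williams_z&age=21
```

Passing a dict is usually the more convenient form, since values containing spaces or other special characters are escaped automatically.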

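(3) Working with JSON

# If the response body is JSON, parse it with the json() method.
# JSON is text-based and easy to read, and also easy for machines to parse and generate.
import requests

r = requests.get("http://httpbin.org/get")
print(r.json())  # parsed into Python objects (here a dict)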
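
What json() does can be tried offline with the standard json module. In the sketch below, sample_body is a made-up, abbreviated stand-in for an httpbin response (the real output has more fields), so treat the field values as assumptions.

```python
import json

# Hypothetical, abbreviated response body; the real httpbin output has more fields.
sample_body = '{"args": {"name": "williams_z", "age": "21"}, "url": "http://httpbin.org/get"}'

data = json.loads(sample_body)  # JSON text -> Python dict
print(type(data).__name__)      # dict
print(data["args"]["name"])     # williams_z
```

Once the body is a dict, individual fields can be read by key instead of searching the raw text.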

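(4) Fetching binary data

# Fetch binary data; mainly used for images, video, and similar files
r = requests.get("https://github.com/favicon.ico")
with open("favicon.ico", "wb") as f:
    f.write(r.content)  # r.content holds the raw bytes of the response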

(5) Adding headers

# Add headers so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52"
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
print(r.text)

2. Advanced operations

(1) File upload

# File upload
with open("favicon.ico", "rb") as f:
    files = {"file": f}  # form field name -> open file object
    r = requests.post("http://httpbin.org/post", files=files)
print(r.text)

(2) Getting cookies

# Get the cookies set by the server
r = requests.get("http://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + "=" + value)

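(3) Certificate verification

# Certificate verification
import requests
import urllib3

urllib3.disable_warnings()  # suppress the warning raised when certificate verification is skipped
r = requests.get("https://www.12306.cn", verify=False)  # verify=False skips certificate checks
print(r.status_code)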

(4) Proxy settings

# Proxy settings: route requests through a proxy listening on the local machine
proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9744",
}
r = requests.get("http://www.taobao.com", proxies=proxies)
print(r.status_code)

(5) Authentication settings

# Authentication: pass the username and password as a tuple
r = requests.get("http://120.27.34.24:9001", auth=("user", "123"))
print(r.status_code)
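
Passing a (user, password) tuple to auth makes requests use HTTP Basic authentication. As a rough sketch of what is sent, the snippet below builds the corresponding Authorization header value with the standard library; the helper name basic_auth_header is hypothetical and this is not the library's own code.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header (illustrative helper)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


header_value = basic_auth_header("user", "123")
print(header_value)  # Basic dXNlcjoxMjM=
```

Base64 is an encoding, not encryption, so Basic authentication only protects the credentials when the request itself goes over HTTPS.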

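A status code of 200 means the request was successfully received, understood, and accepted.

A 1XX status code indicates a provisional (informational) response.

A 3XX status code indicates a redirect.

A 4XX status code indicates an error in the request.

A 5XX status code indicates a server error.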
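
The classes above can be turned into a small lookup. The sketch below is one possible way to do it: describe_status is a hypothetical helper, and the reason phrases come from the standard library's http.HTTPStatus.

```python
from http import HTTPStatus

# First digit of the status code -> class of response
STATUS_CLASSES = {
    1: "provisional response",
    2: "success",
    3: "redirect",
    4: "request error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found: request error'."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not a registered status code
        phrase = "nonstandard code"
    return f"{code} {phrase}: {category}"


for code in (200, 302, 404, 500):
    print(describe_status(code))
```

A crawler could call such a helper on r.status_code before deciding whether to parse the body, retry, or skip the page.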

3. Carrying cookie information in the crawler's request headers makes it possible to access sites that require login directly.

headers = {
    "cookie": "PHPSESSID=68q6d1mi0sr4ecbcpv7ptu9ph0",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52",
}
r = requests.get("https://www.zhihu.com", headers=headers)
print(r.text)
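
A cookie string copied from the browser often contains several name=value pairs. The following sketch splits such a string into a dict; cookie_string_to_dict is a hypothetical helper, and the theme=dark pair is invented for illustration.

```python
def cookie_string_to_dict(cookie_string):
    """Split a 'name=value; name2=value2' cookie string into a dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.strip().split("=", 1)
        cookies[name] = value
    return cookies


# The second pair is a made-up example value.
raw = "PHPSESSID=68q6d1mi0sr4ecbcpv7ptu9ph0; theme=dark"
cookies = cookie_string_to_dict(raw)
print(cookies)
```

The resulting dict can be passed to requests.get through its cookies parameter instead of writing the raw string into the headers.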

II. Regular expressions

1. findall finds all matches and returns a list

import re

list1 = re.findall("m", "mai le go len mei meme")
print(list1)
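
A slightly richer pattern shows how the shape of the findall result depends on the pattern. This is a small sketch on the same sample text: without groups each item is the whole match, while with two capture groups each item is a tuple.

```python
import re

text = "mai le go len mei meme"

# No capture groups: each item is the full match (words starting with "m").
words = re.findall(r"\bm\w*", text)
print(words)  # ['mai', 'mei', 'meme']

# Two capture groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(m)(\w)", text)
print(pairs)  # [('m', 'a'), ('m', 'e'), ('m', 'e'), ('m', 'e')]
```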

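2. search returns as soon as it finds the first match; if nothing matches, it returns None.

ret = re.search(r"\d", "Please leave before 5, otherwise you cannot go until 7")
print(ret.group())  # group() returns the matched text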
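
Because search can return None, calling group() on the result directly raises AttributeError when nothing matches. A minimal defensive sketch, with first_digit as a hypothetical helper name:

```python
import re


def first_digit(text):
    """Return the first digit in text, or None if there is no digit."""
    match = re.search(r"\d", text)
    return match.group() if match else None


print(first_digit("Please leave before 5, otherwise you cannot go until 7"))  # 5
print(first_digit("no digits here"))  # None
```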

3. finditer is similar to findall, except that it returns an iterator.

it = re.finditer("m", "mai le fo len, mai bu mai")
print(it)  # prints the iterator object itself, not the matches
for i in it:
    print(i.group())
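
One practical difference from findall is that finditer yields match objects, which also carry position information. The sketch below collects each match together with its start index; the pattern "mai" is chosen here only for illustration.

```python
import re

text = "mai le fo len, mai bu mai"

# Each item from finditer is a match object, so start() and end() are available.
positions = [(m.group(), m.start()) for m in re.finditer(r"mai", text)]
print(positions)  # [('mai', 0), ('mai', 15), ('mai', 22)]
```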

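4. Example

import re

key = r"<html><body><h1>helloworld</h1></body></html>"
p1 = r"<h1>(.*?)</h1>"  # regular expression rule
pattern = re.compile(p1)  # compile the regular expression
matcher = re.findall(pattern, key)  # search the source text for parts that match the pattern
print(matcher[0])  # findall returns a list, so index 0 to get the value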
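
The ? in (.*?) makes the match non-greedy, which matters as soon as the text contains more than one tag pair. The following sketch compares the two forms on a made-up HTML fragment:

```python
import re

# Hypothetical fragment with two <h1> elements.
html = "<h1>hello</h1><h1>world</h1>"

lazy = re.findall(r"<h1>(.*?)</h1>", html)   # stops at the nearest </h1>
greedy = re.findall(r"<h1>(.*)</h1>", html)  # runs on to the last </h1>

print(lazy)    # ['hello', 'world']
print(greedy)  # ['hello</h1><h1>world']
```

For extracting tag contents, the non-greedy form is usually the one wanted.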