Web Scraping Basics (Part 1)

LCL-2019

Category: Web Scraping Techniques; Tags: python

Article link: https://blog.csdn.net/weixin_43056654/article/details/103919788

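1. Basic URL format
scheme://host[:port]/path/…/[?query-string][#anchor]

scheme: the protocol (for example http or https)
host: the server's IP address or domain name
port: the server's port (optional; when omitted, the scheme's default port is used)
path: the path to the requested resource
query-string: parameters (data sent to the HTTP server)
anchor: the fragment, which identifies a position inside the page; the browser handles it and does not send it to the server

Default port for http: 80
Default port for https: 443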
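
As a rough illustration of the components listed above, the standard library's `urllib.parse` can split a URL into the same pieces. The URL below is made up for the example, and `effective_port` with its `DEFAULT_PORTS` table is a hypothetical helper written for this sketch, not part of any library.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, chosen only so that every component is present.
url = "https://www.example.com:8443/docs/index.html?wd=python&page=2#install"

parts = urlsplit(url)
print(parts.scheme)    # scheme
print(parts.hostname)  # host
print(parts.port)      # port (None when the URL does not spell one out)
print(parts.path)      # path
print(parts.query)     # query-string
print(parts.fragment)  # anchor

# parse_qs turns the query string into a dict that maps each name to a list of values.
query = parse_qs(parts.query)
print(query)

# Hypothetical helper: fall back to the scheme's default port when none is given.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.baidu.com"))
```

Note that `parse_qs` returns lists because a query string may repeat the same parameter name.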

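Request example

GET / HTTP/1.1     # request method + path + protocol version
Host: www.baidu.com    # domain name of the requested server
Connection: keep-alive   # connection handling (persistent connection)
Cache-Control: max-age=0   # caching; max-age=0 forces a cached copy to be revalidated with the server
Upgrade-Insecure-Requests: 1   # the client prefers an HTTPS version of the page
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.3  # identifies the user's browser
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/web   # content types the client accepts
Sec-Fetch-Site: same-origin
Referer: https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=bai   # the page the request came from
Accept-Encoding: gzip, deflate, br  # compression encodings the client accepts
Accept-Language: zh-CN,zh;q=0.9  # languages the browser accepts
Cookie: BIDUPSID=4049831E3DB8DE890DFFCA6103FF02C1; # data set by the server, stored locally and sent back with each request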

Response example

HTTP/1.1 200 OK      # protocol version + response status code
Bdpagetype: 1   # the response headers start here
Bdqid: 0xdbeb11ea000cfef4
Cache-Control: private   # only the client's own cache may store the response; shared caches such as proxies must not
Connection: keep-alive   # keep the connection open (persistent connection)
Content-Encoding: gzip   # the body is gzip-compressed and must be decompressed with gzip
Content-Type: text/html   # media type of the resource (a charset parameter may also appear here)
Cxy_all: baidu+642857607c537ed21fa04bcfb54ff6ee
Date: Thu, 02 Jan 2020 06:32:55 GMT    # when the server generated the response
Expires: Thu, 02 Jan 2020 06:32:51 GMT
Server: BWS/1.1
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=6; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1448_21096_30210_30283_30504; path=/; domain=.ba
Strict-Transport-Security: max-age=172800
Traceid: 1577946775028760116215846779410554093300
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Transfer-Encoding: chunked
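
To make the layout of a response head concrete, here is a minimal sketch that separates the status line from the headers using only the standard library. `split_response_head` is a hypothetical helper, and the raw bytes are a shortened copy of the response example above; in practice a library such as requests does this parsing for you.

```python
import io
from http.client import parse_headers


def split_response_head(raw):
    """Split a raw response head into (version, status code, reason, headers)."""
    fp = io.BytesIO(raw)
    # The first line is the status line: protocol version, status code, reason phrase.
    status_line = fp.readline().decode("iso-8859-1").strip()
    version, code, reason = status_line.split(" ", 2)
    # parse_headers reads header lines up to the blank line that ends the head.
    headers = parse_headers(fp)
    return version, int(code), reason, headers


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Set-Cookie: BDSVRTM=6; path=/\r\n"
    b"Set-Cookie: BD_HOME=0; path=/\r\n"
    b"\r\n"
)

version, code, reason, headers = split_response_head(raw)
print(version, code, reason)
print(headers["content-type"])         # header names are case-insensitive
print(headers.get_all("Set-Cookie"))   # a header may appear more than once
```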

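Response status codes

100 ~ 199   # informational: the server has received part of the request and the client should continue sending the rest
200 ~ 299   # success: the server received the request and completed processing
300 ~ 399   # redirection: the client must take further action (usually request a new URL) to complete the request
400 ~ 499   # client error: the request is invalid; 404 (Not Found) is the most common
500 ~ 599   # server error: 500 (Internal Server Error) is the most common; the server hit an unexpected condition and could not complete the request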
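
The ranges above can be sketched as a small lookup. `status_class` is a hypothetical helper written for this illustration; `http.HTTPStatus` is the standard library's enumeration of status codes and supplies the reason phrases.

```python
from http import HTTPStatus


def status_class(code):
    """Return the general category of an HTTP status code."""
    if 100 <= code <= 199:
        return "informational"
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "redirection"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    raise ValueError(f"not an HTTP status code: {code}")


for code in (200, 302, 404, 500):
    print(code, status_class(code), HTTPStatus(code).phrase)
```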

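Handling encoding with requests

response.text # str (text), decoded using response.encoding
response.encoding = 'utf-8'    # sets the encoding used by response.text; set it before reading response.text
response.content # bytes, the raw undecoded body (images, video, audio)
response.content.decode('utf8')   # decode the bytes yourself; returns a str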
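
The difference between bytes and str can be tried without a network request. In this sketch, `raw` stands in for `response.content`; the sample text and the choice of GBK as the "wrong" codec are assumptions made only for illustration.

```python
text = "居然"
raw = text.encode("utf-8")   # bytes, comparable to response.content

# Decoding with the codec that produced the bytes gives back the original str.
decoded_ok = raw.decode("utf-8")

# Decoding with a different codec does not restore the text (garbled output).
decoded_wrong = raw.decode("gbk", errors="replace")

print(type(raw).__name__, type(decoded_ok).__name__)
print(decoded_ok == text)
print(decoded_wrong == text)
```

This is why garbled text in `response.text` is usually fixed by setting `response.encoding` to the page's real encoding, or by decoding `response.content` explicitly.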

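GET requests

import requests

url = 'url'   # placeholder for the target address
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)'}   # request headers
kw = {'wd': '居然'}    # request parameters, as a dictionary
# params carries the query-string parameters
response = requests.get(url, headers=headers, params=kw)   # headers makes the request look like it comes from a browser

# Commonly used response attributes
response.text   # body as str
response.content   # body as bytes
response.status_code  # status code
response.request.headers  # request headers
response.headers  # response headers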
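
To see roughly how a `params` dictionary ends up in the URL, the standard library's `urlencode` can build the query string by hand. The base address below is an assumption for illustration; requests performs this step itself, so the sketch only shows the result.

```python
from urllib.parse import urlencode

base_url = "https://www.baidu.com/s"   # assumed search endpoint, for illustration only
kw = {"wd": "居然"}

# Non-ASCII values are percent-encoded as UTF-8 bytes.
full_url = base_url + "?" + urlencode(kw)
print(full_url)  # https://www.baidu.com/s?wd=%E5%B1%85%E7%84%B6
```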

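Downloading an image

import requests
# "url" is a placeholder for the image address; headers is the request-header dictionary
response = requests.get("url", headers=headers)
# An image is binary data, so the file must be opened in 'wb' mode
with open('baidu.png', 'wb') as f:
    f.write(response.content)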

POST requests

# data holds the form fields to submit, as a dictionary
data = {
    'from': 'en',
    'to': 'zh',
    'query': 'hello',
    'transtype': 'realtime',
    'simple_means_flag': '3',
    'sign': '679849.965784',
    'token': '4972ace8e57859dbc36306dd1f1dfc83'
}
response = requests.post('url', data=data, headers=headers)

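IP proxy servers

# The request is first sent to the proxy server, which then forwards it to the target server
# proxies is a dictionary that maps each protocol to a proxy address: scheme + IP + port
proxies = {
    'http': 'http://12.34.56.79:9527',
    'https': 'http://12.34.56.79:9527',
}

requests.get('http://www.baidu.com', proxies=proxies)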

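Cookies and sessions

# cookie: stored on the client, holds user information
# session: stored on the server, holds user information
# requests provides a Session class that keeps state between the client and the server across requests
# 1. Create a session:           session = requests.session()
# 2. Send GET/POST requests through the session:      response = session.get(url, headers=headers)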

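Simulated login demo

import requests
session = requests.session()
post_url = "url"   # placeholder for the login address
post_data = {'email': '844297347@qq.com', 'password': 'XXXX'}
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6"}   # User-Agent value is truncated in the source
# Send the POST request through the session; the cookies it receives are stored in the session
session.post(post_url, data=post_data, headers=headers)
# Use the session to request a page that is only accessible after login; the session sends the saved cookies automatically
r = session.get('url', headers=headers)
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(r.text)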

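Adding a cookie through the request headers

# Two ways to reuse a logged-in cookie and fetch the page shown after login:
# 1. put the cookie in the request headers (shown here)
# 2. pass the cookie through the cookies parameter
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6",   # truncated in the source
    "Cookie": "anonymid=joodgrmgdmis8y; depovince=GW; _r01_=1; JSESSIONID=ab",   # truncated in the source
}
r = requests.get('http://www.renren.com/474133869/profile', headers=headers)
with open('renren2.html', 'w', encoding='utf-8') as f:
    f.write(r.text)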

Adding a cookie through the cookies parameter

import requests
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6"}   # truncated in the source
# Cookie string copied from the browser (truncated in the source)
cookie = 'anonymid=joodgrmgdmis8y; depovince=GW; _r01_=1; JSESSIONID=ab'
# Convert the cookie string into a dictionary, e.g. {'anonymid': 'joodgrmgdmis8y', 'depovince': 'GW', ...}
cookie = {i.split("=")[0]: i.split("=")[1] for i in cookie.split("; ")}
# Request a page that is only accessible after login, passing the cookies explicitly
r = requests.get('http://www.renren.com/474133869/profile', headers=headers, cookies=cookie)
with open('renren2.html', 'w', encoding='utf-8') as f:
    f.write(r.text)
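
The string-to-dictionary conversion used above can be pulled into a small function and tried without any network access. `cookie_str_to_dict` is a hypothetical helper name; compared with the one-line comprehension, this sketch splits on the first `=` only, so a value that itself contains `=` is kept whole.

```python
def cookie_str_to_dict(cookie_str):
    """Convert 'name1=value1; name2=value2' into a dictionary."""
    cookies = {}
    for pair in cookie_str.split("; "):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, so '=' inside the value is preserved.
        name, value = pair.split("=", 1)
        cookies[name.strip()] = value
    return cookies


cookie_str = "anonymid=joodgrmgdmis8y; depovince=GW; _r01_=1"
print(cookie_str_to_dict(cookie_str))
```

The resulting dictionary is the shape expected by the `cookies` parameter of `requests.get`.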

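requests tips

# Convert a cookie object into a dictionary
# requests.utils.dict_from_cookiejar
import requests
response = requests.get("url")
print(response.cookies)
cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)  # CookieJar -> dictionary
requests.utils.cookiejar_from_dict(cookie_dict)     # dictionary -> CookieJar

# SSL certificate verification
response = requests.get("url", verify=False)   # skip SSL certificate verification (insecure; use only for sites you trust)

# Set a request timeout (in seconds)
response = requests.get(url, timeout=10)

# Check the status code to confirm the request succeeded; the assertion fails unless the status code is 200
assert response.status_code == 200

# URL encoding and decoding (s is a placeholder string)
requests.utils.unquote(s)   # decode a percent-encoded string
requests.utils.quote(s)    # percent-encode a string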
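
URL encoding and decoding can also be tried with the standard library alone: `urllib.parse` provides `quote` and `unquote` functions with the same names. The sample strings are assumptions for illustration.

```python
from urllib.parse import quote, unquote

encoded = quote("居然")       # percent-encode the UTF-8 bytes
decoded = unquote(encoded)    # reverse the encoding

print(encoded)  # %E5%B1%85%E7%84%B6
print(decoded)

# By default '/' is treated as safe and left alone; pass safe="" to encode it as well.
print(quote("a b/c"))            # a%20b/c
print(quote("a b/c", safe=""))   # a%20b%2Fc
```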

# retrying: resend a request that timed out or failed
# Install first: pip install retrying
import requests
from retrying import retry
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) Appl'}   # truncated in the source

@retry(stop_max_attempt_number=3)    # decorator form: give up after 3 attempts in total
def _parse_url(url):
    print('How many times did this run?')
    response = requests.get(url, headers=headers, timeout=3)
    assert response.status_code == 200
    return response.content.decode()

def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(parse_url(url))
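
To show what a retry decorator does conceptually, here is a minimal standard-library sketch. `simple_retry` is a hypothetical stand-in written for this illustration and is not how the retrying package is implemented; the flaky function simulates a request that fails twice before succeeding.

```python
import functools


def simple_retry(max_attempts=3):
    """Call the wrapped function again on failure, up to max_attempts calls in total."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    last_error = error  # remember the failure and try again
            raise last_error  # every attempt failed
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(max_attempts=3)
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = flaky()
print(result, calls["count"])  # ok 3
```

If every attempt fails, the last exception propagates to the caller, which is why the outer function in the example above wraps the call in try/except.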

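Data categories (structured and unstructured data)

# Unstructured data: usually extracted with regular expressions or XPath
# Example: HTML

# Structured data: converted into Python data types
# Examples: JSON, XML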

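Converting between JSON strings and Python data types
JSON strings must use double quotes

json.loads()   # JSON string -> Python data type
json.dumps()    # Python data type -> JSON string

json.load()    # read JSON from a file-like object and convert it to a Python data type
json.dump()     # convert a Python data type to JSON and write it to a file-like object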
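
The four functions above can be exercised together in a short round trip. The sample data is made up, and `io.StringIO` stands in for a real file so the sketch runs without touching the disk.

```python
import io
import json

data = {"name": "居然", "tags": ["python", "spider"], "count": 3}

# dumps / loads work with strings.
json_str = json.dumps(data, ensure_ascii=False)   # keep non-ASCII characters readable
restored = json.loads(json_str)

# dump / load work with file-like objects.
buffer = io.StringIO()
json.dump(data, buffer, ensure_ascii=False)
buffer.seek(0)                                     # rewind before reading back
from_file = json.load(buffer)

# JSON requires double quotes; a single-quoted string is rejected.
try:
    json.loads("{'name': 'x'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False

print(json_str)
print(restored == data, from_file == data, single_quotes_ok)
```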
