Python Web Scraping Basics (1): Network Requests

Story–teller

Tags: python, web scraping basics

Article link: https://blog.csdn.net/qq_42019407/article/details/103007401

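URL structure

scheme://host:port/path?query-string=xxx#anchor

scheme: the protocol used for access, usually http or https, sometimes ftp and others.

host: the host name or domain name, for example www.baidu.com

port: the port number; it can be omitted when the scheme's default port is used.

path: the path to the resource.

query-string: the query string.

anchor: the anchor (fragment). The back end can usually ignore it; the front end uses it to jump to a position within the page.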
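To make the components concrete, here is a minimal sketch that splits a URL with the standard library. The URL is a made-up example (www.example.com, port 8443) chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL chosen so that every component is present
url = "https://www.example.com:8443/search/results?wd=python&page=2#top"
parts = urlsplit(url)

print("scheme:      ", parts.scheme)
print("host:        ", parts.hostname)
print("port:        ", parts.port)
print("path:        ", parts.path)
print("query-string:", parts.query)
print("anchor:      ", parts.fragment)
```

Note that the anchor is introduced by # and comes after the query string.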

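HTTP and HTTPS

HTTP: short for HyperText Transfer Protocol, a method for publishing and receiving HTML pages. The default server port is 80.

HTTPS: the encrypted version of HTTP, which adds an SSL/TLS layer beneath HTTP. The default server port is 443.

HTTP/1.1 defines eight request methods; GET and POST are the most commonly used.

GET: generally only retrieves data from the server and has no effect on server resources.

POST: generally sends data to the server (for example, when logging in) and changes data on the server.

(Sites with anti-scraping measures do not always follow this convention, so check what a given site actually expects rather than assuming.)

In HTTP, data sent to the server can be carried in three places: the URL, the request body, and the request headers.

Three request headers are commonly set when scraping: User-Agent, Referer, and Cookie.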
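As a rough illustration of the three places data can travel, the sketch below builds (but does not send) a request with urllib. Every value in it (the example.com URL, the form field, the header values) is a placeholder invented for the example.

```python
from urllib import parse, request

# All values below are hypothetical placeholders for illustration.
query = parse.urlencode({"page": 1})                      # 1) data in the URL
body = parse.urlencode({"user": "demo"}).encode("utf-8")  # 2) data in the body
headers = {                                               # 3) data in the headers
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    "Referer": "http://www.example.com/",
    "Cookie": "sessionid=abc123",
}

# Building a Request does not send anything; it only describes the request.
req = request.Request(
    url="http://www.example.com/login?" + query,
    data=body,
    headers=headers,
)

print(req.get_method())  # POST, because a body was supplied
print(req.full_url)
print(req.data)
print(req.header_items())
```

urllib switches the method from GET to POST as soon as data is supplied, which matches the usual split between the two methods.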

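Common status codes

200: the request succeeded and the server returned data normally.

301: permanent redirect (for example, visiting https://www.bing.com/ redirects to https://cn.bing.com/).

302: temporary redirect (for example, visiting a page that requires login while not logged in sends you back to the login page).

403: the server refuses access.

404: the requested URL cannot be found on the server.

500: internal server error; the server may have hit a bug.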
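The codes above fall into ranges by their first digit. A small sketch of how a crawler might group them is shown below; the classify helper is a hypothetical name introduced here, while HTTPStatus comes from the standard library.

```python
from http import HTTPStatus

def classify(code: int) -> str:
    """Return a coarse category for an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"

for code in (200, 301, 302, 403, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase, "->", classify(code))
```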

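The urllib library

urllib can imitate browser behavior: it sends requests to a server and can save the data the server returns.

The urlopen function

In Python 3's urllib, everything network-related is collected in urllib.request.

from urllib import request
# Open the URL and get back a response object
resp = request.urlopen('http://www.baidu.com')
# read() returns the response body as bytes
print(resp.read())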
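In practice it is often convenient to close the response automatically and decode the bytes to str. The following is one possible sketch; fetch_text is a hypothetical helper name, and the data: URL is used only so the demonstration runs without network access.

```python
from urllib import request

def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the body decoded to str."""
    # The with-block closes the response even if reading fails
    with request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()  # bytes
        # Use the charset declared in Content-Type, falling back to UTF-8
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")

# A data: URL is handled locally by urllib, so this line needs no network.
# For a real page, call e.g. fetch_text('http://www.baidu.com') instead.
print(fetch_text("data:text/plain;charset=utf-8,hello%20spider"))
```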

The urlretrieve function: downloading files

from urllib import request
# Download the page and save it to the local file baidu.html
request.urlretrieve('http://www.baidu.com','baidu.html')

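The urlencode and parse_qs functions: manual encoding and decoding

When a request is sent from a browser, the browser automatically encodes any Chinese or other special characters in the URL. When a request is sent from code, the encoding has to be done by hand. urlencode converts a dictionary into a URL-encoded query string, and parse_qs decodes such a string back into a dictionary.

from urllib import parse
params = {'name': '张三','age':16,'greet':'hello world!'}
# Encode the dictionary as a query string
result=parse.urlencode(params)
print(result)
# Decode the query string; every value comes back as a list of strings
result2 = parse.parse_qs(result)
print(result2)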
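urlencode works on dictionaries. For a single string, the same module offers quote and unquote. This is a small supplementary sketch and not part of the original walkthrough.

```python
from urllib import parse

keyword = "张三"
encoded = parse.quote(keyword)
print(encoded)                 # %E5%BC%A0%E4%B8%89
print(parse.unquote(encoded))  # 张三

# The encoded keyword can then be placed in a URL by hand
url = "http://www.baidu.com/s?wd=" + encoded
print(url)
```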

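The urlparse and urlsplit functions

The two are almost identical; the only difference is that the result of urlparse has an extra params attribute.

from urllib import parse

url = 'http://www.baidu.com/s?wd=huabei&name=zhangsan#2'
# Split the URL into its components in two ways
result = parse.urlparse(url)
result2=parse.urlsplit(url)
print(result)
print(result2)
# Individual components are available as attributes
print('scheme:', result.scheme)
print('netloc:', result.netloc)

Output

ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='', query='wd=huabei&name=zhangsan', fragment='2')
SplitResult(scheme='http', netloc='www.baidu.com', path='/s', query='wd=huabei&name=zhangsan', fragment='2')
scheme: http
netloc: www.baidu.com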
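The params attribute only matters when the last path segment contains a semicolon. The sketch below uses a URL made up for this purpose to show where the two functions diverge.

```python
from urllib import parse

# Hypothetical URL with ';hello' attached to the last path segment
url = "http://www.baidu.com/s;hello?wd=python#1"
parsed = parse.urlparse(url)
split = parse.urlsplit(url)

print(parsed.path, parsed.params)  # /s hello
print(split.path)                  # /s;hello
print(hasattr(split, "params"))    # False
```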

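ProxyHandler (proxy settings)

Many sites monitor how many times a given IP accesses them within a period (through traffic statistics, system logs, and so on). If the number of visits does not look like a normal person's, the site blocks that IP.

To cope with this, set up several proxy servers and switch to a different one every so often; even if one IP is blocked, crawling can continue from another.

In urllib, a proxy server is configured through ProxyHandler.

1. How a proxy works: instead of requesting the target site directly, the code sends the request to the proxy server, the proxy server requests the target site, and once it has the target site's data it forwards that data back to our code.

2. http://httpbin.org/ is a convenient site for inspecting the parameters of an HTTP request.

3. Using a proxy in code

* Create a urllib.request.ProxyHandler and pass it the proxy as a dictionary. The key is the scheme of the requests that should go through the proxy, usually 'http' or 'https', and the value is 'ip:port'.

* Use the handler from the previous step with request.build_opener to create an opener object.

* Call the open function of that opener to send the request.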

For example:

from urllib import request

# Without a proxy
# url='http://httpbin.org/ip'
# resp=request.urlopen(url)
# print(resp.read().decode('utf-8'))

# With a proxy
url='http://httpbin.org/ip'
# 1. Pass the proxy to ProxyHandler to build a handler (the key is 'http', with no trailing colon)
handler=request.ProxyHandler({"http":"36.7.89.233:8060"})
# 2. Use the handler to build an opener
opener = request.build_opener(handler)
# 3. Use the opener to send the request
resp=opener.open(url)
print(resp.read())
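Switching proxies every so often, as suggested earlier, could be sketched roughly as follows. The proxy addresses are placeholders from the 192.0.2.0/24 documentation range, and PROXY_POOL, build_proxy_opener, and fetch_with_random_proxy are hypothetical names introduced for this example.

```python
import random
from urllib import request

# Hypothetical proxy addresses (192.0.2.0/24 is reserved for documentation);
# replace them with proxies that actually work.
PROXY_POOL = [
    "192.0.2.10:8080",
    "192.0.2.11:8080",
    "192.0.2.12:8080",
]

def build_proxy_opener(proxy: str) -> request.OpenerDirector:
    """Build an opener that routes http and https requests through proxy."""
    handler = request.ProxyHandler({"http": proxy, "https": proxy})
    return request.build_opener(handler)

def fetch_with_random_proxy(url: str, timeout: float = 10.0) -> bytes:
    """Pick a proxy at random from the pool and fetch url through it."""
    opener = build_proxy_opener(random.choice(PROXY_POOL))
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()

# Building an opener sends nothing; only opener.open() touches the network.
opener = build_proxy_opener(PROXY_POOL[0])
print(type(opener).__name__)
```

A real crawler would probably also drop proxies that fail and retry with another one; that is left out here to keep the sketch short.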

Some free proxy sites

Kuaidaili (快代理): http://www.kuaidaili.com/

Dailiyun (云代理): http://www.dailiyun.com/

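Cookie

HTTP requests are stateless: on its own, the server cannot tell that two requests come from the same user. Cookies were introduced to solve this. When the user first logs in, the server returns some data (the cookie) to the browser, and the browser stores it locally. On later requests the browser automatically sends the stored cookie back to the server, which uses it to identify the current user. The amount of data a cookie can hold is limited; the limit differs between browsers but is generally no more than 4 KB, so cookies are only suitable for small amounts of data.

Cookie format:

Set-Cookie: NAME=VALUE; Expires=DATE (or Max-Age=SECONDS); Path=PATH; Domain=DOMAIN_NAME; Secure

Meaning of each field

NAME: the name of the cookie

VALUE: the value of the cookie

Expires: the time at which the cookie expires (Max-Age gives the lifetime in seconds instead)

Path: the path the cookie applies to

Domain: the domain the cookie applies to

Secure: whether the cookie is sent only over HTTPS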
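To see these fields in code, the standard library's http.cookies module can parse a header value in this format. The header string below is a made-up example, not one captured from a real site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value following the format above
header_value = "name=spider; Max-Age=3600; Path=/; Domain=example.com; Secure"

cookie = SimpleCookie()
cookie.load(header_value)

# Each cookie is stored as a Morsel; attributes are looked up by lowercase key
morsel = cookie["name"]
print(morsel.key, "=", morsel.value)
print("Max-Age:", morsel["max-age"])
print("Path:   ", morsel["path"])
print("Domain: ", morsel["domain"])
print("Secure: ", bool(morsel["secure"]))
```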

1. Simulating login with cookielib (named http.cookiejar in Python 3) and HTTPCookieProcessor

The quick and dirty way: find the cookie value in the browser and write it straight into the request headers.

from urllib import request

url='http://www.renren.com/880151247/profile'
headers={
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
    'Cookie':"***" # copy the current value from your own browser
}
req=request.Request(url=url,headers=headers)
resp=request.urlopen(req)
with open('renren.html','w',encoding='utf-8') as fp:
    # write() requires a str
    # resp.read() returns bytes
    # bytes -> decode -> str
    # str -> encode -> bytes
    fp.write(resp.read().decode('utf-8'))

2. The http.cookiejar module

The main classes in this module: CookieJar, FileCookieJar, MozillaCookieJar, LWPCookieJar

Using http.cookiejar

from urllib import parse,request
from http.cookiejar import CookieJar

# The CookieJar keeps cookies in memory; an opener built on it stores and sends them automatically
cookiejar=CookieJar()
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

headers={
   'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
    #'Cookie':"***" # no longer needed; the opener manages cookies
}

data={
    'phone':'***',
    'password':'***' # fill in your own
}
# NOTE: this should be the site's login endpoint; the URL is kept as given in the original
login_url='http://www.renren.com/880151247/profile'
req=request.Request(login_url,data=parse.urlencode(data).encode('utf-8'),headers=headers)
# Log in through the opener (not request.urlopen) so the returned cookies land in cookiejar
opener.open(req)

dapeng_url='http://www.renren.com/'
# Requests made with the same opener carry the login cookies
req=request.Request(url=dapeng_url,headers=headers)
resp = opener.open(req)
print(resp.read().decode('utf-8'))

Saving cookies locally

Use the save function of a file-based cookie jar such as MozillaCookieJar.

from urllib import request
from http.cookiejar import MozillaCookieJar
cookiejar = MozillaCookieJar("cookie.txt")
# To load cookies saved earlier, including session cookies that would
# normally be discarded, pass ignore_discard=True
#cookiejar.load(ignore_discard=True)
headers={
   'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
}
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
req=request.Request('http://httpbin.org/cookies/set/name/spider',headers=headers)
resp=opener.open(req)
print(resp.read())
# ignore_discard=True also saves cookies that expire when the browser closes;
# ignore_expires=True also saves cookies that have already expired
cookiejar.save(ignore_discard=True,ignore_expires=True)

# for cookie in cookiejar:
#     print(cookie)
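The effect of ignore_discard on both save and load can be checked without any network access. The sketch below builds one session cookie by hand and round-trips it through a temporary file; make_session_cookie is a hypothetical helper, and constructing Cookie objects directly is done here only for demonstration.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

def make_session_cookie(name: str, value: str, domain: str) -> Cookie:
    """Build a session cookie (no expiry) by hand, for demonstration only."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save: ignore_discard=True is required, otherwise session cookies are skipped
jar = MozillaCookieJar(path)
jar.set_cookie(make_session_cookie("name", "spider", "httpbin.org"))
jar.save(ignore_discard=True)

# Load into a fresh jar, again keeping session cookies
loaded = MozillaCookieJar(path)
loaded.load(ignore_discard=True)
for cookie in loaded:
    print(cookie.name, "=", cookie.value, "for", cookie.domain)
```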

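The Requests library

Install: pip install requests

Sending GET requests

1. Call requests.get directly

response = requests.get('http://www.baidu.com')

The difference between response.text and response.content

response.content: the data exactly as fetched from the network, without any decoding, so it is of type bytes. Strings stored on disk or sent over the network are always bytes.

response.text: a str. It is the result of requests decoding response.content. Decoding needs an encoding, and requests picks one by guessing, so it sometimes gets it wrong and produces garbled text. In that case, decode manually with response.content.decode('utf-8').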
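The garbling described above can be reproduced with plain bytes and str, with no request involved. This is only a sketch of the underlying idea; ISO-8859-1 is picked here as an example of a wrong guess.

```python
# Bytes as they would arrive from the network for a UTF-8 page
content = "中国".encode("utf-8")
print(type(content), content)

# Decoding with the correct encoding recovers the original string
print(content.decode("utf-8"))

# Decoding with a wrongly guessed encoding "succeeds" but yields garbled text
garbled = content.decode("iso-8859-1")
print(repr(garbled))
```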

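2. To add request headers, pass the headers parameter. To pass parameters in the URL, use the params parameter.

import requests
# Query parameters; requests URL-encodes them automatically
kw={'wd':"中国"}
headers={
    'User-Agent' : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36"
}
response=requests.get("http://www.baidu.com/s",params=kw,headers=headers)
# text is a str (decoded Unicode)
print(response.text)
# content is bytes
print(response.content)
# The final URL, including the encoded query string
print(response.url)
# The encoding requests used to produce text
print(response.encoding)
# The HTTP status code
print(response.status_code)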

Sending POST requests
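A POST request with requests might look roughly like the sketch below. The form fields and the User-Agent value are placeholders, post_form is a hypothetical helper name, and http://httpbin.org/post is used only because it echoes back what it receives.

```python
import requests

# Hypothetical form fields; httpbin.org/post echoes back whatever it receives
url = "http://httpbin.org/post"
data = {"phone": "13800000000", "password": "secret"}
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}

def post_form() -> requests.Response:
    """Send the form with POST; data= is encoded as a form body."""
    return requests.post(url, data=data, headers=headers, timeout=10)

# Preparing the request shows what would be sent, without touching the network
prepared = requests.Request("POST", url, data=data, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])
print(prepared.body)

# To actually send it:
# response = post_form()
# print(response.status_code)
# print(response.json()["form"])
```

Passing a dictionary as data sends it as a form body, which is what most login forms expect.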

Using a proxy

To use a proxy with requests, just pass the proxies parameter in the request.
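A minimal sketch of the proxies parameter follows. The proxy address is a placeholder from the 192.0.2.0/24 documentation range, and fetch_ip_via_proxy is a hypothetical helper name; substitute a proxy that actually works before calling it.

```python
import requests

# Hypothetical proxy address (192.0.2.0/24 is reserved for documentation)
proxy = "192.0.2.10:8080"
# Keys are the scheme of the target URL; values are the proxy's own URL
proxies = {
    "http": "http://" + proxy,
    "https": "http://" + proxy,
}

def fetch_ip_via_proxy(timeout: float = 10.0) -> str:
    """Ask httpbin which IP it sees; with a working proxy it is the proxy's."""
    resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    return resp.text

# Not called here because the placeholder proxy does not exist:
# print(fetch_ip_via_proxy())
```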

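Handling cookies

Using a session to keep cookies

import requests
resp=requests.get("http://www.baidu.com")
# Cookies returned by a single response
print(resp.cookies)
# A Session keeps cookies across requests (the parentheses create an instance)
session=requests.Session()
url="**"
headers={
    # ***** (request headers, elided in the original)
}
data={
    # ***** (form fields, elided in the original)
}
# Pass data and headers by keyword; the third positional parameter of post is json, not headers
session.post(url,data=data,headers=headers)
# The session sends the cookies set by the login on this request
response=session.get('*****')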

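Handling untrusted SSL certificates:

Pass verify=False to requests.get (certificate verification only applies to https:// URLs)

resp=requests.get("http://www.baidud.com",verify=False)

If a warning appears (urllib3 issues an InsecureRequestWarning when verification is turned off)

it can be silenced by adding the following call before the request

import requests
# Suppress urllib3 warnings such as InsecureRequestWarning
requests.packages.urllib3.disable_warnings()
resp=requests.get("http://www.baidud.com",verify=False)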
