Web Crawlers (1)

Latest recommended article published on 2024-09-14 11:30:21

弗拉基米王

Tags: web crawlers

Article link: https://blog.csdn.net/DSJDWWJ/article/details/102991538

I. Network Requests

1. Crawler Basics

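General-purpose crawlers: a general-purpose crawler is a key part of a search engine's crawling system (Baidu, Google, Sogou). Its main job is to download web pages from the internet to local storage, forming a mirror backup of internet content.
Focused crawlers: a focused crawler is a crawler program written for a specific need. Unlike a general-purpose crawler, it filters content while it crawls, so that as far as possible only pages relevant to the need are fetched.
The HTTP protocol and URLs in detail
URL is short for Uniform Resource Locator.
- A URL consists of the following parts: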

scheme://host:port/path/?query-string=xxx#anchor

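scheme: the access protocol, usually http or https, or others such as ftp
host: the host name or domain name, for example www.baidu.com
port: the port number; when omitted, the scheme's default is used (80 for http, 443 for https)
path: the lookup path
query-string: the query string
anchor: the anchor; the backend generally ignores it, and the frontend uses it to position within the page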
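The parts listed above can be pulled out of a URL with the standard library. This is a minimal sketch; the URL below is a made-up example chosen so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL that exercises every component of the template
url = 'https://www.baidu.com:443/s/?wd=python#results'
parts = urlsplit(url)

print(parts.scheme)    # access protocol
print(parts.hostname)  # host name
print(parts.port)      # port number, as an int
print(parts.path)      # lookup path
print(parts.query)     # query string
print(parts.fragment)  # anchor
```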
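When a browser requests a URL, it encodes the URL: apart from English letters, digits, and some symbols, every character is encoded as a percent sign followed by the hexadecimal value of each of its bytes.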
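The same percent-encoding can be reproduced in Python with `urllib.parse.quote`. A small sketch, using an arbitrary sample string:

```python
from urllib.parse import quote, unquote

# Letters, digits and a few safe symbols (including '/') pass through unchanged;
# other characters become % followed by the hex value of each UTF-8 byte.
encoded = quote('爬虫/page 1')
decoded = unquote(encoded)

print(encoded)
print(decoded)
```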
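Common request header parameters:
- In the HTTP protocol, a request sent to the server can carry data in three places:
  - in the URL
  - in the body
  - in the headers
- 1. User-Agent: the browser name. When a page is requested, the server uses this parameter to tell which browser sent the request. If the request is sent by a crawler using urllib, the default User-Agent is Python-urllib followed by a version number, which gives the crawler away
- 2. Referer: indicates which URL the current request came from. This is commonly used as an anti-crawling measure: if the request does not come from the expected page, the server does not respond normally
- 3. Cookie: the HTTP protocol is stateless. When one person sends two requests, the server cannot tell whether they came from the same person, so cookies are used as a marker. To access a site that requires login, the request generally has to carry the cookie information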
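As a rough illustration of these three headers, the sketch below builds a `urllib.request.Request` without sending it and inspects what would go out. The header values are made up for the example.

```python
from urllib import request

# Default User-Agent that urllib attaches when none is supplied
default_ua = dict(request.build_opener().addheaders)['User-agent']
print(default_ua)

# Hypothetical header values, for illustration only
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'http://www.baidu.com/',
    'Cookie': 'sessionid=abc123',
}
req = request.Request('http://www.baidu.com/s', headers=headers)

# Request normalizes header names with str.capitalize(), hence 'User-agent'
for name, value in req.header_items():
    print(name, '=', value)
```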
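Common response status codes:
- 1. 200: the request succeeded
- 2. 301: permanent redirect. For example, visiting www.jingdong.com redirects to www.jd.com
- 3. 302: temporary redirect. For example, visiting a page that requires login while not logged in redirects to the login page
- 4. 404: the requested URL cannot be found on the server; in other words, the URL is wrong
- 5. 403: insufficient permissions, access denied
- 6. 500: internal server error, possibly a bug on the server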
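The standard library already knows the official reason phrase for each code. The helper below is a small sketch (the name `describe` and the category labels are made up here) that maps a code to its phrase and a coarse category.

```python
from http import HTTPStatus

def describe(code):
    """Return (reason phrase, coarse category) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirection'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'informational'
    return phrase, category

for code in (200, 301, 302, 403, 404, 500):
    print(code, describe(code))
```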
Chrome packet capture tool

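2. urllib Functions

The urllib library: urllib is a basic network request library in Python. It can imitate a browser, send a request to a given server, and save the data the server returns.
The urlopen function: in Python 3's urllib, all methods related to network requests are collected in the urllib.request module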

from urllib import request
resp = request.urlopen('http://www.baidu.com')
print(resp.read())

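1. url: the URL to request
2. data: the data to send. If this value is set, the request becomes a POST request
3. Return value: an http.client.HTTPResponse object. It is a file-like object with methods such as read(size), readline, readlines, and getcode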
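The effect of the `data` parameter can be checked without touching the network. In this sketch the form fields are made up, and the requests are only constructed, never sent.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
form = {'name': 'spider', 'page': '1'}  # hypothetical form fields

# Without data the request is a GET; with data (bytes) it becomes a POST
get_req = request.Request(url)
post_req = request.Request(url, data=parse.urlencode(form).encode('utf-8'))

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)
```

Passing either object to `request.urlopen` would send it with the method shown.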
The urlretrieve function: this function makes it easy to save a file from a web page to local disk

from urllib import request
request.urlretrieve('http://www.baidu.com/','baiduo.html')

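The urlencode function: when a browser sends a request and the URL contains Chinese or other special characters, the browser encodes them automatically. When code sends the request, the encoding must be done by hand, and urlencode is the function for that. It converts dictionary data into URL-encoded data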

from urllib import parse, request

# Encode a dict as a URL query string
data = {'name':'爬虫基础', 'greet':'hello world'}
qs = parse.urlencode(data)
print(qs)

# Append an encoded query string to a URL before requesting it
url = 'http://www.baidu.com/s'
params = {'wd':'刘德华'}
qs = parse.urlencode(params)
url = url + '?' + qs
resp = request.urlopen(url)
print(resp.read())

The parse_qs function: decodes URL parameters that have been encoded

from urllib import parse
qs = 'name=%E7%88%AC%E8%99%AB%E5%9F%BA%E7%A1%80&greet=hello+world'
print(parse.parse_qs(qs))

Note: in the encoded query string, a space is represented by +
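The sketch below contrasts the two ways a space can be encoded and shows that `parse_qs` turns `+` back into a space. Note that `parse_qs` returns each value as a list.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlencode

text = 'hello world'
print(quote(text))                 # percent-encodes the space
print(quote_plus(text))            # uses + for the space
print(urlencode({'greet': text}))  # urlencode follows the + convention

# parse_qs reverses the encoding; every value comes back as a list
decoded = parse_qs('name=%E7%88%AC%E8%99%AB%E5%9F%BA%E7%A1%80&greet=hello+world')
print(decoded)
```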
urlparse and urlsplit: split a URL into its individual parts

The only difference between urlparse and urlsplit is that urlparse has an extra params attribute. params is the part of the path after a ; and before the ?. For example, for url = 'http://www.baidu.com/s;hello?username=zhangsan', params='hello'

from urllib import parse
url = 'http://www.baidu.com/s?username=zhangsan'
# result = parse.urlparse(url)
result = parse.urlsplit(url)
print('scheme:',result.scheme)
print('netloc:',result.netloc)
print('path:',result.path)
print('query:',result.query)
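To see where the two functions diverge, the following sketch parses one URL both ways. The URL is an illustrative one that carries a `;`-delimited parameter on its last path segment.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/s;hello?username=zhangsan'
parsed = urlparse(url)
split = urlsplit(url)

# urlparse separates the ;params part from the path
print(parsed.path, parsed.params, parsed.query)
# urlsplit leaves it inside the path and has no params attribute
print(split.path, split.query)
print(hasattr(split, 'params'))
```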

The request.Request class: to add request headers to a request, use request.Request

from urllib import request
headers = {
    'User-Agent':''
}
req = request.Request('http://www.baidu.com', headers=headers)
resp = request.urlopen(req)
print(resp.read())

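2. ProxyHandler (Proxy Settings)

How a proxy works:
- 1. Instead of requesting the target site directly, we first send the request to a proxy server, and the proxy server requests the target site. Once the proxy server has the target site's data, it forwards the data back to our code
- 2. http://httpbin.org: this site makes it easy to inspect the parameters of an HTTP request
- 3. Using a proxy in code:
  - Use urllib.request.ProxyHandler and pass in a proxy. The proxy is a dictionary whose key depends on the protocol the proxy server accepts, usually http or https, and whose value is ip:port
  - Use the handler created in the previous step with request.build_opener to create an opener object
  - Use the opener created in the previous step and call its open function to send the request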

from urllib import request
url = 'http://httpbin.org/ip'
# Without a proxy
# resp = request.urlopen('http://httpbin.org/ip')
# print(resp.read().decode('utf-8'))

# Pass in the proxy to build a handler object
handler = request.ProxyHandler({'http':'218.66.161.88:31869'})
opener = request.build_opener(handler)
req = request.Request(url)
resp = opener.open(req)
print(resp.read())
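One way to make the proxy optional is to wrap the handler and opener steps in a helper. This is only a sketch: `build_proxy_opener` is a made-up name, and the proxy address is a placeholder for a local proxy, not a working server. Nothing is sent over the network here.

```python
from urllib import request

def build_proxy_opener(proxies=None):
    """Return an opener that routes through `proxies`, or a direct opener when none is given."""
    if proxies:
        return request.build_opener(request.ProxyHandler(proxies))
    # An empty mapping disables proxies, including any picked up from the environment
    return request.build_opener(request.ProxyHandler({}))

opener = build_proxy_opener({'http': '127.0.0.1:8888'})  # hypothetical local proxy

# Confirm which proxy mapping the opener will use
proxy_handlers = [h for h in opener.handlers if isinstance(h, request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```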

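3. Cookies

Cookies: HTTP requests to a website are stateless. After the first login, the server returns some data (a cookie) to the browser, and the browser stores it locally. When the user sends a second request, the browser automatically attaches the stored cookie data, and the server uses it to work out which user is making the request. The amount of data a cookie can hold is limited. The limit varies by browser but is generally no more than 4 KB, so cookies can only store small amounts of data
Cookie format: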

Set-Cookie: NAME=VALUE; Expires/Max-age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE

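Meaning of the parameters:
- NAME: the name of the cookie
- VALUE: the value of the cookie
- Expires: the expiry time of the cookie
- Path: the path the cookie applies to
- Domain: the domain the cookie applies to
- SECURE: whether the cookie only takes effect over https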
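These fields can be read out of a header value with `http.cookies.SimpleCookie`. A minimal sketch, assuming a made-up header value (the part that follows `Set-Cookie: `):

```python
from http.cookies import SimpleCookie

# Hypothetical header value (everything after "Set-Cookie: ")
header_value = 'sessionid=abc123; Max-Age=3600; Path=/; Domain=example.com; Secure'

cookie = SimpleCookie()
cookie.load(header_value)
morsel = cookie['sessionid']

# NAME and VALUE
print(morsel.key, morsel.value)
# Attributes are looked up by lowercase name
print(morsel['max-age'], morsel['path'], morsel['domain'], morsel['secure'])
```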
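The http.cookiejar module:
- 1. CookieJar: an object that manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. All cookies are kept in memory, so they are lost once the CookieJar instance is garbage collected
- 2. FileCookieJar(filename, delayload=None,policy=None): derived from CookieJar; can save cookies to a file and load them back
- 3. MozillaCookieJar(filename,delayload=None,policy=None): derived from FileCookieJar; uses the Mozilla cookies.txt file format
- 4. LWPCookieJar(filename,delayload=None,policy=None): derived from FileCookieJar; uses the libwww-perl Set-Cookie3 file format
Simulating a login with the http.cookiejar module (cookielib in Python 2) and HTTPCookieProcessor: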

from urllib import request, parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}

def get_opener():
    # 1.1 Create a CookieJar object
    cookiejar = CookieJar()
    # 1.2 Use the cookiejar to create an HTTPCookieProcessor object
    handler = request.HTTPCookieProcessor(cookiejar)
    # 1.3 Use the handler from the previous step to create an opener
    opener = request.build_opener(handler)
    return opener

def login_renren(opener):
    # 1.4 Use the opener to send the login request
    data = {
        'email':'970138074@qq.com',
        'password':'pythonspider'
    }
    login_url = 'http://www.renren.com/PLogin.do'
    req = request.Request(login_url, data=parse.urlencode(data).encode('utf-8'), headers=headers)
    opener.open(req)

def visit_profile(opener):
    # 2. Visit the profile page; the opener carries the login cookies
    dapeng_url = 'http://www.renren.com/880151247/profile'
    req = request.Request(dapeng_url,headers=headers)
    resp = opener.open(req)
    with open('renren.html', 'w', encoding='utf-8') as f:
        f.write(resp.read().decode('utf-8'))

if __name__ == '__main__':
    opener = get_opener()
    login_renren(opener)
    visit_profile(opener)

Saving cookies to a local file
Loading cookies from a local file

from urllib import request
from http.cookiejar import MozillaCookieJar
cookiejar = MozillaCookieJar('cookie.txt')
# Load previously saved cookies (cookie.txt must already exist)
cookiejar.load('cookie.txt')
# cookiejar.load() also accepts ignore_discard
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
resp = opener.open("http://www.baidu.com/")
# resp = opener.open('http://httpbin.org/cookies/set?course=abc')
# cookiejar.save()
# ignore_discard=True also saves cookies that are marked to be discarded when the session ends
# cookiejar.save(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)
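The save and load round trip, and the role of `ignore_discard`, can be tried offline. In this sketch the cookie is built by hand rather than received from a server; the helper `make_cookie`, the cookie name and value, and the domain are all made up for illustration.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

def make_cookie(name, value, domain):
    """Build a session cookie by hand (discard=True, no expiry time)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = MozillaCookieJar(path)
jar.set_cookie(make_cookie('course', 'abc', 'example.com'))
# Session cookies are skipped unless ignore_discard=True
jar.save(ignore_discard=True)

# Loading applies the same rule
strict = MozillaCookieJar(path)
strict.load()
relaxed = MozillaCookieJar(path)
relaxed.load(ignore_discard=True)

strict_names = [c.name for c in strict]
relaxed_names = [c.name for c in relaxed]
print(strict_names, relaxed_names)
```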

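4. The requests Library

Sending GET requests:

1. Send a GET request

response = requests.get('http://www.baidu.com/')

2. Add headers and query parameters:
To add headers, pass the headers parameter to add entries to the request headers. To pass parameters in the URL, use the params parameter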

import requests
kw = {'kw': '张三'}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
response = requests.get('http://www.baidu.com/s', params=kw, headers=headers)
# View the response content as str (Unicode) data
# with open('baidu.html', 'w', encoding='utf-8') as f:
#     f.write(response.content.decode('utf-8'))
# print(response.text)
# View the response content as a byte stream (bytes)
print(response.content)
# The full URL that was requested
print(response.url)
# The character encoding taken from the response headers
print(response.encoding)
# The response status code
print(response.status_code)

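The difference between response.text and response.content
- 1. response.content: the data exactly as fetched from the network, with no decoding applied, so its type is bytes (strings on disk and in network transfer are always bytes)
- 2. response.text: the string that requests produces by decoding response.content. Decoding needs a specific encoding, and requests picks one by guessing. The guess is sometimes wrong, which leads to garbled text. In that case, decode manually with response.content.decode('utf-8')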
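Why a wrong guess produces garbled text can be shown with plain bytes, without requests at all. In this sketch `raw` stands in for `response.content`, and `iso-8859-1` is just one example of a wrong guess.

```python
# raw stands in for response.content: UTF-8 bytes as they arrive from the network
raw = '爬虫基础'.encode('utf-8')

right = raw.decode('utf-8')       # correct encoding: the original text is recovered
wrong = raw.decode('iso-8859-1')  # wrong guess: every byte becomes its own character

print(right)
print(len(raw), len(right), len(wrong))
```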
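Sending POST requests:
- 1. The most basic POST request uses the post method:
```
response = requests.post('http://www.baidu.com/',data=data)
```
- 2. Passing data: there is no need to encode it with urlencode; just pass in a dictionary
- 3. If the response is JSON data, call response.json() to convert the string into a dictionary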
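As a rough picture of what response.json() gives back, the sketch below parses a JSON string with the standard `json` module. The body is a made-up stand-in for response.text, and the assumption here is that response.json() yields the same kind of Python objects as `json.loads` on the body text.

```python
import json

# Hypothetical JSON body, standing in for response.text of an API reply
body = '{"origin": "1.2.3.4", "args": {"page": "1"}}'

data = json.loads(body)
print(type(data).__name__)
print(data['origin'])
print(data['args']['page'])
```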
Using a proxy: just pass the proxies parameter to the request method

import requests
url = 'http://httpbin.org/ip'
proxy = {
    'http':'host:port'  # proxy ip:port
}
resp = requests.get(url, proxies=proxy)
with open('xx.html','w', encoding='utf-8') as f:
    f.write(resp.text)

Cookies: read the cookie values directly from the cookies attribute

import requests
resp = requests.get('http://www.baidu.com/')
print(resp.cookies)
print(resp.cookies.get_dict())

Session: to share cookies across multiple requests, use a session as follows

import requests
url = 'http://www.renren.com/PLogin.do'
data = {'email':'xxxx', 'password':'xxxx'}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/'
              '537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
session = requests.session()
session.post(url, data=data,headers=headers)
response = session.get('http://www.renren.com/88015247/profile')
with open('rr.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

Handling untrusted SSL certificates:

import requests
# url is assumed to hold the address of a site whose certificate is not trusted
resp = requests.get(url, verify=False)
print(resp.content.decode('utf-8'))
