Web Crawler Basics

Latest recommended article published on 2024-06-19 17:27:45

代码小风

Article tags: python

Article link: https://blog.csdn.net/weixin_44722998/article/details/109880091

Preparing to Write a Crawler

Prerequisites
- URLs
- The HTTP protocol
- Web front end: HTML, CSS, JS
- Ajax
- re (regular expressions), XPath
- XML

Test site: https://inv-veri.chinatax.gov.cn/

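Introduction to Crawlers

Definition: a web crawler (also called a web spider or web robot, and more often a "web chaser" in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
Two key characteristics
- It downloads the data or content its author asks for
- It moves around the network automatically
Three main steps:
- Download a web page
- Extract the right information
- Follow a rule to jump to another page and repeat the two steps above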
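The three steps can be sketched as a small breadth-first loop. This is only an illustration under stated assumptions: FAKE_WEB is a made-up in-memory "website" so the sketch runs without network access, and crawl, fetch and HREF_RE are names invented for this example.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<title>Page A</title> <a href="/">home</a>',
    "http://example.com/b.html": "<title>Page B</title>",
}

HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: download a page, extract its links, follow them."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)                    # step 1: download the page
        if html is None:
            continue
        visited.append(url)
        for href in HREF_RE.findall(html):   # step 2: extract information (here, links)
            nxt = urljoin(url, href)         # step 3: resolve the link and queue it
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return visited


pages = crawl("http://example.com/", FAKE_WEB.get)
print(pages)
```

In a real crawler, fetch would be a function that downloads the page over HTTP, and step 2 would extract the data you actually want in addition to the links.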
Types of crawlers
- General-purpose crawlers
- Special-purpose (focused) crawlers
Python networking packages
- Python 3: urllib, requests

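Common Methods

urllib.request.urlopen()

Note: urlopen() called with a bare URL cannot change the request headers. It works on ordinary sites, but the default User-Agent makes the crawler easy to detect.
• urllib.request.urlopen("URL"): sends a request to the site and returns the response
• bytes = response.read()
• string = response.read().decode("utf-8")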
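A minimal sketch of the read-then-decode pattern above. The helper name fetch_text is an assumption made for this example, and a data: URL (which urlopen also understands) stands in for a real site so the code runs offline.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8", timeout=10):
    """Open url, read the raw bytes, and decode them to str."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()      # bytes
    return raw.decode(encoding)    # str


# A data: URL keeps the demo offline; swap in a real http(s) URL when online.
print(fetch_text("data:text/plain;charset=utf-8,hello%20crawler"))
```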

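Using urlretrieve
urlretrieve(url, path): pass in the download address of an image (or any other file) and a storage path, and the file is downloaded and saved automatically, which saves time and effort.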
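A small sketch of urlretrieve, assuming a data: URL as a stand-in for a real image address and a temporary directory as the storage location; both are placeholders chosen so the example runs offline.

```python
import os
import tempfile
from urllib.request import urlretrieve

# Hypothetical source: a data: URL stands in for a real image address.
source_url = "data:text/plain;charset=utf-8,fake-image-bytes"
target_path = os.path.join(tempfile.mkdtemp(), "download.bin")

# urlretrieve returns the local file name and the response headers.
saved_path, headers = urlretrieve(source_url, target_path)

with open(saved_path, "rb") as f:
    content = f.read()
print(saved_path, len(content))
```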

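urllib.request.Request()

Lets you modify the request headers, which helps get past anti-crawler checks
• urllib.request.Request("URL", headers=dict)
urlopen() on its own does not support replacing the User-Agent, so build a Request object and pass it to urlopen().
Methods of the response object:
• read(): reads the content of the server's response
• getcode(): returns the HTTP status code
• geturl(): returns the URL the data actually came from (useful when a redirect occurred)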
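As a rough illustration, a Request with a custom header can be built and inspected without sending anything. The User-Agent string below is a placeholder, not a recommended value.

```python
from urllib.request import Request

# The User-Agent string is a placeholder chosen for this example.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) demo-crawler"}
req = Request("http://www.baidu.com/", headers=headers)

# Request stores header names capitalized, so the key becomes "User-agent".
print(req.get_header("User-agent"))
print(req.get_method())  # GET, because no data was attached

# When online, the request would be sent like this:
#   from urllib.request import urlopen
#   with urlopen(req, timeout=10) as rsp:
#       print(rsp.getcode(), rsp.geturl())
```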

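The urllib.parse module

Common methods

urlencode

• urlencode(dict): encodes a dict (or a sequence of two-element tuples) into a query string; it cannot encode a plain string

from urllib import parse

url = 'http://www.baidu.com/s?'
wd = input("Input your keyword:")

# urlencode expects key/value pairs, so wrap the keyword in a dict
qs = {"wd": wd}

# Convert to URL encoding
qs = parse.urlencode(qs)
print(qs)

fullurl = url + qs
print(fullurl)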

Note:

import urllib.parse

data = urllib.parse.urlencode({"kw": "猫"})
print(type(data))  # data is a string, <class 'str'>; append it to the URL for a GET request

data = urllib.parse.urlencode({"kw": "猫"}).encode("utf-8")
print(type(data))  # data is <class 'bytes'>; use it as a POST body, because the data argument only accepts bytes, not str

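quote()

• quote(string): the argument is a string

# Chinese characters in a URL may come out garbled, so a Chinese path or
# query value needs to be percent-encoded with quote().
# Use unquote() to decode it again.
from urllib.parse import quote
keyword = "熊猫"
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)  # https://www.baidu.com/s?wd=%E7%86%8A%E7%8C%AB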
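The reverse direction mentioned above, unquote(), can be sketched as a simple round trip:

```python
from urllib.parse import quote, unquote

encoded = quote("熊猫")      # percent-encode the UTF-8 bytes
decoded = unquote(encoded)   # turn the escapes back into text

print(encoded)
print(decoded)
```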

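urljoin()

from urllib.parse import urljoin

# Join a base URL with a (possibly relative) URL; the base should include the scheme
print(urljoin('http://www.baidu.com', '哈啊哈哈.html'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
# An absolute second argument replaces the base entirely
print(urljoin('http://www.baidu.com', 'http://qq.com'))  # http://qq.com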

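Request methods: GET and POST

• GET: the query parameters are visible in the URL
• POST
• Add the data argument when building the Request: urllib.request.Request(url, data=data, headers=headers)
• data: the form data, submitted as bytes; it cannot be a str
With urllib, the practical difference between a GET and a POST request is whether form data is attached: with data the request is sent as POST, without it as GET.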
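A minimal sketch of that rule, built without sending anything. The form field and the httpbin.org endpoint are used here only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form field and endpoint, used only for illustration.
form = {"kw": "猫"}
data = urlencode(form).encode("utf-8")   # the POST body must be bytes

req = Request(
    "http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(req.get_method())  # POST, because data is not None
print(req.data)
```

Dropping the data argument from the same call would make get_method() report GET.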

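chardet

Solving web page encoding problems with chardet

import urllib.request
import chardet  # third-party package: pip install chardet

url = 'https://www.baidu.com/'

rsp = urllib.request.urlopen(url)

html = rsp.read()

# Detect the encoding automatically with chardet
cs = chardet.detect(html)
print(type(cs))
print(cs)  # e.g. {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

# detect() may report the encoding as None, so fall back to utf-8
html = html.decode(cs.get("encoding") or "utf-8")
# print(html)

detect() returns a dict. Its encoding key holds the encoding chardet inferred, and its confidence key holds how reliable that guess is: a float between 0 and 1, where 0 means no confidence and 1 means complete confidence. As the sample output shows, the dict may also contain a language key.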

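The object returned by urlopen

geturl(): returns the URL of the resource actually retrieved
info(): returns the meta-information (headers) of the response
getcode(): returns the HTTP status code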
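A short sketch of these accessors. A data: URL is assumed as the target so the example runs offline; with a real HTTP URL, getcode() would report a status such as 200.

```python
from urllib.request import urlopen

# A data: URL stands in for a real page so the demo needs no network.
with urlopen("data:text/plain;charset=utf-8,ok") as rsp:
    final_url = rsp.geturl()                       # URL actually retrieved
    content_type = rsp.info().get_content_type()   # parsed from the headers
    body = rsp.read()

print(final_url)
print(content_type)
print(body)
```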

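Using request.data

Two ways of accessing the network

GET:

Passes information to the server through URL parameters.
The parameters are a dict, encoded with urllib.parse.

POST:

Generally used to send parameters to the server.
POST puts the information in the request body instead of the URL. It does not encrypt the data; only HTTPS does that.
To send a POST request, use the data argument.
Using POST means the HTTP request headers may need to change:
- Content-Type: application/x-www-form-urlencoded
- Content-Length: length of the data
- In short, once you change the request method, make sure the other request headers are adjusted to match
urllib.parse.urlencode converts a dict into the form-encoded format named above.
- Setting more request details is awkward with urlopen alone; use the request.Request class instead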

urllib.error

Causes of URLError:
1. No network connection
2. The connection to the server failed
3. The specified server cannot be found
We can catch the exception with a try/except statement.

from urllib import request, error

url = "http://www.du.com"
try:
    req = request.Request(url)
    rsp = request.urlopen(req)
    html = rsp.read().decode()
    print(html)

except error.URLError as e:
    print("URLError: {0}".format(e.reason))
    print("URLError: {0}".format(e))

except Exception as e:
    print(e)

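HTTPError

import urllib.request
import urllib.error

request = urllib.request.Request('http://blog.baidu.com/itcast')

try:
    urllib.request.urlopen(request)
except urllib.error.HTTPError as err:
    print(err.code)

HTTPError is a subclass of URLError.
Differences between the two:
- HTTPError corresponds to an error status code in the HTTP response; a status code of 400 or above raises HTTPError
- URLError generally corresponds to a network problem, including a problem with the URL itself
- Inheritance: OSError -> URLError -> HTTPError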
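Because of that inheritance chain, the more specific HTTPError has to be checked (or caught) before URLError. The sketch below shows this with exceptions constructed by hand so that it runs offline; describe is a helper name invented for the example.

```python
from urllib.error import HTTPError, URLError


def describe(exc):
    """Classify an exception; HTTPError must be tested before URLError."""
    if isinstance(exc, HTTPError):
        return "HTTP error {}".format(exc.code)
    if isinstance(exc, URLError):
        return "URL error: {}".format(exc.reason)
    return "other error"


# Exceptions are constructed by hand so the sketch runs without a network.
http_exc = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
url_exc = URLError("no route to host")

print(issubclass(HTTPError, URLError), issubclass(URLError, OSError))
print(describe(http_exc))
print(describe(url_exc))
```

The same ordering applies to except clauses: put except HTTPError before except URLError, otherwise the URLError clause handles both.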

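UserAgent

User-Agent (UA for short) is part of the request headers; the server uses the UA to judge who the visitor is.
There are two ways to set the UA:
- the headers argument
- add_header()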
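Both ways can be sketched side by side. The UA value is a placeholder, and nothing is sent over the network.

```python
from urllib.request import Request

UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) demo"  # placeholder value

# Way 1: pass a headers dict when creating the Request
req1 = Request("http://www.baidu.com/", headers={"User-Agent": UA})

# Way 2: create the Request first, then call add_header()
req2 = Request("http://www.baidu.com/")
req2.add_header("User-Agent", UA)

# Both requests end up carrying the same header.
print(req1.get_header("User-agent") == req2.get_header("User-agent"))
```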

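fake_useragent

Usage

import requests
from fake_useragent import UserAgent

# ua = UserAgent().chrome
headers = {'User-Agent': str(UserAgent().random)}
# url and proxies must be defined by the caller
r = requests.get(url, proxies=proxies, headers=headers, timeout=10)


from fake_useragent import UserAgent

# Get a Chrome UA string (use .random for a random browser)
ua = UserAgent().chrome

Note

While using fake_useragent I ran into this error: fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
The cause was that the UserAgent list used by fake_useragent had changed while my local copy had not been updated. After updating fake_useragent the error disappeared.

To update fake_useragent, run pip install -U fake-useragent on the command line. Other Python packages can be updated the same way: pip install -U package-name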

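ProxyHandler (proxy servers)

Using proxy IPs is a common crawler technique.
Sources of proxy server addresses:

西刺免费代理IP (Xici free proxy IPs)
快代理 (Kuaidaili)
代理云 (Dailiyun)

A proxy hides the real visitor. Proxies also do not allow frequent access to one fixed site, so you need a large pool of them.
Basic steps:
1. Set the proxy address
2. Create a ProxyHandler
3. Create an Opener
4. Install the Opener

from urllib import request

url = "http://www.baidu.com"

# Steps for using a proxy
# 1. Set the proxy address
proxy = {'http': '120.194.18.90:81'}
# 2. Create a ProxyHandler
proxy_handler = request.ProxyHandler(proxy)
# 3. Create an Opener
opener = request.build_opener(proxy_handler)
# 4. Install the Opener so that request.urlopen() uses it too
request.install_opener(opener)

# Use a different name than "request" so the module is not shadowed
req = request.Request(url)
response = opener.open(req)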
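Since many proxies are needed, one simple approach is to pick a proxy at random for each opener. This is only a sketch: PROXY_POOL holds placeholder addresses from a documentation-only IP range, and make_opener is a name invented for the example.

```python
import random
from urllib import request

# Placeholder addresses from a documentation-only range; replace with real proxies.
PROXY_POOL = ["192.0.2.10:8080", "192.0.2.11:8080", "192.0.2.12:8080"]


def make_opener(pool=PROXY_POOL):
    """Return (opener, address) using one randomly chosen proxy from the pool."""
    address = random.choice(pool)
    handler = request.ProxyHandler({"http": address, "https": address})
    return request.build_opener(handler), address


opener, chosen = make_opener()
print(chosen)

# When online with working proxies:
#   rsp = opener.open("http://www.baidu.com", timeout=10)
```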

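cookie & session

Because HTTP is a stateless protocol, cookies and sessions were introduced as a supplementary mechanism to make up for that shortcoming.
A cookie is a piece of information issued to the user (that is, the HTTP browser); a session is the matching half kept on the server, used to record user information.
Differences between cookie and session
- They are stored in different places
- Cookies are not secure
- A session is kept on the server for a limited time and then expires
- A single cookie holds no more than 4 KB of data, and many browsers limit a site to at most 20 cookies
Where sessions are stored
- On the server side
- Usually in memory or in a database
- Without the cookie, the login-protected page comes back in the not-logged-in state
Logging in with a cookie
- Copy the cookie directly and put it into the request headers by hand
- The http package contains cookie-related modules (http.cookiejar) that let us use cookies automatically
  - CookieJar
    - Manages and stores cookies, and adds cookies to outgoing HTTP requests
    - Cookies are stored in memory; they disappear once the CookieJar instance is garbage-collected
  - FileCookieJar(filename, delayload=None, policy=None):
    - Manages cookies with a file
    - filename is the file the cookies are saved to
  - MozillaCookieJar(filename, delayload=None, policy=None):
    - Creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt
  - LWPCookieJar(filename, delayload=None, policy=None):
    - Creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 format
  - Their relationship: CookieJar -> FileCookieJar -> MozillaCookieJar & LWPCookieJar
- Using a cookiejar to access Renren
  - Logging in with cookies automatically works roughly like this:
  - Open the login page and log in automatically with a user name and password
  - Automatically extract the cookie that comes back
  - Use the extracted cookie to access the private page
- A handler is an instance of a Handler class
  - It is used to process complex requests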
from urllib import request, parse
from http import cookiejar

# Create a CookieJar instance
cookie = cookiejar.CookieJar()

# Create the cookie handler
cookie_handler = request.HTTPCookieProcessor(cookie)
# Create the HTTP handler
http_handler = request.HTTPHandler()

# Create the HTTPS handler
https_handler = request.HTTPSHandler()

# Build the opener from the handlers
opener = request.build_opener(http_handler, https_handler, cookie_handler)

# url is the page to open (an example address; use the page you need)
url = "http://www.baidu.com"
rsp = opener.open(url)

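After the handlers are created, open the URL with the opener; each part of the work is then handled by the corresponding handler.
cookie is an ordinary variable, so we can print its contents:

for item in cookie:
    print(type(item))
    print(item)
    for i in dir(item):
        print(i)

Cookie attributes
- name: the name
- value: the value
- domain: the domain that can access this cookie
- path: the page path that can access this cookie
- expires: the expiry time
- size: the size
- the Http field
Saving cookies with FileCookieJar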

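# Save the cookies to a file. The jar must be a FileCookieJar subclass
# created with a filename, e.g. cookiejar.MozillaCookieJar('cookie.txt')
# ignore_discard: save even cookies that are marked to be discarded
# ignore_expires: save even cookies that have already expired
cookie.save(ignore_discard=True, ignore_expires=True)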

Reading cookies

from urllib import request, parse
from http import cookiejar

# Create a MozillaCookieJar instance and load the saved cookies
cookie = cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

# Create the cookie handler
cookie_handler = request.HTTPCookieProcessor(cookie)
# Create the HTTP handler
http_handler = request.HTTPHandler()

# Create the HTTPS handler
https_handler = request.HTTPSHandler()

# Build the opener from the handlers
opener = request.build_opener(http_handler, https_handler, cookie_handler)
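To see saving and loading work together without a network connection, the round trip can be sketched with one hand-built cookie. The cookie name, value and domain are made up for this example, and the file is written to a temporary directory.

```python
import os
import tempfile
from http import cookiejar

path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Build one cookie by hand; the name, value and domain are made-up examples.
demo = cookiejar.Cookie(
    version=0, name="sessionid", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

# Save: the jar is created with the filename it should write to.
jar = cookiejar.MozillaCookieJar(path)
jar.set_cookie(demo)
jar.save(ignore_discard=True, ignore_expires=True)

# Load: a fresh jar reads the same file back.
loaded = cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
print([(c.name, c.value, c.domain) for c in loaded])
```

The demo cookie has no expiry time, so it counts as a session cookie; without ignore_discard=True it would be skipped on both save and load.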

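SSL

An SSL certificate is a server digital certificate that complies with the SSL (Secure Sockets Layer) protocol.
SSL was developed by Netscape, a US company.
A CA (Certificate Authority) is the trusted third-party organization that issues, manages, and revokes digital certificates.
An untrusted SSL certificate has to be handled separately:

from urllib import request

# Import Python's ssl module
import ssl

# Replace the verifying default HTTPS context with a non-verifying one.
# This turns off certificate checks for every HTTPS request in the process.
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://www.12306.cn/mormhweb/"
rsp = request.urlopen(url)

html = rsp.read().decode()

print(html)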
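An alternative sketch, using only public ssl APIs, is to build a non-verifying context and pass it to a single urlopen() call instead of changing the process-wide default. The helper name unverified_context is an assumption for this example, and skipping verification is only reasonable for sites you already trust.

```python
import ssl


def unverified_context():
    """Return an SSL context that skips certificate and hostname checks."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = unverified_context()
print(ctx.verify_mode == ssl.CERT_NONE)

# When online, pass the context for just this one request:
#   from urllib import request
#   rsp = request.urlopen("https://www.12306.cn/mormhweb/", context=ctx)
```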

JS encryption

https://www.cnblogs.com/mosquito18/p/9759186.html

https://jingyan.baidu.com/article/2f9b480de15eb241cb6cc2d9.html

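Front-end AES/DES encryption in JS, with the matching back-end decryption in Python

Advanced usage examples

Some anti-crawler strategies use JS to encrypt the data being transmitted (usually by taking an MD5 hash).
After encryption, what is transmitted is ciphertext. But the encryption function or process must run in the browser, so the code (the JS code) is necessarily exposed to the user. By reading the encryption algorithm you can reproduce the process in your own code and so get past the protection.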
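As an illustration of reproducing such a routine in Python, suppose the page's JS computes a signature of the form md5(secret + keyword + salt). That scheme, the secret and the parameter names are all assumptions invented for this sketch; a real site's algorithm has to be read from its own JS.

```python
import hashlib


def make_sign(keyword, salt, secret="demo-secret"):
    """Mimic a hypothetical JS routine of the form md5(secret + keyword + salt)."""
    raw = (secret + keyword + salt).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


sign = make_sign("hello", salt="1600000000000")
print(sign)
```

The resulting hex string would then be sent along with the other form fields, just as the browser does.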

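ajax

What is Ajax? Put simply: after a page has finished loading, some information is still not visible, and you have to click a button to see the data rather than having everything loaded at once. Or a site has many pages of data, and when you click "next page" the URL does not change but the content does. Both are Ajax. (Signs of dynamic loading include a "load more" link, or the HTML returned to your own request differing from what you see in the browser.)
- The requests are asynchronous
- There is always a URL and a request method, and possibly data
- The data is usually in JSON format
Solution 1: collect the loaded data directly from the requests the JavaScript makes
Solution 2: use selenium to simulate a browser and scrape the data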
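Solution 1 usually comes down to finding the URL the JavaScript requests (for example in the browser's network panel), requesting it yourself, and parsing the JSON. The endpoint, its paging parameters and the response shape below are hypothetical, and a sample string replaces the real response so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical Ajax endpoint and paging parameters, named only for illustration.
API = "http://example.com/api/list"


def page_url(page, size=20):
    """Build the URL the page's JavaScript would request for one page of data."""
    return API + "?" + urlencode({"page": page, "size": size})


def parse_titles(body):
    """Extract the titles from a JSON body shaped like the sample below."""
    payload = json.loads(body)
    return [item["title"] for item in payload["data"]]


# Sample response text; a real one would come from urlopen(page_url(1)).read().decode()
sample = '{"data": [{"title": "first"}, {"title": "second"}]}'
print(page_url(2))
print(parse_titles(sample))
```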

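Requests: HTTP for Humans

HTTP for Humans: simpler and friendlier
Installation
• pip install requests
• or install it from within your development tool (IDE)
Official documentation
Open-source repository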

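Common requests methods

• requests.get(url)

Attributes of the response object

• response.text

Returns the data as a unicode string (str)

• response.content

Returns the data as a byte stream (binary)

Difference between response.text and response.content
        response.text
        Type: str
        Decoding: requests makes an educated guess at the encoding based on the HTTP headers and decodes with it
        Changing the encoding: response.encoding = "gbk"

        response.content
        Type: bytes
        Decoding: none applied
        Changing the encoding: response.content.decode("utf8")

• response.content.decode('utf-8'): decode manually

• response.url: returns the URL

• response.encoding = 'encoding name': sets the encoding used by response.text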
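The difference can be seen on a Response object built by hand. This is a sketch for offline use only: _content is a private attribute that requests normally fills in itself, and it is set directly here purely for the demonstration.

```python
import requests

# A Response built by hand so the sketch runs offline. _content is a private
# attribute that requests normally fills in itself; it is set here only for the demo.
rsp = requests.Response()
rsp._content = "中文网页".encode("gbk")

raw = rsp.content                    # bytes, exactly as received
rsp.encoding = "gbk"                 # tell requests how to decode .text
text = rsp.text                      # str decoded with rsp.encoding
manual = rsp.content.decode("gbk")   # the same result, decoded by hand

print(type(raw).__name__, text == manual)
```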

GET requests

r = requests.get('https://api.github.com/events')

Sending a POST request

r = requests.post('http://httpbin.org/post', data={'key': 'value'})

Passing URL parameters

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)
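To inspect how params ends up in the URL without sending anything, the request can be prepared first. This is a small offline sketch using the same payload:

```python
import requests

payload = {"key1": "value1", "key2": "value2"}

# Build and prepare the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()
print(prepared.url)
```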

JSON response content

r = requests.get('https://api.github.com/events')
r.json()

Custom request headers

UA

headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)

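Cookie

• To send cookies with requests, pass the cookies argument to the request method (get/post); a proxy is added the same way with the proxies argument

# cookies is a dict of cookie names and values, e.g. {"name": "value"}
r = requests.get(url, cookies=cookies)
r.text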

requests.session()


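# Create a session.
# A session offers the same methods as the requests module:
# session.get()  session.post()
import requests

session = requests.session()

# Send a POST request with the session; the cookies in the response are kept in the session.
# Example: logging in to Renren.
post_url = "http://www.renren.com/PLogin.do"
headers = {"User-Agent": "Mozilla/5.0"}
post_data = {"email": "username", "password": "password"}
session.post(post_url, headers=headers, data=post_data)

# Then request a page that requires login with the same session to get its
# logged-in content; the stored cookies are sent automatically.
# logged_in_page_url is a placeholder for such a page:
# rsp = session.get(logged_in_page_url, headers=headers)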
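The way a session carries cookies from one request to the next can be checked offline. In this sketch a cookie is put into the session by hand to stand in for what a login response would set; the cookie name, value and domain are made up.

```python
import requests

session = requests.Session()

# Pretend a login response set this cookie; the name and value are made up.
session.cookies.set("token", "abc123", domain="example.com", path="/")

# prepare_request() builds the request the session would send, without sending it.
prepared = session.prepare_request(
    requests.Request("GET", "http://example.com/profile")
)
print(prepared.headers.get("Cookie"))
```

The Cookie header is added by the session itself, which is why a page fetched after login sees the logged-in state.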