Section 12 -- Web Crawler 01

Latest recommended article published on 2022-11-19 22:04:12

亚呦u椰

Article link: https://blog.csdn.net/weixin_42375099/article/details/97028649

Contents

1. Common tools
- 1.1 Fiddler panel meanings
2. Fetching data -- the urllib library

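1. Common tools

1.1 Fiddler panel meanings

The Request section in detail

Name	Meaning
Headers	Shows the headers of the HTTP request sent from the client to the server as a hierarchical view, including web client information, cookies, transfer state, and so on
TextView	Shows the body of a POST request as text
WebForms	Shows the GET parameters and the POST body of the request
HexView	Shows the request as hexadecimal data
Auth	Shows the Proxy-Authorization and Authorization information in the request headers
Raw	Shows the entire request as plain text
JSON	Shows JSON-formatted content
XML	If the request body is XML, shows it as a hierarchical XML tree

The Response section in detail

Name	Meaning
Transformer	Shows the encoding information of the response
Headers	Shows the response headers as a hierarchical view
TextView	Shows the response body as text
ImageView	If the requested resource is an image, shows the image from the response
HexView	Shows the response as hexadecimal data
WebView	Previews how the response would look in a web browser
Auth	Shows the Proxy-Authorization and Authorization information in the response headers
Caching	Shows the caching information for this request
Privacy	Shows the privacy (P3P) information for this request
Raw	Shows the entire response as plain text
JSON	Shows JSON-formatted content
XML	If the response body is XML, shows it as a hierarchical XML tree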

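2. Fetching data -- the urllib library

1. A first try

How do we grab a web page?

We simply fetch the page's content by its URL. What we see in the browser is a nicely rendered picture, but that picture only exists because the browser interprets the page. Underneath, it is HTML code plus JS and CSS. If a web page were a person, HTML would be the skeleton, JS the muscles, and CSS the clothes. The most important part therefore lives in the HTML, so let's write an example that pulls a page down.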

from urllib.request import urlopen
 
response = urlopen("http://www.baidu.com")
print(response.read().decode())

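2. Commonly used methods

request.urlopen(url, data, timeout)
- The first parameter, url, is the URL; the second, data, is the data to send when accessing the URL; the third, timeout, sets the timeout.
- The second and third parameters are optional: data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.
- The first parameter, url, is required. In this example we pass Baidu's URL. Calling urlopen returns a response object that holds the returned information.
response.read()
- read() reads the entire content of the response and returns it as bytes.
response.getcode()
- Returns the HTTP status code: 200 on success; 4xx means a client-side error such as a missing page; 5xx means a server-side error.
response.geturl()
- Returns the URL the data actually came from, which helps detect redirects.
response.info()
- Returns the HTTP headers of the server's response.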

from urllib.request import urlopen

# Target URL
url = 'http://www.baidu.com/'
# Send the request
resp = urlopen(url)
# Read the data
info = resp.read().decode('utf-8')
print(info)

# Get the response status code
print(resp.getcode())
# Get the URL of the response
print(resp.geturl())
# Get the response headers
print(resp.info())
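
The timeout parameter and the failure cases mentioned above can be wrapped in one small function. This is only a minimal sketch and not part of the original article: the name fetch_text is hypothetical, and a data: URL is used so the example runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper name used for illustration.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            # Use the charset declared in the headers, else assume UTF-8.
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset)
    except HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        print('HTTP error:', exc.code)
    except URLError as exc:
        # Unknown hosts, refused connections, missing local files, etc.
        print('URL error:', exc.reason)
    return None


# data: URLs are handled locally by urllib, so no network is needed here.
print(fetch_text('data:text/plain;charset=utf-8,hello'))
```

With a real site you would call it as fetch_text('http://www.baidu.com/', timeout=5) and check for None before using the result.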

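3. The Request object

The urlopen call above can also take a request object, which is simply an instance of the Request class. You build it from the URL, data, and so on. For example, the two lines of code above can be rewritten like this: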

from urllib.request import urlopen
from urllib.request import Request
from random import choice

# url = 'http://www.baidu.com/'
url = 'https://httpbin.org/get'

# Build a list of User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
]

# Build the headers
headers = {
    'User-Agent': choice(user_agents),
}
# Build the Request object
req = Request(url, headers=headers)
# Send the request
resp = urlopen(req)
# Read the response
info = resp.read().decode()
# Print it
print(info)

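The result is exactly the same, only with a Request object in between. This style is recommended, because a real request usually needs a lot more attached to it. You build a request, the server responds to it, and the logic stays clear.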
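
One benefit of the Request object is that it can be inspected before anything is sent. The sketch below is an illustration rather than part of the original article; it only builds the object and never calls urlopen.

```python
from urllib.request import Request

url = 'https://httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
}
req = Request(url, headers=headers)

# Nothing is sent until urlopen(req) is called, so these checks are offline.
print(req.full_url)
# No data was supplied, so the method defaults to GET.
print(req.get_method())
# Request stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
print(req.header_items())
```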

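4. GET requests

Most of what is delivered to a browser (HTML, images, JS, CSS, ...) is requested with the GET method. It is the main way to retrieve data.

Example: a search on www.baidu.com

The parameters of a GET request all appear in the URL. If they contain Chinese characters they must be percent-encoded, for which we can use

urllib.parse.urlencode()
urllib.parse.quote()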
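
The two functions differ slightly, which the following offline sketch tries to show (it is an added illustration, not from the original article): quote encodes a single string, while urlencode turns a whole dict into a query string and encodes spaces as '+'.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

word = '拉钩'
encoded = quote(word)
print(encoded)
# unquote reverses the percent-encoding.
print(unquote(encoded) == word)

# quote leaves '/' alone by default; pass safe='' to escape it too.
print(quote('a/b'), quote('a/b', safe=''))

# quote turns a space into %20, while urlencode uses '+'.
print(quote('a b'), urlencode({'q': 'a b'}))

# urlencode handles a whole dict; parse_qs decodes it back.
query = urlencode({'wd': word, 'ie': 'utf-8'})
print(query)
print(parse_qs(query))
```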

Method 1:

from urllib.request import urlopen, Request
from urllib.parse import quote

args = '拉钩'
url = 'https://www.baidu.com/s?wd={}'.format(quote(args))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
req = Request(url, headers=headers)
resp = urlopen(req)
info = resp.read().decode()
print(info)

Method 2:

from urllib.request import urlopen, Request
from urllib.parse import urlencode

args = {
    'wd': '黑马程序员',
    'ie': 'utf-8'
}
print(urlencode(args))
# Output: wd=%E9%BB%91%E9%A9%AC%E7%A8%8B%E5%BA%8F%E5%91%98&ie=utf-8

url = 'https://www.baidu.com/s?{}'.format(urlencode(args))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
req = Request(url, headers=headers)
resp = urlopen(req)
info = resp.read().decode()
print(info)
'''
https://www.baidu.com/s?wd=python&pn=10
https://www.baidu.com/s?wd=python&pn=20
'''
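
The two URLs in the string above hint at how paging works: pn looks like the offset of the first result, growing by 10 per page. Under that assumption (it is inferred from the URLs, not documented here), a page URL could be built as sketched below. The helper name search_url and the page_size default are hypothetical.

```python
from urllib.parse import urlencode


def search_url(keyword, page, page_size=10):
    """Hypothetical helper: build the URL of result page `page` (1-based)."""
    # Assumption: pn is the offset of the first result on the page.
    args = {'wd': keyword, 'pn': (page - 1) * page_size}
    return 'https://www.baidu.com/s?{}'.format(urlencode(args))


for page in range(1, 4):
    print(search_url('python', page))
```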

[Exercise]: crawl the first few pages of any Baidu Tieba forum

'''
https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80
https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80&ie=utf-8&pn=50
https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80&ie=utf-8&pn=100
https://tieba.baidu.com/f?kw=%E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80&ie=utf-8&pn=150

1 0
2 50
3 100
4 150
'''

from urllib.request import Request, urlopen
from urllib.parse import urlencode

def get_html(url, filename):
    print('Downloading: {}'.format(filename))
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    req = Request(url, headers=headers)
    resp = urlopen(req)
    # Keep the raw bytes; save_html writes them in binary mode
    html = resp.read()
    return html

def save_html(html, filename):
    print('Saving: {}'.format(filename))
    with open(filename, 'wb') as f:
        f.write(html)

if __name__ == '__main__':
    info = input('Enter the Tieba forum to crawl: ')
    page = input('Enter the number of pages to crawl: ')

    base_url = 'https://tieba.baidu.com/f?{}'
    for i in range(int(page)):
        # pn is the offset: page 1 -> 0, page 2 -> 50, page 3 -> 100, ...
        args = {
            'kw': info,
            'pn': i * 50
        }
        filename = '{}_page_{}.html'.format(info, i + 1)
        params = urlencode(args)
        new_url = base_url.format(params)
        html = get_html(new_url, filename)
        save_html(html, filename)

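5. POST requests

As mentioned, the Request object has a data parameter, and this is what POST uses. The data we want to send goes into data: start from a dict of key-value pairs matching the form fields, url-encode it, and convert it to bytes.
Meaning of the request/response headers:

Name	Meaning
Accept	Tells the server which data types the client supports
Accept-Charset	Tells the server which character encodings the client accepts
Accept-Encoding	Tells the server which compression formats the client supports
Accept-Language	Tells the server the client's language environment
Host	The client uses this header to tell the server which host name it wants to access
If-Modified-Since	The client uses this header to tell the server when its cached copy of the resource was last modified
Referer	The client uses this header to tell the server which resource it came from (commonly used for hotlink protection)
User-Agent	The client uses this header to tell the server about its software environment
Cookie	The client uses this header to send data along to the server
Refresh	The server uses this header to tell the browser how often to refresh
Content-Type	The server uses this header to state the type of the returned data
Content-Language	The server uses this header to state the language of the returned content
Server	The server uses this header to tell the browser what kind of server it is
Content-Encoding	The server uses this header to tell the browser which compression format the data uses
Content-Length	The server uses this header to tell the browser the length of the returned data

Code example: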

from urllib.request import Request, urlopen
from urllib.parse import urlencode

url = 'http://www.sxt.cn/index/login/login.html'
args = {
    'user': '17703181473',
    'password': '123456'
}
f_data = urlencode(args).encode()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

req = Request(url, f_data, headers)
resp = urlopen(req)
info = resp.read().decode()
print(info)
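
It may help to see what urlencode(...).encode() actually produces and how data changes the request method. The sketch below is an added illustration; the endpoint and the credentials are placeholders, not taken from the article, and nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and form values, used only for illustration.
url = 'https://httpbin.org/post'
form = {'user': 'alice', 'password': 'secret'}

# urlencode gives a str; Request needs bytes, hence .encode().
body = urlencode(form).encode('utf-8')
req = Request(url, data=body, headers={'User-Agent': 'Mozilla/5.0'})

print(type(body).__name__, body)
# Supplying data switches the default method from GET to POST.
print(req.get_method())
print(Request(url).get_method())
```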

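6. Response status codes

A response status code consists of three digits. The first digit defines the class of the response and has five possible values.
Common status codes:

Code	Meaning
100~199	The server has received part of the request and needs the client to send the rest before processing can complete
200~299	The server has received the request and completed processing. Common: 200 (OK, the request succeeded)
300~399	The client must take further action to complete the request, for example because the resource has moved to a new address. Common: 302 (the requested page has temporarily moved to a new URL), 307, and 304 (use the cached resource)
400~499	The client's request contains an error. Common: 404 (the server cannot find the requested page), 403 (the server refuses access, insufficient permission)
500~599	An error occurred on the server. Common: 500 (the request was not completed because the server hit an unexpected condition)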
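
The "first digit decides the class" rule in the table can be written down directly. This is a small added sketch; status_class is a hypothetical helper name, while HTTPStatus comes from the standard library.

```python
from http import HTTPStatus


def status_class(code):
    """Map an HTTP status code to the class given by its first digit."""
    classes = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    return classes.get(code // 100, 'unknown')


for code in (200, 302, 304, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, '->', status_class(code))
```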

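7. Getting data from Ajax requests

Some page content is loaded with AJAX, and AJAX usually returns JSON. Sending a POST or GET directly to the AJAX address returns the JSON data.
AJAX requests can be viewed under XHR in the browser's developer tools.
Two ways to tell whether content comes from an AJAX request:
1. The URL stays the same, but the data on the page changes.
2. The page source does not contain the data we want.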

from urllib.request import Request, urlopen
from time import sleep
for i in range(14, 50):
    url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start={}&limit=20'.format(
        i * 20)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }

    req = Request(url, headers=headers)
    resp = urlopen(req)
    info = resp.read().decode()
    if info == '[]':
        break
    print(info)
    sleep(3)
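
The loop above prints the raw JSON text. A typical next step is to parse it with the json module. The sketch below is an added illustration that works on a made-up sample string; the field names title and score are assumptions and may not match what the real endpoint returns.

```python
import json

# Made-up sample shaped like a JSON array response. The field names
# 'title' and 'score' are assumptions for illustration only.
sample = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'


def parse_items(text):
    """Turn a JSON array string into a list of (title, score) tuples."""
    items = json.loads(text)
    return [(item.get('title'), item.get('score')) for item in items]


print(parse_items(sample))
# An empty page ('[]') simply yields an empty list.
print(parse_items('[]'))
```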

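8. SSL certificate verification

Sites whose addresses start with https are everywhere now. Like a web browser, urllib can verify the SSL certificate for HTTPS requests. If the site's certificate is signed by a CA, access works normally, for example https://www.baidu.com/
If certificate verification fails, or the operating system does not trust the server's certificate, the user is warned that the certificate is untrusted. This happened, for example, when a browser visited the 12306 site at https://www.12306.cn/mormhweb/. (Reportedly the 12306 certificate was self-issued and not signed by a CA.) In that case the request can be sent with an unverified context: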

from urllib.request import Request, urlopen
import ssl

url = 'https://sh.58.com/chuzu/pn'

context = ssl._create_unverified_context()
req = Request(url)
resp = urlopen(req, context=context)
info = resp.read().decode()

print(info)
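
ssl._create_unverified_context() is a private function (note the leading underscore). A roughly equivalent context can be built from the public ssl API, as sketched below. This is an added illustration, the helper name unverified_context is hypothetical, and skipping verification is only reasonable for testing.

```python
import ssl


def unverified_context():
    """Build an SSLContext that skips certificate checks (testing only)."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


context = unverified_context()
print(context.verify_mode == ssl.CERT_NONE)
# It would then be passed on as urlopen(req, context=context).
```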

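9. Disguising yourself

Some sites do not allow programs to access them directly in the way shown above. If the request is not recognized as coming from a browser, the site may not respond at all. To imitate a browser fully, we need to set some request headers.

9.1 Setting request headers

The User-Agent header identifies which browser is making the request.
The code is as follows: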

from fake_useragent import UserAgent

ua = UserAgent()

print(ua.chrome)
print(ua.chrome)
print(ua.ie)
print(ua.firefox)

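To get around hotlink protection: the server checks whether the Referer in the headers is its own site, and some servers will not respond if it is not. So we can also add a Referer to the headers.
The code is as follows:

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Referer': 'http://www.zhihu.com/articles',
}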
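
If the third-party fake_useragent package is not available, the same idea can be sketched with the standard library alone by choosing from a fixed list. This is an added illustration: browser_like_request is a hypothetical helper name, the URL and Referer values are placeholders, and nothing is sent.

```python
from random import choice
from urllib.request import Request

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
]


def browser_like_request(url, referer):
    """Hypothetical helper: attach a random User-Agent and a Referer."""
    headers = {'User-Agent': choice(USER_AGENTS), 'Referer': referer}
    return Request(url, headers=headers)


# Placeholder URL and Referer, used only for illustration.
req = browser_like_request('https://httpbin.org/get', 'https://httpbin.org/')
print(req.get_header('Referer'))
print(req.get_header('User-agent'))
```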

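9.2 Setting a proxy

Suppose a site counts how often an IP visits within a certain period and blocks it once the count gets too high. You can set up some proxy servers to do the work for you and switch to another one every so often, so the site has no idea who is up to mischief. Sweet!
Types:
1. Transparent proxy: the target site knows you are using a proxy and also knows your source IP address. This clearly defeats the purpose of using a proxy here.
2. Anonymous proxy: the anonymity level is fairly low. The site knows you are using a proxy but does not know your source IP address.
3. High-anonymity proxy: the safest option. The target site knows neither that you are using a proxy nor your source IP.
The code is as follows: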

from urllib.request import Request
from fake_useragent import UserAgent
from urllib.request import build_opener
from urllib.request import ProxyHandler

url = 'http://httpbin.org/get'
headers = {'User-Agent': UserAgent().chrome}
req = Request(url, headers=headers)

# Build the handler
# handler = ProxyHandler({'type': 'ip:port'})
handler = ProxyHandler({'http': '39.137.77.66:8080'})
# handler = ProxyHandler({'type': 'name:password@ip:port'})
# handler = ProxyHandler({'http': '398707160:j8inhg2g@62.234.48.180:16817'})
opener = build_opener(handler)
resp = opener.open(req)
info = resp.read().decode()
print(info)
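
"Switching to another proxy every so often" can be sketched by picking a proxy at random each time an opener is built. This is an added illustration under clear assumptions: the addresses come from the reserved documentation range 192.0.2.0/24 and are not real proxies, and opener_with_random_proxy is a hypothetical helper name. No request is sent.

```python
from random import choice
from urllib.request import OpenerDirector, ProxyHandler, build_opener

# Placeholder addresses from the reserved documentation range 192.0.2.0/24.
# Replace them with proxies you are actually allowed to use.
PROXIES = ['192.0.2.10:8080', '192.0.2.11:8080', '192.0.2.12:8080']


def opener_with_random_proxy(proxies=PROXIES):
    """Hypothetical helper: return (opener, proxy) using one random proxy."""
    proxy = choice(proxies)
    handler = ProxyHandler({'http': proxy})
    return build_opener(handler), proxy


opener, proxy = opener_with_random_proxy()
print('Using proxy:', proxy)
# opener.open(req) would now route http:// requests through that proxy.
```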

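10. Cookie

Why use cookies?

A cookie is data (usually encrypted) that some sites store on the user's local machine in order to identify the user and track the session.

For example, some sites only show certain pages after login, and fetching such a page before logging in is not allowed. We can use urllib to keep the cookie from our login and then fetch the other pages.
With the approach below, where the cookie is pasted into the headers by hand, the cookie is not saved automatically.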

from urllib.request import Request, urlopen
from fake_useragent import UserAgent

url = 'http://www.sxt.cn/index/user.html'
# With this approach, cookies are not saved automatically
headers = {
    'User-Agent': UserAgent().chrome,
    'Cookie': 'UM_distinctid=1684b8a5d3cf1-0e6a16f6bdf837-36664c08-15f900-1684b8a5d3d42; 53gid2=10790522629008; 53revisit=1547455593151; zg_did=%7B%22did%22%3A%20%2216a28d7cd8a2f0-058b4b94de1c1a-36664c08-100200-16a28d7cd8c363%22%7D; 53gid0=10790522629008; 53gid1=10790522629008; CNZZDATA1261969808=1042060724-1547454548-%7C1555483089; PHPSESSID=44sb6bp4uvunuv7t087np8o5o6; visitor_type=old; 53kf_72085067_from_host=www.sxt.cn; 53kf_72085067_land_page=http%253A%252F%252Fwww.sxt.cn%252F; kf_72085067_land_page_ok=1; 53kf_72085067_keyword=http%3A%2F%2Fwww.sxt.cn%2Findex%2Flogin%2Flogin.html; zg_c1e08f0fa5e3446d854919ffa754d07f=%7B%22sid%22%3A%201555487103205%2C%22updated%22%3A%201555487125017%2C%22info%22%3A%201555463392661%2C%22superProperty%22%3A%20%22%7B%5C%22%E5%BA%94%E7%94%A8%E5%90%8D%E7%A7%B0%5C%22%3A%20%5C%22%E8%AF%B8%E8%91%9Bio%5C%22%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%7D%22%2C%22referrerDomain%22%3A%20%22%22%2C%22landHref%22%3A%20%22http%3A%2F%2Fwww.sxt.cn%2F%22%2C%22zs%22%3A%200%2C%22sc%22%3A%200%7D'
}
req = Request(url, headers=headers)

resp = urlopen(req)
info = resp.read().decode()
print(info)

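10.1 Opener

When you fetch a URL you use an opener (an instance of urllib.request.OpenerDirector). So far we have always used the default opener, namely urlopen. It can be thought of as a special opener instance that only takes url, data, and timeout.
If we need cookies, this opener alone is not enough, so we create a more general opener that can handle cookies.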

from urllib.request import Request, urlopen
from fake_useragent import UserAgent
from urllib.parse import urlencode
from urllib.request import build_opener
from urllib.request import HTTPCookieProcessor

# Login URL
login_url = 'http://www.sxt.cn/index/login/login'

# Personal information page
info_url = 'http://www.sxt.cn/index/user.html'

headers = {'User-Agent': UserAgent().chrome}
# Perform the login
params = {
    'user': '17703181473',
    'password': '123456'
}
f_data = urlencode(params).encode()
login_req = Request(login_url, data=f_data, headers=headers)
# Build a handler that can store cookies
handler = HTTPCookieProcessor()
# Build an opener that can store cookies
opener = build_opener(handler)
login_resp = opener.open(login_req)
# login_resp = urlopen(login_req)
info = login_resp.read().decode()
# print(info)

# Reuse the same opener so the login cookie is sent with the next request
info_req = Request(info_url, headers=headers)
info_resp = opener.open(info_req)
print(info_resp.read().decode())

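10.2 Cookielib

The cookielib module (named http.cookiejar in Python 3) mainly provides objects that can store cookies, so that it can be used together with urllib to access Internet resources. The module is very powerful: an object of its CookieJar class can capture cookies and resend them on later requests, which makes things like simulated login possible. Its main classes are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.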

from urllib.request import Request, build_opener, HTTPCookieProcessor
from fake_useragent import UserAgent
from urllib.parse import urlencode
from http.cookiejar import MozillaCookieJar

def get_cookie():
    # Log in
    url = 'http://www.sxt.cn/index/login/login'
    headers = {'User-Agent': UserAgent().chrome}
    data = {
        'user': '17703181473',
        'password': '123456'
    }
    f_data = urlencode(data).encode()
    req = Request(url, data=f_data, headers=headers)
    cookie_jar = MozillaCookieJar()
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    resp = opener.open(req)
    info = resp.read().decode()
    print(info)
    # Save the cookies to a file
    cookie_jar.save('cookie.txt',ignore_discard=True, ignore_expires=True)

def use_cookie():
    # Load the cookies from the file
    url = 'http://www.sxt.cn/index/user.html'
    headers = {'User-Agent': UserAgent().chrome}
    req = Request(url, headers=headers)
    cookie_jar = MozillaCookieJar()
    cookie_jar.load('cookie.txt', ignore_expires=True, ignore_discard=True)
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    resp = opener.open(req)
    info = resp.read().decode()
    print(info)

if __name__ == '__main__':
    get_cookie()
    # use_cookie()
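
The save/load pair can also be tried without a login server. The sketch below is an added illustration: it builds one cookie by hand, saves it with MozillaCookieJar, and loads it back from a temporary file. The helper names make_cookie and roundtrip, the cookie values, and the example.com domain are all made up.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def roundtrip(path):
    """Save one cookie to path, load it into a new jar, return name -> value."""
    jar = MozillaCookieJar()
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    # ignore_discard=True keeps session cookies, which have no expiry time.
    jar.save(path, ignore_discard=True, ignore_expires=True)

    loaded = MozillaCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in loaded}


with tempfile.TemporaryDirectory() as tmp:
    print(roundtrip(os.path.join(tmp, 'cookie.txt')))
```

Without ignore_discard=True the session cookie would be skipped on save, which is why both get_cookie and use_cookie above pass it.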
