#9: The Road to Advanced Python Crawlers - Crawler Overview

Latest recommended article published on 2024-08-07 08:19:34

lrzbupt

Article link: https://blog.csdn.net/lrzbupt/article/details/105238458

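Basic crawler workflow

Select a set of seed URLs.
Put them into the queue of URLs waiting to be crawled.
Read a URL from the queue, resolve DNS to get the host IP, download the page for that URL and store it in the page store, then move the URL into the crawled queue.
Go through the URLs in the crawled queue: extract further URLs from the downloaded pages, compare them with the already-crawled URLs to remove duplicates, put the new ones into the waiting queue, and repeat.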
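
The loop above can be sketched in a few lines of Python. This is only a minimal illustration, not a production crawler: the names `crawl`, `LinkExtractor` and `FAKE_WEB` are made up for this example, the "web" is an in-memory dict so the snippet runs without network access, and the DNS step is left out because a real download function (for example one built on urllib) would handle it internally.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` must return HTML text, or None on failure."""
    frontier = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)         # every URL ever queued, used for deduplication
    pages = {}                # the "page store": url -> html
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts before comparing
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP downloads
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

Keeping a single `seen` set for both queued and crawled URLs is a simplification of the two queues described above; it gives the same deduplication effect with less bookkeeping.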

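Crawlers in Python

Python 3.x merges urllib and urllib2 into a single package.
Python 2.x has these libraries available: urllib, urllib2, urllib3, httplib, httplib2, requests
Python 3.x has these libraries available: urllib, urllib3, httplib2, requests
If you only use Python 3.x, it is enough to remember that there is a library called urllib. Both Python 2.x and Python 3.x have urllib3 and requests, and neither is part of the standard library. urllib3 provides thread-safe connection pooling, file POST support and more, and has little to do with urllib or urllib2. requests bills itself as "HTTP for Humans" and is simpler and more convenient to use.
In Python 3.x, urllib2 was merged into urllib, and the package was then split into the following modules: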

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files
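
As a quick illustration of the two helper modules, here is a small sketch that runs without any network access. The URLs, the `my-crawler` user agent string and the robots.txt rules are made-up examples; in a real crawler the rules would normally be downloaded with `RobotFileParser.set_url()` and `read()` rather than passed in by hand.

```python
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse("https://blog.csdn.net/lrzbupt/article/details/105238458?from=x#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against the page it was found on
joined = urljoin("https://blog.csdn.net/lrzbupt/article/", "../list?page=2")
print(joined)

# Check URLs against robots.txt rules supplied as a list of lines
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
blocked = robots.can_fetch("my-crawler", "https://example.test/private/data")
allowed = robots.can_fetch("my-crawler", "https://example.test/public/page")
print(blocked, allowed)
```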

In Python 3.x, with urllib2 merged into urllib, some commonly used names have changed as well:

import urlparse in Python 2.x becomes import urllib.parse in Python 3.x
urllib2.urlopen or urllib.urlopen (deprecated) in Python 2.x becomes urllib.request.urlopen in Python 3.x
urllib2.Request in Python 2.x becomes urllib.request.Request in Python 3.x
urllib.quote in Python 2.x becomes urllib.parse.quote in Python 3.x
urllib.urlencode in Python 2.x becomes urllib.parse.urlencode in Python 3.x
cookielib.CookieJar in Python 2.x becomes http.cookiejar.CookieJar in Python 3.x
Exception handling: urllib2.URLError and urllib2.HTTPError in Python 2.x become urllib.error.URLError and urllib.error.HTTPError in Python 3.x

Original source: https://blog.csdn.net/jiduochou963/article/details/87564467
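
To make the Python 3.x names concrete, the short sketch below exercises the encoding helpers from urllib.parse and shows how the two exception classes relate. The sample strings are arbitrary. Since HTTPError is a subclass of URLError, an `except HTTPError` clause has to come before `except URLError` if both are used.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote, unquote, urlencode

# quote() percent-encodes a single string; unquote() reverses it
text = "爬虫 python"
encoded = quote(text)
print(encoded)
print(unquote(encoded))

# urlencode() turns a dict into a query string or form body
query = urlencode({"Keyword": "blog:lrz", "pageindex": 1})
print(query)

# HTTPError (server answered with an error status) is a special case of URLError
is_subclass = issubclass(HTTPError, URLError)
print(is_subclass)
```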

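GET and POST requests in Python

The parameters of urlopen are:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
(cafile, capath and cadefault are deprecated in favor of context.)
Commonly used parameters:
url: the address to request; it can be a plain URL string or a Request object.
data: optional. Note that if data is given, the request is sent as a POST. The value must be bytes; if the data is a dict, convert it with urllib.parse.urlencode and then encode it to bytes:
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')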
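
The effect of the data parameter can be checked without sending anything, because a Request object already knows which HTTP method it will use. The snippet below is a small sketch; the URL is a placeholder, not a real endpoint.

```python
import urllib.parse
import urllib.request

url = "http://example.test/search"  # placeholder URL, never contacted here

# No data: the request will be sent as GET
get_req = urllib.request.Request(url)

# With data (bytes): the request will be sent as POST
body = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")
post_req = urllib.request.Request(url, data=body)

print(get_req.get_method(), get_req.data)
print(post_req.get_method(), post_req.data)
```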

import urllib.parse
import urllib.request

#----------GET----------
url = 'http://www.zhihu.com'
# Build the request
request = urllib.request.Request(url)
# Send it and read the response
response = urllib.request.urlopen(request)
html = response.read()
print(html)


#------------------POST----------
url = 'http://www.zhihu.com/login'
postdata = {'username': 'lrz', 'password': 'lllrrrzzz'}
# URL-encode the form fields, then convert to bytes as urlopen requires
data = bytes(urllib.parse.urlencode(postdata), encoding='utf-8')
req = urllib.request.Request(url, data)
res = urllib.request.urlopen(req)
html2 = res.read()
print(html2)

If the server still refuses access, it may be checking the request headers to decide whether the request comes from a browser.

#------------------POST----------
import urllib.parse
import urllib.request

url = 'http://www.zhihu.com/login'

user_agent = 'Mozilla/4.0(compatible; Chrome 80.0.3987.116; Windows 10)'
referer = 'http://www.zhihu.com/'

postdata = {'username': 'lrz', 'password': 'lllrrrzzz'}
data = bytes(urllib.parse.urlencode(postdata), encoding='utf-8')

headers = {'User-Agent': user_agent, 'Referer': referer}

req = urllib.request.Request(url, data, headers)
res = urllib.request.urlopen(req)
# Alternatively, build the request step by step
#req = urllib.request.Request(url)
#req.add_header('User-Agent', user_agent)
#req.add_header('Referer', referer)
#req.data = data  # Request.add_data() from Python 2 no longer exists in Python 3
html2 = res.read()
print(html2)

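Cookie handling

Use cookiejar.CookieJar() from the http package to store cookies, and pass it to request.HTTPCookieProcessor() from urllib so that cookies set by the server are captured and sent back on later requests.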

import urllib.request
from http import cookiejar

# The jar collects every cookie the server sets through this opener
cookie = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com/login')
for item in cookie:
    print(item.name + ':' + item.value)

You can also add a cookie yourself

opener = urllib.request.build_opener()
opener.addheaders.append(('Cookie','e-mail='+'lrz@qq.com'))
req = urllib.request.Request('http://www.youtube.com')
response = opener.open(req)
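
To see the cookie jar at work without depending on an external site, the sketch below starts a throwaway HTTP server on localhost. Everything about that server (the `/login` and `/profile` paths, the `session=abc123` cookie, the `CookieDemoHandler` class) is invented for this demonstration. The point is only that a cookie set by the first response is stored in the jar and sent back automatically on the second request.

```python
import threading
import urllib.request
from http import cookiejar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieDemoHandler(BaseHTTPRequestHandler):
    """/login sets a cookie; every response body echoes the Cookie header received."""

    def do_GET(self):
        body = (self.headers.get("Cookie") or "").encode("utf-8")
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def run_cookie_demo():
    server = HTTPServer(("127.0.0.1", 0), CookieDemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        jar = cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),           # ignore any system proxy settings
            urllib.request.HTTPCookieProcessor(jar),   # store and resend cookies
        )
        opener.open(base + "/login").read()
        stored = {c.name: c.value for c in jar}
        echoed = opener.open(base + "/profile").read().decode("utf-8")
        return stored, echoed
    finally:
        server.shutdown()
        server.server_close()


stored, echoed = run_cookie_demo()
print(stored)   # cookies captured from the /login response
print(echoed)   # Cookie header the server saw on the /profile request
```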

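Getting the HTTP status code and detecting redirects

For error responses, the status code can be obtained by catching HTTPError with try...except and reading its code attribute, e.code.
To detect a redirect, compare response.geturl() with the URL originally requested; if they differ, a redirect took place.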
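
Both checks can be tried against a small local server. The sketch below is illustrative only: the handler, its `/old`, `/new` and `/missing` paths, and the helper `fetch_status` are names made up for this example. It uses `response.status` for successful responses and `e.code` for error responses.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusDemoHandler(BaseHTTPRequestHandler):
    """/old redirects to /new, /new answers 200, anything else answers 404."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/new":
            body = b"moved here"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(opener, url):
    """Return (status_code, final_url, redirected) for one request."""
    try:
        with opener.open(url, timeout=5) as response:
            final_url = response.geturl()
            return response.status, final_url, final_url != url
    except urllib.error.HTTPError as e:
        return e.code, url, False


def run_status_demo():
    server = HTTPServer(("127.0.0.1", 0), StatusDemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # no proxy
    try:
        return {path: fetch_status(opener, base + path)
                for path in ("/new", "/old", "/missing")}
    finally:
        server.shutdown()
        server.server_close()


results = run_status_demo()
for path, (code, final_url, redirected) in results.items():
    print(path, code, final_url, redirected)
```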

Proxies

Proxies are configured with urllib.request.ProxyHandler

import random
import time
from urllib import request

count = 0  # number of successful fetches so far

def Proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy IP: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current User-Agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Waiting: %s s' % sleep_time)
    time.sleep(sleep_time)
    print('Fetching')
    headers = {'User-Agent': user_agent,
               'Accept': r'application/json, text/javascript, */*; q=0.01',
               'Referer': r'https://www.cnblogs.com'}
    # The target URL is https, so map both schemes to the proxy
    proxy_support = request.ProxyHandler({'http': proxy_ip, 'https': proxy_ip})
    opener = request.build_opener(proxy_support)
    #----------------------------------------------
    request.install_opener(opener)
    #-------Note: after install_opener, every urlopen call goes through this proxy--------
    req = request.Request(r'https://www.cnblogs.com/kmonkeywyl/p/8409715.html', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('******Failed to open: %s******' % e)
    else:
        # Count and report only the fetches that actually succeeded
        count += 1
        print('OK! %s successful fetches in total!' % count)

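Note that after install_opener, all traffic sent with urlopen goes through the proxy of that global opener. If you want to use several proxy settings, call opener.open on individual openers instead of registering a global one.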
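
One possible way to keep several proxies side by side is to build one opener per proxy and pick between them per request. The sketch below assumes that approach; the proxy addresses are fictitious placeholders and the helper names `make_proxy_opener` and `next_opener` are invented for this example. Nothing is installed globally, so plain `urlopen` calls elsewhere are unaffected.

```python
import itertools
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder proxy addresses; replace with proxies you are allowed to use
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

# One opener per proxy, created once and reused
openers = {proxy: make_proxy_opener(proxy) for proxy in PROXIES}
rotation = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy_url, opener) for the next proxy in round-robin order."""
    proxy = next(rotation)
    return proxy, openers[proxy]


# Usage (left commented out because the proxies above do not exist):
# proxy, opener = next_opener()
# html = opener.open("http://example.test/", timeout=5).read()
```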

A more human-friendly module: Requests

HTTP requests

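import requests

#GET
r = requests.get("http://www.baidu.com")
#POST
postdata = {"key": "value"}
r = requests.post("http://www.zhihu.com/login", data=postdata)
#Others: requests.put(url, ...) and requests.delete(url, ...) also take the URL as first argument
#Complex URL
r = requests.get("http://zzk.cnblogs.com/s/blogpost?Keyword=blog:lrz&pageindex=1")
#can also be written as
payload = {'Keyword': 'blog:lrz', 'pageindex': 1}
headers = {'User-Agent': 'Mozilla/4.0(compatible; Chrome 80.0.3987.116; Windows 10)'}  # example values
cookies = {'e-mail': 'lrz@qq.com'}  # example values
r = requests.get("http://zzk.cnblogs.com/s/blogpost", params=payload, headers=headers, cookies=cookies)
#Response and encoding
r.content  # body as bytes
r.text  # body as text
r.encoding = 'utf-8'  # assign to change the encoding used by r.text
#Status code and response headers
r.status_code
r.headers.get('content-type')
#cookies
for cookie in r.cookies.keys():
    print(cookie + ':' + r.cookies.get(cookie))
#If the cookie values do not matter and cookies should simply be carried along across login redirects, use a Session
s = requests.Session()
r = s.get("http://xxxx", allow_redirects=True)
datas = {'name': 'lrz', 'password': 'asd'}
r = s.post("http://xxxx", allow_redirects=True, data=datas)
#Proxies: a dict mapping scheme ('http', 'https') to proxy URL
proxies = {}
r = requests.get("http://xxxx", proxies=proxies)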