Python Crawler Development and Project Practice 3: A First Look at Crawlers

CopperDong

Article link: https://blog.csdn.net/QFire/article/details/78984687

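3.1 Overview of Web Crawlers

Concept: by system structure and implementation technique, web crawlers fall roughly into four types: general-purpose crawlers, focused crawlers, incremental crawlers, and deep web crawlers. Real crawler systems usually combine several of these techniques.

Search engines: these are general-purpose crawlers, but they have some limitations:

Search results contain many pages the user does not care about.

Limited server resources conflict with the effectively unlimited data on the web.

Search engines cope poorly with information-dense, structured data such as audio and video.

Keyword-based retrieval struggles to support queries expressed in terms of semantic information.

Focused crawlers, which fetch only the web resources relevant to a target, emerged to address these problems.

Focused crawler: a program that automatically downloads web pages and prepares data resources for topic-oriented user queries.

Incremental crawler: re-fetches pages that have changed and crawls only newly created pages. This saves time and storage but increases algorithmic complexity and implementation difficulty.

Deep web crawler: web pages divide into surface pages (those a search engine can index) and deep pages (those behind forms).

Use cases: BT search sites (https://www.cilisou.org/) and cloud-drive search sites (http://www.pansou.com/).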

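The basic workflow is as follows:

Start with a carefully chosen set of seed URLs.
Put these URLs into the to-crawl URL queue.
Take a URL from the to-crawl queue, resolve DNS to get the IP, download the page, store it, and move the URL into the crawled URL queue.
Analyze the pages behind the URLs in the crawled queue, extract the URLs they contain, deduplicate them against the URLs already known, put the new ones into the to-crawl queue, and start the next cycle.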
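
The workflow above can be sketched in a few lines. This is a minimal sketch rather than a production crawler, and it assumes Python 3 (the rest of this post uses Python 2). The names `crawl`, `LinkExtractor`, and the `fetch` callback are hypothetical; `fetch(url)` stands in for the DNS lookup and download step and returns the page HTML, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    to_crawl = deque(seed_urls)   # to-crawl URL queue
    seen = set(seed_urls)         # every URL ever queued, used for deduplication
    crawled = []                  # crawled URLs, in visit order
    pages = {}                    # url -> stored page content

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        html = fetch(url)         # DNS resolution and download happen in here
        crawled.append(url)
        if html is None:
            continue
        pages[url] = html         # store the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:   # deduplicate before queueing
                seen.add(absolute)
                to_crawl.append(absolute)
    return crawled, pages
```

Passing `fetch` in as a parameter keeps the queue logic separate from the HTTP code covered in section 3.2, so the loop can be exercised with an in-memory dict (`crawl(seeds, site.get)`) before pointing it at a real site.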

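3.2 Implementing HTTP Requests in Python

Python offers three ways to make HTTP requests: urllib2/urllib, httplib/urllib, and Requests.

urllib2/urllib: two built-in Python 2 modules. urllib2 does most of the work, with urllib in a supporting role.

1. A complete request/response cycle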

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html

The same exchange can be split into two steps, one for the request and one for the response:

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request)
html = response.read()
print html

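POST requests: pass URL-encoded form data as the second argument to urllib2.Request, as the example under item 2 does.

Sometimes a server rejects your request because it inspects the request headers. This is a common anti-crawler technique.

2. Handling request headers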

import urllib
import urllib2
url = 'http://www.xxxx.com/login'
user_agent = ''
referer = 'http://www.xxxx.com/'
postdata = {'username': 'qiye',
             'password': 'qiye_pass' }
# Set the header fields
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
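
For readers on Python 3, where urllib2 and urllib were merged into urllib.request and urllib.parse, the same request might be built as below. This is a sketch under that assumption; `build_login_request`, the example.com URLs, and the User-Agent string are placeholders, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, form, user_agent, referer):
    """Build (but do not send) a POST request with custom headers."""
    data = urlencode(form).encode("ascii")   # Python 3 requires bytes for the body
    headers = {"User-Agent": user_agent, "Referer": referer}
    return Request(url, data=data, headers=headers)


req = build_login_request(
    "http://www.example.com/login",          # placeholder URL
    {"username": "qiye", "password": "qiye_pass"},
    "Mozilla/5.0 (X11; Linux x86_64)",       # placeholder User-Agent
    "http://www.example.com/",
)
print(req.get_method())                      # POST, because data is set
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would then send it, just as `urllib2.urlopen(req)` does in the Python 2 version.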

3. Cookie handling: use the cookielib.CookieJar class to manage cookies

import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
	print item.name + ':' + item.value

SessionID_R3:4y3gT2mcOjBQEQ7RDiqDz6DfdauvG8C5j6jxFg8jIcJvE5ih4USzM0h8WRt1PZomR1C9755SGG5YIzDJZj7XVraQyomhEFA0v6pvBzV94V88uQqUyeDnsMj8MALBSKr
4. Setting a timeout

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request, timeout=2)
html = response.read()
print html

5. Getting the HTTP response code

import urllib2
try:
	response = urllib2.urlopen('http://www.google.com')
	print response.getcode()
except urllib2.HTTPError as e:
	if hasattr(e, 'code'):
		print 'Error code:', e.code
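
To see both the success path and the error path without depending on an external site, here is a self-contained sketch. It assumes Python 3, where HTTPError lives in urllib.error. The local test server `DemoHandler`, its `/ok` path, and the helper `status_of` are hypothetical names made up for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /ok answers 200, anything else 404."""

    def do_GET(self):
        if self.path == "/ok":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):   # keep the demo quiet
        pass


def status_of(opener, url):
    """Return the HTTP status code for `url`, on success or on HTTP error."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.getcode()
    except HTTPError as e:          # raised for 4xx and 5xx responses
        return e.code


def demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)   # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    opener = build_opener(ProxyHandler({}))   # ignore proxy environment variables
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        return status_of(opener, base + "/ok"), status_of(opener, base + "/missing")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the pair of codes for the existing and the missing path. The point to notice is that a successful response reports its code through `getcode()`, while a 4xx or 5xx response surfaces as an exception whose `code` attribute carries the status.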

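6. Redirects: by default urllib2 automatically follows HTTP 3XX responses.

To find out whether a redirect happened, check whether the response URL differs from the request URL: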

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
isRedirected = response.geturl() != 'http://www.zhihu.com'
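
The URL comparison can be tried against a local server that issues a known redirect. The sketch below assumes Python 3; the server class `RedirectHandler`, its `/old` and `/new` paths, and the helper `final_url` are hypothetical and exist only for this demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new with a 302."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"new page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):   # keep the demo quiet
        pass


def final_url(opener, url):
    """Open `url`, following redirects, and return the URL finally served."""
    with opener.open(url, timeout=5) as response:
        return response.geturl()


def demo():
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)   # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    opener = build_opener(ProxyHandler({}))   # ignore proxy environment variables
    requested = "http://127.0.0.1:%d/old" % server.server_port
    try:
        final = final_url(opener, requested)
        return requested, final, final != requested
    finally:
        server.shutdown()
        server.server_close()
```

Here `demo()` returns the requested URL, the final URL, and a flag that is true when the two differ, which is the signal that a redirect was followed.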

7. Proxy settings: by default urllib2 reads the http_proxy environment variable to set the HTTP proxy. We usually avoid that and set the proxy dynamically in the program with ProxyHandler.

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.zhihu.com/')
print response.read()

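install_opener() sets the global opener. If you want to use two different proxies, a better approach is to call each opener's open method directly instead of the global urlopen.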

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
response = opener.open('http://www.zhihu.com/')
print response.read()
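
As a sketch of the two-proxy case, assuming Python 3's urllib.request and two made-up proxy addresses, each proxy gets its own opener and nothing is installed globally. `make_proxy_opener` is a hypothetical helper name.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxy_address):
    """Return an opener that routes plain-HTTP requests through one proxy."""
    return build_opener(ProxyHandler({"http": proxy_address}))


# Hypothetical proxy addresses; nothing is contacted until open() is called.
opener_a = make_proxy_opener("127.0.0.1:8087")
opener_b = make_proxy_opener("127.0.0.1:8088")

# Each opener would then be used independently, for example:
#   html = opener_a.open("http://www.zhihu.com/").read()
```

Because neither opener is passed to install_opener(), the global urlopen is left alone and the two proxies cannot interfere with each other.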

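httplib/urllib: httplib is a low-level module that exposes every step of building an HTTP request, but it offers relatively few features.

Requests: a friendlier third-party module. Install it with pip install requests.

1. A complete request/response cycle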

import requests
r = requests.get('http://www.baidu.com')
print r.content

2. Responses and encoding

import requests
r = requests.get('http://www.baidu.com')
print 'content-->' + r.content
print 'text-->' + r.text
print 'encoding-->' + r.encoding
r.encoding = 'utf-8'
print 'new text-->' + r.text

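chardet (pip install chardet) is an excellent module for detecting the encoding of strings and files.

Assign the encoding that chardet detects to r.encoding, and r.text will then decode without garbled characters.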

import requests
import chardet
r = requests.get('http://www.baidu.com')
print chardet.detect(r.content)
r.encoding = chardet.detect(r.content)['encoding']
print r.text
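
Why the encoding matters can be shown with the standard library alone. The sketch below assumes Python 3; `decode_body` is a hypothetical helper, and the "response body" is simulated by encoding a string as GBK rather than fetched from a server. ISO-8859-1 is used as the wrong guess because, as far as I know, Requests falls back to it for text responses that declare no charset.

```python
def decode_body(raw_bytes, encoding):
    """Decode a response body, replacing undecodable bytes instead of raising."""
    return raw_bytes.decode(encoding, errors="replace")


original = "网络爬虫"                      # "web crawler"
body = original.encode("gbk")            # pretend the server sent GBK bytes

wrong = decode_body(body, "iso-8859-1")  # a wrong guess at the encoding
right = decode_body(body, "gbk")         # the encoding a detector should report

print(wrong == original)   # False: the text is garbled
print(right == original)   # True: the text round-trips
```

Setting r.encoding before reading r.text plays the same role as choosing the second argument to `decode_body` here.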

Streaming mode

import requests
r = requests.get('http://www.baidu.com', stream=True)
print r.raw.read(10)

3. Handling request headers

import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
print r.content

4. Handling the response code and response headers

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
if r.status_code == requests.codes.ok:
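	print r.status_code    # response code
	print r.headers        # response headers
	print r.headers.get('content-type')  # recommended: returns None if the header is missing
	print r.headers['content-type']      # not recommended: raises KeyError if the header is missing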
else:
	r.raise_for_status()

5. Cookie handling

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
# Iterate over every cookie name and print its value
for cookie in r.cookies.keys():
	print cookie + ":" + r.cookies.get(cookie)

Sending custom cookie values:

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye', age='10')
r = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)
print r.text

Requests provides the Session concept, which lets us visit pages in sequence without managing cookie values ourselves.

# -*- coding: utf-8 -*-
import requests
loginUrl = "http://www.xxx.com/login"
s = requests.Session()
# First visit as a guest: the server assigns a cookie
r = s.get(loginUrl, allow_redirects=True)
datas = {'name':'qiye', 'passwd': 'qiye'}
# POST to the login URL: guest privileges are upgraded to member privileges
r = s.post(loginUrl, data=datas, allow_redirects=True)
print r.text

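This is a problem encountered in practice. If you skip the first step of visiting the login page and POST straight to the login URL, the site treats you as an illegitimate user, because it is the visit to the login page that assigns the cookie, and that cookie must accompany the POST request. Handling cookies through a Session in this way will come up often later.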
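
The effect of that first visit can be reproduced locally. The sketch below is not the Requests API: to stay runnable without third-party packages it swaps in Python 3's standard-library opener with a CookieJar, which carries cookies between requests much as a Session does. The server `LoginHandler`, the `guest=1` cookie, and the helper `login` are all hypothetical.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: GET hands out a guest cookie, POST requires it."""

    def _reply(self, status, body, cookie=None):
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._reply(200, b"login form", cookie="guest=1; Path=/")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)                       # consume the form data
        if "guest=1" in self.headers.get("Cookie", ""):
            self._reply(200, b"welcome, member")
        else:
            self._reply(403, b"illegal user")

    def log_message(self, *args):                     # keep the demo quiet
        pass


def login(url, visit_first):
    """POST credentials, optionally visiting the login page first."""
    jar = CookieJar()                                 # shared across both requests
    opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(jar))
    try:
        if visit_first:
            with opener.open(url, timeout=5) as page:
                page.read()                           # server sets the cookie here
        with opener.open(url, data=b"name=qiye&passwd=qiye", timeout=5) as r:
            return r.getcode()
    except HTTPError as e:
        return e.code


def demo():
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)   # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/login" % server.server_port
    try:
        return login(url, visit_first=True), login(url, visit_first=False)
    finally:
        server.shutdown()
        server.server_close()
```

With the first visit the POST succeeds, and without it the server answers 403, so `demo()` returns one status code for each case.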

6. Redirects and history

Redirect handling is controlled by the allow_redirects parameter, and r.history shows the redirect history.

# -*- coding: utf-8 -*-
import requests
r = requests.get('http://github.com')   # redirects to https://github.com
print r.url
print r.status_code
print r.history

7. Timeout settings

requests.get('http://github.com', timeout=2)

8. Proxy settings

# -*- coding: utf-8 -*-
import requests
proxies = {
	"http": "http://0.10.10.01:3234",
	"https": "http://0.0.0.2:1020",
}
r = requests.get('http://github.com', proxies=proxies)

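Proxies can also be configured through the HTTP_PROXY and HTTPS_PROXY environment variables, but this is less common.

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax.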
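
A small sketch of assembling such a proxy URL follows, assuming Python 3. The helper `proxy_url`, the credentials, and the proxy address are hypothetical. Percent-encoding the user name and password is my own assumption about how to keep characters such as @ and : from breaking the URL; check it against your proxy before relying on it.

```python
from urllib.parse import quote, urlsplit


def proxy_url(user, password, host, port, scheme="http"):
    """Build scheme://user:password@host:port/ with credentials percent-encoded."""
    return "%s://%s:%s@%s:%d/" % (
        scheme, quote(user, safe=""), quote(password, safe=""), host, port)


# Hypothetical credentials and proxy address.
url = proxy_url("user", "p@ss:word", "10.10.1.10", 3128)
proxies = {"http": url, "https": url}   # same shape as the proxies dict above
parts = urlsplit(url)
print(url)
print(parts.hostname, parts.port)
```

The resulting `proxies` dict can be passed to requests.get in the same way as in the proxy example above.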
