Python Web Crawlers (Part 1)

lzkmylz

Article link: https://blog.csdn.net/lzkmylz/article/details/51820137

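Study notes on writing web crawlers in Python.

The library used is urllib, and the Python version is 3.5.

import urllib.request loads the module used to access web pages.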

import urllib.request

url = "http://placekitten.com/500/600"
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)

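The code above sets the URL, wraps it in a Request object, and opens the page. The data returned at this point is binary (bytes).

response.read() reads that data as bytes. For a text page, call decode() on the result (for example decode('utf-8')) to get a string; encode() goes the other way, from a string to bytes.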

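response.geturl() returns the URL that was actually retrieved.

response.info() returns the response details (the headers).

response.getcode() returns the HTTP status code.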
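The calls above can be tried without touching a real site. The following is a minimal sketch, assuming a throwaway local server so that it runs offline; HelloHandler and fetch are hypothetical names introduced for illustration and are not part of the original notes.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short UTF-8 page."""

    def do_GET(self):
        body = "<html><body>你好, crawler</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default request logging


def fetch(url):
    """Open url and collect what read(), geturl(), info() and getcode() give."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        raw = response.read()  # bytes
        return {
            "raw": raw,
            "text": raw.decode("utf-8"),  # bytes -> str
            "url": response.geturl(),
            "content_type": response.info().get("Content-Type"),
            "status": response.getcode(),
        }


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
local_url = "http://127.0.0.1:{}/".format(server.server_port)

result = fetch(local_url)
print(result["status"], result["content_type"])
print(result["text"])

server.shutdown()
server.server_close()
```

Against a real site, only the fetch function is needed; the local server exists solely to make the sketch self-contained.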

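Using Inspect Element on the page to be crawled

In the Network tab, pay attention to the GET and POST requests.

In particular, note the User-Agent, the Request URL, and the Form Data.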

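For urlopen, if the data parameter is not set, the request is submitted as a GET by default; if it is set, the request is a POST. data has a specific required format. Start from a dictionary copied from the Form Data, convert it to a URL-encoded string with urllib.parse.urlencode(), and then encode that string to bytes (for example with encode('utf-8')), because Python 3 expects data to be bytes.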
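As a rough illustration of the GET/POST rule, the sketch below builds two Request objects and checks which method each would use, without sending anything. The URL and the form fields are placeholders assumed for this example; real values would be copied from the Form Data shown in the browser.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only.
url = "http://example.com/translate"
form = {"i": "i love my dog", "doctype": "json"}

# dict -> URL-encoded str -> bytes
data = urllib.parse.urlencode(form).encode("utf-8")

get_req = urllib.request.Request(url)         # no data: GET
post_req = urllib.request.Request(url, data)  # data given: POST

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(data)
```

Passing post_req (or url together with data) to urllib.request.urlopen() would then submit the form as a POST.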

The result returned is:

{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"ilove my dog","tgt":"我爱我的狗"}]],"smartResult":{"type":1,"entries":["","我喜欢我的狗","我爱我的小狗","我喜欢"]}}

This is JSON-structured data. Import the json library (import json) to access it more conveniently.
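A minimal sketch of that step, using the result string shown above as input. In a real crawler the string would come from response.read().decode('utf-8'); here it is pasted in directly so the example runs on its own.

```python
import json

# The response text from the notes, pasted in as a string.
raw = (
    '{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,'
    '"translateResult":[[{"src":"ilove my dog","tgt":"我爱我的狗"}]],'
    '"smartResult":{"type":1,"entries":["","我喜欢我的狗","我爱我的小狗","我喜欢"]}}'
)

result = json.loads(raw)  # str -> nested dicts and lists

# translateResult is a list of lists of dicts; take the first entry.
translation = result["translateResult"][0][0]["tgt"]
print(translation)
```

After json.loads(), the data is ordinary Python dictionaries and lists, so individual fields can be reached by key and index instead of by string searching.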

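Modifying the headers to imitate a normal browser visit

headers is a dictionary. You can pass a dictionary directly to Request, or add entries one at a time with add_header().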

header = {}

header['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'

Then build the request with:

req = urllib.request.Request(url, data, header)
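Both ways of setting the header can be compared side by side. This is a small sketch that only constructs the requests and does not send them; the URL is a placeholder assumed for the example.

```python
import urllib.request

url = "http://example.com/"  # placeholder URL, for illustration only

header = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
    )
}

# Option 1: pass the whole dictionary when creating the Request.
req_a = urllib.request.Request(url, headers=header)

# Option 2: create the Request first, then add the header.
req_b = urllib.request.Request(url)
req_b.add_header("User-Agent", header["User-Agent"])

# Request stores header names capitalized, hence "User-agent" here.
print(req_a.get_header("User-agent"))
print(req_b.get_header("User-agent"))
```

The two requests carry the same User-Agent; which form to use is mostly a matter of whether the headers are known up front.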

Delaying submissions with the time module

import time

time.sleep(t), where t is the interval in seconds
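One way to apply the delay is to sleep between consecutive requests. The sketch below assumes a hypothetical helper, visit_with_delay, and a stand-in fake_fetch function so that it runs without network access; in a real crawler fake_fetch would be replaced by a function that calls urllib.request.urlopen().

```python
import time


def visit_with_delay(urls, fetch, delay):
    """Call fetch(url) for each URL, pausing delay seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # no pause is needed before the first request
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download (hypothetical, for illustration)."""
    return "fetched " + url


start = time.monotonic()
pages = visit_with_delay(
    ["http://example.com/a", "http://example.com/b"], fake_fetch, 0.2
)
elapsed = time.monotonic() - start

print(pages)
print("elapsed: {:.2f}s".format(elapsed))
```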

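Using a proxy

1. The parameter is a dictionary: {'type': 'proxy IP:port'}, for example with 'http' as the type.

2. proxy_support = urllib.request.ProxyHandler(dic)

3. Customize and create an opener

opener = urllib.request.build_opener(proxy_support)

Then install the opener, so that later calls to urllib.request.urlopen() also use it

urllib.request.install_opener(opener)

Finally, call

opener.open(url)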

To pick a proxy at random from a list of proxy addresses (iplist):

import random

random.choice(iplist)
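The proxy steps can be put together roughly as follows. This is a sketch under stated assumptions: the addresses in iplist are made-up placeholders from a documentation-only IP range, and the lines that would actually use the network are left as comments so the example runs offline.

```python
import random
import urllib.request

# Hypothetical proxy addresses (documentation range); replace with real ones.
iplist = ["192.0.2.10:8080", "192.0.2.11:3128", "192.0.2.12:80"]

# 1. Pick one proxy at random and build the {'type': 'ip:port'} dictionary.
proxy = random.choice(iplist)
proxy_support = urllib.request.ProxyHandler({"http": proxy})

# 2. Create an opener that routes http requests through that proxy.
opener = urllib.request.build_opener(proxy_support)

# Optional: the opener can also carry default headers such as User-Agent.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

print("using proxy:", proxy)

# 3. With a working proxy, either of these would send the request through it:
# urllib.request.install_opener(opener)   # makes urlopen() use the opener
# response = opener.open("http://example.com/")
```

install_opener() is commented out here only so the sketch leaves the global urlopen() behavior unchanged; calling opener.open() directly works without installing.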
