Python Web Scraping Basics (1) -- urllib.request

Yang-Zhou

Article link: https://blog.csdn.net/snailpeople/article/details/78771470

Overview: The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.

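1. urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Opens the URL url, which can be either a string or a Request object. Note that cafile, capath and cadefault are deprecated and were removed in Python 3.13; use context instead.
For example: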

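import urllib.request

file = urllib.request.urlopen("http://www.baidu.com")
print(file.read())       # read the entire body as one bytes object
# The response is a stream: once read() has consumed it, the two calls
# below return b"" and []. Call them on a fresh response to see real data.
print(file.readline())   # read a single line (bytes)
print(file.readlines())  # read all remaining lines into a list of bytes
print(file.info())       # response headers (meta-information about the page)
print(file.getcode())    # HTTP status code of the response
print(file.geturl())     # URL actually retrieved (after any redirects)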
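To make the stream behaviour of the response object concrete, here is a minimal sketch. It assumes a `data:` URL (which urllib.request handles without any network access) as a stand-in for a real page, and the helper name `fetch_text` is hypothetical. It also passes a timeout and uses a `with` block so that the response is closed automatically.

```python
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Open url, read the body once, and return (decoded text, second read)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()       # consumes the whole stream
        leftover = response.read()   # nothing is left, so this is b""
    return body.decode(encoding), leftover


# Assumption: a data: URL stands in for a real page so the example runs offline.
text, leftover = fetch_text("data:text/plain;charset=utf-8,line1%0Aline2")
print(text)
print(leftover)
```

With a real site you would pass its address instead, and the encoding may need to match what the server actually sends.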

2. urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

Copy a network object denoted by a URL to a local file. If the URL points to a local file, the object will not be copied unless filename is supplied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object). Exceptions are the same as for urlopen().

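This function opens a URL and saves its content to a local file, returning a tuple. It may leave temporary files behind while it runs; calling urllib.request.urlcleanup() removes them. The official documentation lists urlretrieve as a legacy interface that might be deprecated in the future.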

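import urllib.request

# urlretrieve returns a (filename, headers) tuple
result = urllib.request.urlretrieve("http://www.baidu.com", filename="data2.html")
urllib.request.urlcleanup()  # remove temporary files left by urlretrieve
print(result)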
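The reporthook parameter in the signature above is not used in the example, so here is a small sketch of how it might be used to observe download progress. The `data:` URL, the temporary directory, and the names `report` and `progress` are assumptions made so the snippet runs without network access; they are not part of the original article.

```python
import os
import tempfile
import urllib.request

progress = []  # (block_num, block_size, total_size) tuples seen by the hook


def report(block_num, block_size, total_size):
    """Record each progress callback made by urlretrieve."""
    progress.append((block_num, block_size, total_size))


# Assumptions: a data: URL replaces a real site, and the file goes to a temp dir.
url = "data:text/plain;charset=utf-8,hello%20world"
target = os.path.join(tempfile.mkdtemp(), "data2.html")

filename, headers = urllib.request.urlretrieve(url, filename=target, reporthook=report)
urllib.request.urlcleanup()  # remove any temporary files left behind

with open(filename, "rb") as f:
    saved = f.read()
print(filename == target, saved, len(progress))
```

In a real download the hook could print a percentage computed from block_num, block_size and total_size instead of storing the tuples.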

3. Modifying request headers to imitate a browser

Method 1:
Use build_opener().

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Method 2:
Use add_header().

import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# Customize the default User-Agent header value:
req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
r = urllib.request.urlopen(req)
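As a third variation on the two methods above, headers can also be passed to the Request constructor through its headers argument. The sketch below only builds the request and inspects it, so it sends nothing over the network; the header values are placeholders.

```python
import urllib.request

# Assumption: example.com and the header values are placeholders.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://www.python.org/",
}
req = urllib.request.Request("http://www.example.com/", headers=headers)

# Request stores header names capitalized ("User-agent", "Referer", ...),
# so look them up with that spelling.
print(req.get_header("User-agent"))
print(req.has_header("Referer"))
print(sorted(req.header_items()))
```

Inspecting a request this way is a convenient check that the headers are set before calling urlopen(req).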

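4. urllib.parse.quote(string, safe='/', encoding=None, errors=None)
Solves URL-encoding problems. encoding defaults to utf-8; safe lists the characters that should not be quoted; errors defaults to 'strict', meaning unsupported characters raise a UnicodeEncodeError. (The 'replace' default, where invalid sequences are replaced by a placeholder character, belongs to urllib.parse.unquote.)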

import urllib.parse

key = "粑粑"
key_code = urllib.parse.quote(key)  # percent-encode the UTF-8 bytes of key
print(key_code)
url = "http://www.baidu.com/s?wd=" + key_code
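To show what the safe parameter changes in practice, here is a short sketch comparing a few calls. The sample strings are arbitrary, and quote_plus and unquote are related urllib.parse functions not covered in the text above.

```python
import urllib.parse

key = "重庆"
encoded = urllib.parse.quote(key)         # percent-encode the UTF-8 bytes
decoded = urllib.parse.unquote(encoded)   # reverse the encoding

path = "a b/c"
keep_slash = urllib.parse.quote(path)            # "/" is safe by default
encode_all = urllib.parse.quote(path, safe="")   # "/" is encoded too
plus_form = urllib.parse.quote_plus("a b")       # spaces become "+"

print(encoded, decoded, keep_slash, encode_all, plus_form)
```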

5. Steps for a GET request
Build the URL --> construct a Request object with that URL as its argument -->
open the Request object with urlopen() --> process the response

import urllib.request,urllib.parse
url="http://www.baidu.com/s?wd="
key="重庆"
key_code=urllib.parse.quote(key)
urlall=url+key_code
req=urllib.request.Request(urlall)
data=urllib.request.urlopen(req).read()
print(data)
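When a GET request carries more than one parameter, urllib.parse.urlencode can build the whole query string in one step instead of quoting each value by hand. This is a sketch under assumptions: the "pn" parameter is only illustrative, and the request is built but not sent.

```python
import urllib.parse
import urllib.request

base_url = "http://www.baidu.com/s"
# Assumption: "pn" is only an illustrative second parameter.
params = {"wd": "重庆", "pn": 10}

query = urllib.parse.urlencode(params)   # encodes each pair and joins with "&"
req = urllib.request.Request(base_url + "?" + query)

print(req.full_url)
print(req.get_method())
# To actually send it: data = urllib.request.urlopen(req).read()
```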

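6. Steps for a POST request
1) Set the URL.
2) Build the form data and encode it with urllib.parse.urlencode, then convert the result to bytes.
3) Create the Request object.
4) Add header information with add_header().
5) Submit with urllib.request.urlopen() and process the response.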
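The five steps above can be sketched roughly as follows. The URL and the form field names are hypothetical placeholders, and the final submission is left as a comment so that the snippet runs without contacting any server.

```python
import urllib.parse
import urllib.request

# 1) Assumption: a placeholder URL, not a real login endpoint.
url = "http://www.example.com/login"

# 2) Assumption: hypothetical form fields. urlencode returns str,
#    so the result is encoded to bytes before it is used as POST data.
form = {"name": "user", "pass": "secret"}
post_data = urllib.parse.urlencode(form).encode("utf-8")

# 3) Passing data makes this a POST request.
req = urllib.request.Request(url, data=post_data)

# 4) Add header information.
req.add_header("User-Agent", "Mozilla/5.0")

print(req.get_method(), req.data)
# 5) To submit and process: response = urllib.request.urlopen(req).read()
```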

7. Setting up a proxy server

import urllib.request

def use_proxy(proxy_address, url):
    # Route http:// requests through the given proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_address})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)  # make this opener the global default
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_add = "112.74.32.237:6666"
data = use_proxy(proxy_add, "http://www.baidu.com")
print(len(data))
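Because install_opener() changes the default for every later urlopen() call, it can be useful to keep the proxy local to one opener instead. The following is a minimal sketch of that alternative; the helper name `build_proxy_opener` and the proxy address are hypothetical, and no request is actually sent.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Return an opener that sends http:// requests through proxy_address."""
    proxy = urllib.request.ProxyHandler({"http": proxy_address})
    return urllib.request.build_opener(proxy)


# Assumption: placeholder proxy address; substitute one that you control.
opener = build_proxy_opener("127.0.0.1:8080")
print(type(opener).__name__)
# To fetch through the proxy without touching the global default:
# html = opener.open("http://www.baidu.com", timeout=10).read().decode("utf-8")
```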
