Basic Usage of urllib

J.Reno

Category: Python. Tags: Python, urllib

Article link: https://blog.csdn.net/JReno/article/details/95969170

The urllib module

Introduction to urllib

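Python 2 shipped two libraries, urllib and urllib2, for sending requests. Python 3 no longer has urllib2; everything is unified under urllib.
urllib consists of four modules:
urllib.request sends requests and retrieves their responses
urllib.error contains the exceptions raised by urllib.request
urllib.parse parses and manipulates URLs
urllib.robotparser parses a site's robots.txt file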
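
As a quick illustration of two of these modules, the sketch below splits a URL with urllib.parse and checks a rule with urllib.robotparser. The robots.txt rules and the example.com URLs are made up for the demonstration, and nothing is fetched over the network.

```python
from urllib.parse import urlparse, parse_qs
from urllib.robotparser import RobotFileParser

# Split a URL into its components.
parts = urlparse('http://www.baidu.com/s?wd=python#top')
print(parts.scheme, parts.netloc, parts.path)
print(parse_qs(parts.query))

# Feed hypothetical robots.txt rules to the parser directly (no download).
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
allowed = rp.can_fetch('*', 'http://example.com/index.html')
blocked = rp.can_fetch('*', 'http://example.com/private/a.html')
print(allowed, blocked)
```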

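Fetching a web page

First import the module we need: urllib.request
Once it is imported, call urllib.request.urlopen to open and fetch a web page

There are three common ways to read the content:

read() reads the entire response and returns it as a single bytes object (not a str; call .decode() to get text). An optional size argument limits how many bytes are read.
readlines() reads the entire response and returns it as a list of lines, each a bytes object.
readline() reads one line of the response.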

import urllib.request
html = urllib.request.urlopen('http://www.baidu.com')
html.readline()     # the first line, as bytes
html.read(4096)     # up to the next 4096 bytes
html.readlines()    # all remaining lines, as a list of bytes
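
To see how the three read methods differ without touching the network, the sketch below uses io.BytesIO as a stand-in for the object returned by urlopen. Both are binary file-like objects, so the methods should behave the same way. The sample bytes are made up.

```python
import io

# Stand-in for a urlopen() response: a binary stream with three lines.
resp = io.BytesIO(b'<html>\n<body>hi</body>\n</html>\n')

first = resp.readline()    # one line, as bytes
chunk = resp.read(6)       # at most 6 more bytes
rest = resp.readlines()    # remaining lines, as a list of bytes

print(first, chunk, rest)
# bytes must be decoded to obtain a str
text = first.decode('utf-8').strip()
print(text)
```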

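Downloading network resources

urllib can download more than web pages; any other network resource can be fetched the same way
Some files are large, so read them in chunks, just as you would a local file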

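import urllib.request
html = urllib.request.urlopen('http://172.40.50.116/python.pdf')
# 'wb' rather than 'ab': append mode would corrupt the file on a second run
with open('/tmp/python.pdf', 'wb') as fobj:
    while True:
        data = html.read(4096)   # read up to 4 KB at a time
        if not data:             # empty bytes means end of stream
            break
        fobj.write(data)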
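
The same chunked loop can be wrapped in a small reusable helper. copy_stream is a hypothetical name, not part of urllib; it should work with any binary source that has read(), including a urlopen response. Here io.BytesIO objects stand in for the response and the output file so the sketch runs offline.

```python
import io

def copy_stream(src, dst, chunk_size=4096):
    """Copy src to dst in chunks; return the number of bytes copied."""
    total = 0
    while True:
        data = src.read(chunk_size)
        if not data:          # b'' signals end of stream
            break
        dst.write(data)
        total += len(data)
    return total

# Stand-ins for the network response and the output file.
source = io.BytesIO(b'x' * 10000)
target = io.BytesIO()
copied = copy_stream(source, target)
print(copied)
```

With a real download, src would be the urlopen response and dst a file opened in 'wb' mode. The standard library's shutil.copyfileobj performs the same kind of chunked copy.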

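Simulating a browser client

Some sites deploy anti-crawler measures to keep others from harvesting their content, yet we may still want to fetch them
Setting request headers such as User-Agent makes the request look like it comes from a browser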

import urllib.request
url = 'http://www.baidu.cn'
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
# Build the Request first, then pass that same object to urlopen
request = urllib.request.Request(url, headers=header)
data = urllib.request.urlopen(request).read()
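
A Request object can be inspected before it is sent, which is handy for checking that the header was attached. The sketch below only builds the request and never opens a connection. Note that Request stores header names in capitalized form, so the lookup key is 'User-agent'. The Accept-Language value is just an example.

```python
import urllib.request

url = 'http://www.baidu.cn'
ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'

req = urllib.request.Request(url, headers={'User-Agent': ua})
# Headers can also be added after construction.
req.add_header('Accept-Language', 'zh-CN')

print(req.get_method())               # GET, because no data was supplied
print(req.get_header('User-agent'))
print(req.header_items())
```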

Scraping images from the Jianshu home page

'Download all images from the https://www.jianshu.com home page'
import wget  # third-party package: pip install wget
import os
import re
from urllib import request

def get_web(url, fname):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    r = request.Request(url, headers=headers)
    js_index = request.urlopen(r)
    with open(fname, 'wb') as fobj:
        while True:
            data = js_index.read(4096)
            if not data:
                break
            fobj.write(data)

def get_urls(fname, patt):
    patt_list = []
    cpatt = re.compile(patt)

    with open(fname) as fobj:
        for line in fobj:
            # finditer catches every match on a line, not just the first
            for m in cpatt.finditer(line):
                patt_list.append(m.group())

    return patt_list


if __name__ == '__main__':
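    # Save images to the dst directory; create it if it does not exist
    dst = '/my/jianshu'
    if not os.path.exists(dst):
        os.makedirs(dst)

    # Download the Jianshu home page HTML with urllib
    get_web('https://www.jianshu.com/', '/my/jianshu/js.html')

    # Find the image addresses in the page (raw string avoids invalid-escape warnings)
    img_patt = r'//[\w/.-]+\.(png|jpg|jpeg|gif)'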
    imgs_list = get_urls('/my/jianshu/js.html', img_patt)
    # print(imgs_list)
    for img_url in imgs_list:
        img_url = 'https:' + img_url
        wget.download(img_url, dst)
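
The regular expression used above can be tried on a small HTML fragment. The fragment and its image URLs are made up. Because the pattern contains a capturing group for the extension, re.findall would return only the extensions; calling group() on a match object returns the full address, which is why the script works with match objects.

```python
import re

img_patt = r'//[\w/.-]+\.(png|jpg|jpeg|gif)'

# Hypothetical HTML fragment with two images on one line.
html = ('<img src="//upload.example.com/a/logo.png">'
        '<img src="//cdn.example.com/pics/cat-1.jpg">')

extensions = re.findall(img_patt, html)                   # only the group
urls = [m.group() for m in re.finditer(img_patt, html)]   # full matches
full_urls = ['https:' + u for u in urls]

print(extensions)
print(full_urls)
```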

Advanced urllib

URL encoding

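In general, the URL standard permits only a subset of ASCII characters, such as digits, letters, and a few symbols
Other characters, Chinese characters for example, are not valid in a URL and must be encoded first.
To percent-encode a string, use urllib.parse.quote(); urllib.parse.unquote() reverses it

>>> import urllib.parse
>>> urllib.parse.quote('hello world!')
'hello%20world%21'
>>> urllib.parse.unquote('hello%20world%21')
'hello world!'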
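
quote() also handles non-ASCII text such as Chinese characters by percent-encoding their UTF-8 bytes, and urllib.parse.urlencode() builds a whole query string from a dict. A short sketch follows; the query parameter name wd is just an example.

```python
from urllib.parse import quote, unquote, urlencode

encoded = quote('汉字')
print(encoded)                 # %E6%B1%89%E5%AD%97
decoded = unquote(encoded)
print(decoded)                 # 汉字

# urlencode joins key=value pairs; spaces become '+' in query strings.
query = urlencode({'wd': 'hello world'})
print(query)                   # wd=hello+world
url = 'http://www.baidu.com/s?' + query
print(url)
```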

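Handling HTTP exceptions

If the requested page does not exist or access is denied, the program raises an exception
To catch it, import the urllib.error module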

>>> html = urllib.request.urlopen('http://172.40.50.116/a.html')
urllib.error.HTTPError: HTTP Error 404: Not Found
>>> html = urllib.request.urlopen('http://172.40.50.116/aaa')
urllib.error.HTTPError: HTTP Error 403: Forbidden
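
One way to catch these errors is sketched below. fetch is a hypothetical helper, and its opener parameter exists only so the error path can be exercised without a real server; by default it calls urllib.request.urlopen. HTTPError is a subclass of URLError, so it has to be caught first. fake_404 is likewise a made-up stand-in that simulates a missing page.

```python
import urllib.error
import urllib.request

def fetch(url, opener=urllib.request.urlopen):
    """Return (status_code, body); body is None when the request fails."""
    try:
        with opener(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 403.
        return e.code, None
    except urllib.error.URLError as e:
        # No HTTP response at all: DNS failure, refused connection, etc.
        print('failed to reach server:', e.reason)
        return None, None

def fake_404(url):
    # Stand-in opener that simulates a missing page.
    raise urllib.error.HTTPError(url, 404, 'Not Found', None, None)

result = fetch('http://172.40.50.116/a.html', opener=fake_404)
print(result)
```

Against a live server, calling fetch(url) with no opener argument would go through urlopen and take the same except branches.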