Python 3 Web Crawling (1): Simple Web Page Fetching with urllib

Latest recommended article published on 2021-02-05 19:08:05

Asia-Lee


Category: Web Crawlers. Tags: Python3, web crawler, urllib

Article link: https://blog.csdn.net/asialee_bird/article/details/79810031


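1. Introduction to urllib

urllib is the package Python provides for working with URLs. It is Python's built-in HTTP request library and contains four modules:

The first module, request, is the most basic HTTP request module. It simulates sending a request, much like typing a URL into a browser and pressing Enter: you pass the URL and any extra parameters to its functions.
The second module, error, is the exception-handling module. When a request fails, we can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.
The third module, parse, is a utility module that provides many URL-processing functions, such as splitting, parsing, and joining.
The fourth module, robotparser, reads a site's robots.txt file and determines which pages may be crawled and which may not. In practice it is used less often.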
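The parse and robotparser modules can be tried without any network access. The following is a minimal sketch: the query string, the example.com URLs, and the robots.txt rules are made up for illustration, and the rules are fed to the parser directly instead of being downloaded.

```python
from urllib import parse, robotparser

# urllib.parse: split a URL into its components
parts = parse.urlparse(
    "https://blog.csdn.net/asialee_bird/article/details/79810031?from=demo#top"
)
print(parts.netloc)                  # blog.csdn.net
print(parts.path)                    # /asialee_bird/article/details/79810031
print(parse.parse_qs(parts.query))   # {'from': ['demo']}

# Join a base URL with a relative link, and build a query string
joined = parse.urljoin("https://blog.csdn.net/asialee_bird/", "article/details/79810031")
query = parse.urlencode({"q": "web crawler", "page": 1})
print(joined)
print(query)                         # q=web+crawler&page=1

# urllib.robotparser: parse() accepts the lines of a robots.txt file
# (hypothetical rules, not taken from any real site)
rules = robotparser.RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
blocked = rules.can_fetch("*", "https://example.com/private/page.html")
allowed = rules.can_fetch("*", "https://example.com/index.html")
print(blocked, allowed)              # False True
```

For a real site, set_url() followed by read() would download the robots.txt file before can_fetch() is called.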

2. Quickly fetching a web page with urllib

# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # request.urlopen() opens the URL and returns a file-like response object
    file = request.urlopen("https://blog.csdn.net/asialee_bird")
    html = file.read()  # read the whole body; read() returns a bytes object, not a str
    # html = file.readlines()  # read the whole body as a list of lines (each line is bytes)
    # html = file.readline()   # read a single line
    html = html.decode("utf-8")  # decode() converts the bytes into a str
    print(html)
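To see the difference between read(), readline(), and readlines() without depending on a live site, the sketch below uses a data: URL, which urlopen() also accepts and which embeds the body in the URL itself. The two-line body is an invented stand-in for a real page, and the with statement is one common way to make sure the response gets closed.

```python
from urllib import request

# A data: URL carries its own body, so no network is needed (illustrative only).
URL = "data:text/plain;charset=utf-8,line1%0Aline2%0A"

with request.urlopen(URL) as resp:
    whole = resp.read()        # the entire body as one bytes object

with request.urlopen(URL) as resp:
    first = resp.readline()    # only the first line
    rest = resp.readlines()    # the remaining lines as a list of bytes

text = whole.decode("utf-8")   # bytes -> str

print(whole)   # b'line1\nline2\n'
print(first)   # b'line1\n'
print(rest)    # [b'line2\n']
```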

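3. Detecting the encoding of a web page

# -*- coding: UTF-8 -*-
from urllib import request
import chardet  # third-party module for detecting the encoding (install with pip3 install chardet)

if __name__ == "__main__":
    file = request.urlopen("https://blog.csdn.net/asialee_bird")
    html = file.read()
    charset = chardet.detect(html)  # guess the page's encoding from its raw bytes
    print(charset)

Result: chardet.detect() returns a dictionary that includes the detected encoding and a confidence value.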
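As a standard-library alternative to guessing from the bytes, the charset can often be read from the Content-Type response header. This is a sketch of that different technique, not the chardet approach used above: fetch_text is a hypothetical helper name, the data: URL stands in for a real server, and the UTF-8 fallback is an assumption for pages that declare no charset.

```python
from urllib import request

def fetch_text(url, default="utf-8"):
    """Decode the body using the charset declared in Content-Type, if any."""
    with request.urlopen(url) as resp:
        # get_content_charset() returns None when no charset is declared
        charset = resp.headers.get_content_charset() or default
        return charset, resp.read().decode(charset)

# The bytes C4 E3 BA C3 are "你好" encoded as GBK (illustrative stand-in page).
GBK_URL = "data:text/html;charset=gbk,%C4%E3%BA%C3"

charset, text = fetch_text(GBK_URL)
print(charset, text)   # gbk 你好
```

When the header is missing or wrong, a detector such as chardet remains a reasonable fallback.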

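4. Saving the fetched page locally as an HTML file

Method 1:

# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    file = request.urlopen("https://blog.csdn.net/asialee_bird")
    html = file.read()  # raw bytes of the page
    # open in binary mode ('wb') because html is bytes; the with block closes the file
    with open('test.html', 'wb') as file_html:
        file_html.write(html)

Result: the page is written to test.html in the current working directory.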

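Method 2:

# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # urlretrieve() downloads the URL straight to a file and returns a (filename, headers) tuple
    file = request.urlretrieve('https://blog.csdn.net/asialee_bird', filename='test2.html')
    request.urlcleanup()  # clean up temporary files that urlretrieve() may have left behind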
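Both approaches can be exercised offline. The sketch below is one possible variation, not the article's code: save_page is a hypothetical helper that streams the body to disk with shutil.copyfileobj instead of reading it all into memory first, the data: URL is an invented stand-in for a real page, and the files go into a temporary directory.

```python
import os
import shutil
import tempfile
from urllib import parse, request

# Stand-in page; the body is percent-encoded so it is safe inside a URL.
BODY = "<h1>hello</h1>"
URL = "data:text/html;charset=utf-8," + parse.quote(BODY)

def save_page(url, path):
    """Stream the response body into path and return the file size in bytes."""
    with request.urlopen(url) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as tmp:
    size = save_page(URL, os.path.join(tmp, "test.html"))

    # urlretrieve() returns the local filename and the response headers
    saved_name, headers = request.urlretrieve(
        URL, filename=os.path.join(tmp, "test2.html")
    )
    with open(saved_name, "rb") as f:
        retrieved = f.read()

print(size)        # 14
print(retrieved)   # b'<h1>hello</h1>'
```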

5. The url argument of urlopen and the response information

# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # url can be a string or a Request object
    req = request.Request("https://blog.csdn.net/asialee_bird")
    response = request.urlopen(req)
    get_url = response.geturl()    # geturl() returns the URL actually retrieved, as a string
    in_fo = response.info()        # info() returns the response headers, including server information
    get_code = response.getcode()  # getcode() returns the HTTP status code; 200 means the request succeeded
    # print each piece of information
    print("geturl output: %s" % get_url)
    print('**********************************************')
    print("info output: %s" % in_fo)
    print('**********************************************')
    print("getcode output: %s" % get_code)

Output: the URL, the response headers, and the status code are printed in turn.
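One reason to pass a Request object instead of a plain string is that it can carry request headers and a request body. The sketch below only builds and inspects Request objects, so nothing is sent over the network; the User-Agent value and the form field are made-up examples.

```python
from urllib import parse, request

url = "https://blog.csdn.net/asialee_bird"

# Attach a custom header (hypothetical User-Agent string)
req = request.Request(url, headers={"User-Agent": "Mozilla/5.0 (demo crawler)"})
print(req.full_url)                   # https://blog.csdn.net/asialee_bird
print(req.get_method())               # GET
# Request stores header names via str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))   # Mozilla/5.0 (demo crawler)

# Supplying data (bytes) turns the request into a POST
form = parse.urlencode({"q": "urllib"}).encode("utf-8")
post_req = request.Request(url, data=form)
print(post_req.get_method())          # POST
```

Either object could then be handed to request.urlopen() in place of the URL string.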

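6. Exception handling with urllib.error

(1) Handling exceptions with URLError

# -*- coding: UTF-8 -*-
from urllib import request
from urllib import error

if __name__ == "__main__":
    url = 'https://blog.csdn.net/asialee_bir'  # incorrect link
    try:
        response = request.urlopen(url)
        file = response.read().decode('utf-8')
        print(file)
    except error.URLError as e:
        # code exists only when e is an HTTPError, a subclass of URLError
        if hasattr(e, 'code'):
            print(e.code)
        print(e.reason)

Exception output: the request fails with a 403 error, which means access is forbidden.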

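(2) Handling exceptions with HTTPError

# -*- coding: UTF-8 -*-
from urllib import request
from urllib import error

if __name__ == "__main__":
    url = 'https://blog.csdn.net/asialee_bir'  # incorrect link
    try:
        response = request.urlopen(url)
        file = response.read().decode('utf-8')
        print(file)
    except error.HTTPError as e:
        print(e.code)    # the HTTP status code
        print(e.reason)  # the reason for the error

Exception output: the request fails with a 403 error, which means access is forbidden.

Note: URLError is the parent class of HTTPError, so an except clause for HTTPError must come before one for URLError.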
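Because of this parent and child relationship, the more specific HTTPError clause is usually placed first. The following sketch shows the ordering with exceptions constructed by hand, so it runs without a network; classify is a hypothetical helper, and the URL, status code, and messages are invented.

```python
from urllib import error

def classify(exc):
    """Raise exc and report which except clause handled it."""
    try:
        raise exc
    except error.HTTPError as e:   # subclass first, otherwise URLError would catch it
        return "HTTPError %d: %s" % (e.code, e.reason)
    except error.URLError as e:    # everything else, e.g. DNS or connection failures
        return "URLError: %s" % e.reason

# Hand-built exceptions standing in for real request failures
http_exc = error.HTTPError("https://example.com/missing", 404, "Not Found", {}, None)
url_exc = error.URLError("name resolution failed")

print(issubclass(error.HTTPError, error.URLError))  # True
print(classify(http_exc))   # HTTPError 404: Not Found
print(classify(url_exc))    # URLError: name resolution failed
```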

(3) Using URLError and HTTPError together

# -*- coding: UTF-8 -*-
from urllib import request
from urllib import error

if __name__ == "__main__":
    url = 'https://blog.baidusss.net'  # nonexistent link
    try:
        response = request.urlopen(url)
        file = response.read().decode('utf-8')
        print(file)
    except error.URLError as e:
        if hasattr(e, 'code'):    # use hasattr() to check whether the attribute exists
            print('HTTPError')
            print(e.code)
        if hasattr(e, 'reason'):  # every URLError, including HTTPError, has a reason
            print('URLError')
            print(e.reason)
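Catching these exceptions also makes it possible to retry a failed request, as mentioned in the introduction to the error module. The following is a rough sketch under several assumptions: fetch_with_retry, its retry count, its delay, and the rule of not retrying 4xx responses are all illustrative choices, and the opener parameter exists only so the function can be tried with a fake opener.

```python
import time
from urllib import error, request

def fetch_with_retry(url, retries=3, delay=1.0, opener=request.urlopen):
    """Fetch url, retrying on failure; re-raise the last error if all attempts fail."""
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=10) as resp:
                return resp.read()
        except error.HTTPError as e:
            # Assumption: client errors (4xx) will not succeed on a retry
            if 400 <= e.code < 500:
                raise
            last_exc = e
        except error.URLError as e:
            last_exc = e
        if attempt < retries:
            time.sleep(delay)   # wait before the next attempt
    raise last_exc

# Offline demonstration with a data: URL standing in for a real page
print(fetch_with_retry("data:text/plain,ok"))   # b'ok'
```

A production crawler would typically add backoff between attempts and logging, which are left out here.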