Python Crawler in Practice: An Introduction to the Built-in urllib Module

Latest recommended article published on 2023-12-18 17:43:46

い時間で奮闘しま

Category: Python crawler. Tags: python, crawler

Article link: https://blog.csdn.net/m0_46428072/article/details/107135129

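Hello everyone! Once again it is time for me, the handsome and charming "thadqy", to walk you through some web crawler knowledge. This time I will cover the urllib module as it is used in crawlers, and a hands-on crawler example follows later in the post. First, let me show you some of the pictures I scraped.

[Image: pictures scraped by the author]

Now let's get down to business.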

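Introduction to the urllib.request module

urllib.request is part of Python's standard library, so there is nothing to install before using it. The module contains many classes and functions; the most commonly used ones are described below:

urllib.request.urlopen('URL'): sends a request to a website and returns the response object.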

import urllib.request

url = 'http://www.baidu.com'

response = urllib.request.urlopen(url)

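urllib.request.Request(url, headers=dict): builds a request object that can carry extra information such as request headers. Unlike urllib.request.urlopen(), Request() does not send anything by itself. If a request needs no extra parameters, calling urllib.request.urlopen(url) directly is enough; otherwise, build a urllib.request.Request() first and then pass it to urllib.request.urlopen(), which sends it and returns the response object used for all later operations.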

import urllib.request

url = 'http://www.baidu.com'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

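# Build the request object with custom headers (nothing is sent yet)
request = urllib.request.Request(url, headers=headers)
# Send the request and get the response object
response = urllib.request.urlopen(request)

# Read the response body; read() consumes the stream, so call it only once
html = response.read() # page source as bytes
html2 = html.decode('utf-8') # the same source decoded into a UTF-8 str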

print(html)

print(html2)
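
To make the difference between Request() and urlopen() concrete, here is a minimal sketch that only builds a request object and inspects it; nothing is sent over the network. The shortened User-Agent value is a placeholder of my own, not the one used in the example above.

```python
import urllib.request

# Placeholder header value, for illustration only
headers = {'User-Agent': 'Mozilla/5.0'}

# Building the object does not contact the server
request = urllib.request.Request('http://www.baidu.com', headers=headers)

print(request.full_url)                  # the URL the request targets
print(request.get_method())              # GET, because no data was attached
print(request.get_header('User-agent'))  # Request stores header names capitalized like this
```

Only when this object is passed to urllib.request.urlopen() does the request actually go out.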

Operations on the response object

The following methods are available on the response object returned by urllib.request.urlopen().

read(): reads the response body as bytes

import urllib.request

url = 'http://www.baidu.com'

response = urllib.request.urlopen(url)

html = response.read()

print(html)

Output: b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

decode('encoding'): a method of the bytes returned by read(); it decodes them into a str using the given encoding

import urllib.request

url = 'http://www.baidu.com'

response = urllib.request.urlopen(url)

html = response.read().decode('utf-8')

print(html)

Output:
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
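
Hard-coding 'utf-8' works for this page, but decode() raises UnicodeDecodeError when a page uses a different encoding (GBK is still common on Chinese sites). Below is a small sketch of a fallback. The helper name decode_body and the list of encodings to try are my own assumptions, not part of urllib.

```python
def decode_body(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in turn and return the first successful result."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing undecodable bytes
    return raw.decode(encodings[0], errors='replace')


# The same text decodes correctly whichever of the two encodings produced it
print(decode_body('你好'.encode('utf-8')))
print(decode_body('你好'.encode('gbk')))
```

In a crawler you would call it as decode_body(response.read()) instead of response.read().decode('utf-8').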

getcode(): returns the HTTP status code of the response

import urllib.request

url = 'http://www.baidu.com'

response = urllib.request.urlopen(url)

print(response.getcode())

# Output: 200

geturl(): returns the URL of the response (the final URL if the request was redirected)

import urllib.request

url = 'http://www.baidu.com'

response = urllib.request.urlopen(url)

print(response.geturl())

# Output: https://www.baidu.com/
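
The response methods above can be combined into one small helper. The sketch below is only one way to do it: the name fetch, the 10-second timeout and the UTF-8 fallback are my own choices. It closes the response with a with block, reads the charset from the Content-Type header when the server declares one, and catches urllib.error.URLError so that a failed request does not crash the crawler.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status code, final URL, decoded text), or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Charset declared by the server, if any; otherwise assume UTF-8
            charset = response.headers.get_content_charset() or 'utf-8'
            text = response.read().decode(charset, errors='replace')
            return response.getcode(), response.geturl(), text
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so it is caught here too
        print('Request failed:', e)
        return None
```

A call such as fetch('http://www.baidu.com') returns a tuple that can be unpacked into the status code, the final URL and the page source. Check for None before unpacking, since that is what a failed request returns.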

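Introduction to the urllib.parse module

The urllib.parse module is mainly used for URL encoding. It usually comes into play when assembling the URL to be requested.

Common urllib.parse functions

urlencode(dict): takes a dictionary and encodes its key-value pairs into a URL query string

To request a page we need its URL, and we usually assemble that URL ourselves: store the search terms in a dictionary, encode the dictionary, and then append the result to the fixed part of the URL.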

# Searching Baidu Tieba for "妹子" produces the following URL:

# https://tieba.baidu.com/f?kw=%E5%A6%B9%E5%AD%90&ie=utf-8&pn=50

# The kw=%E5%A6%B9%E5%AD%90 part of this URL is the encoded dictionary entry. It stands for "妹子": each UTF-8 byte of the text is written as % followed by two hexadecimal digits.

import urllib.request
import urllib.parse

baseUrl = 'https://tieba.baidu.com/f?' # the part shared by every URL of this search

name = '妹子' # the search term

kw = {'kw': name} # store the search term in a dictionary

name = urllib.parse.urlencode(kw) # encode the dictionary

print(name)

# Output: kw=%E5%A6%B9%E5%AD%90
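
The example stops at the encoded kw entry. As a sketch of the final assembly step, the dictionary below holds all three parameters that appear in the URL shown earlier, and the encoded result is appended to the fixed part. The variable names base_url, params and query are mine.

```python
import urllib.parse

base_url = 'https://tieba.baidu.com/f?'

# One dictionary entry per query parameter; insertion order is preserved
params = {'kw': '妹子', 'ie': 'utf-8', 'pn': 50}

# urlencode joins the pairs with '&' and percent-encodes the values
query = urllib.parse.urlencode(params)
url = base_url + query

print(url)
# https://tieba.baidu.com/f?kw=%E5%A6%B9%E5%AD%90&ie=utf-8&pn=50
```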

quote(): also performs encoding. Unlike urlencode(), it takes a plain string rather than a dictionary, so the fixed part of the URL has to be assembled slightly differently

# Again searching for "妹子"
# url = https://tieba.baidu.com/f?kw=%E5%A6%B9%E5%AD%90&ie=utf-8&pn=50

import urllib.parse

baseUrl = 'https://tieba.baidu.com/f?kw=' # note the extra "kw=" at the end compared with the urlencode() version

name = '妹子'

name = urllib.parse.quote(name)

print(name)

# Output: %E5%A6%B9%E5%AD%90
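
A few related calls are worth knowing alongside quote(). The short sketch below shows unquote(), which reverses the encoding, and two details that tend to surprise people: how spaces and slashes are handled. The sample strings 'a b' and 'a/b' are just illustrations.

```python
import urllib.parse

encoded = urllib.parse.quote('妹子')
print(encoded)                          # %E5%A6%B9%E5%AD%90
print(urllib.parse.unquote(encoded))    # 妹子

# quote() and quote_plus() treat spaces differently
print(urllib.parse.quote('a b'))        # a%20b
print(urllib.parse.quote_plus('a b'))   # a+b

# '/' is left alone by default; pass safe='' to encode it as well
print(urllib.parse.quote('a/b'))            # a/b
print(urllib.parse.quote('a/b', safe=''))   # a%2Fb
```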

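Request methods

There are two common ways to request a web page.

GET request: the query parameters are visible in the URL
POST request:
1. Request() needs an extra data argument, which carries the form fields to send to the server. The fields usually start out as a dictionary
```
urllib.request.Request(url, data=data, headers=headers)
```
2. The form data must be submitted as bytes, not str, so the dictionary has to be encoded first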
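
The two POST requirements above can be sketched as follows. The endpoint http://example.com/search and the form fields are hypothetical and only serve as an illustration; the request is built and inspected but never sent.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only
url = 'http://example.com/search'
form = {'kw': '妹子', 'pn': 50}
headers = {'User-Agent': 'Mozilla/5.0'}

# dict -> str (urlencode) -> bytes (encode)
data = urllib.parse.urlencode(form).encode('utf-8')

request = urllib.request.Request(url, data=data, headers=headers)

print(data)                   # b'kw=%E5%A6%B9%E5%AD%90&pn=50'
print(request.get_method())   # POST, because data is not None

# Sending it works exactly like a GET request:
# response = urllib.request.urlopen(request)
```

Passing a str as data makes urlopen() raise a TypeError, which is why the encode('utf-8') step matters.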

Hands-on: scraping landscape pictures from a web page and saving them to files

The URL scraped this time is: http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=风景图片&pn=20

import os
import re
import urllib.request


# http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E9%A3%8E%E6%99%AF%E5%9B%BE&pn=20

class Spider:

    def __init__(self):
        # Headers that make the request look like it comes from a regular browser
        self.headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
        }

    def request_url(self, url):
        # Request one result page and hand its HTML to the parser
        request = urllib.request.Request(url=url, headers=self.headers)
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
        self.parse_html(html)

    def parse_html(self, html):
        # Extract every picture URL stored in an "objURL" field
        picUrlList = re.findall(r'"objURL":"(.*?)"', html)
        self.parse_pic_urls(picUrlList)

    def parse_pic_urls(self, picUrlList):
        for picUrl in picUrlList:
            # Use the last segment of the URL as the file name
            picName = picUrl.split('/')[-1]
            self.store_data(picName, picUrl)

    def create_dir(self):
        # Create ./images if it does not exist, then make it the working directory
        try:
            os.mkdir('./images')
        except FileExistsError as e:
            print(e)
        os.chdir('./images')

    def store_data(self, picName, picUrl):
        # Download one picture and write its bytes to disk
        try:
            response = urllib.request.urlopen(picUrl)
            picture = response.read()
        except Exception as e:
            print(e)
        else:
            with open(picName, 'wb') as f:
                print(picName)
                f.write(picture)


if __name__ == '__main__':

    baseUrl = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E9%A3%8E%E6%99%AF%E5%9B%BE&pn='

    # int() is safer than eval() for parsing user input
    start = int(input("Enter the first page: "))
    end = int(input("Enter the last page: "))

    spider = Spider()
    spider.create_dir()

    for page in range(start, end + 1):
        url = baseUrl + str(page * 10)
        print('Crawling page ' + str(page))
        spider.request_url(url)
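
Two steps of the crawler can be tried without touching the network: extracting the objURL values and turning each picture URL into a file name. In the sketch below the HTML fragment and its URLs are made up for illustration, and pic_name is a hypothetical helper of mine, not part of the crawler above. It takes the file name from the path part of the URL only, so a query string such as ?w=100 does not end up in the name, which a plain split('/')[-1] would allow.

```python
import os
import re
import urllib.parse

# Made-up fragment imitating the JSON embedded in the result page
html = '{"objURL":"http://example.com/img/lake.jpg?w=100"},{"objURL":"http://example.com/img/"}'


def pic_name(pic_url, default='unnamed.jpg'):
    """Derive a file name from the path part of a picture URL."""
    path = urllib.parse.urlparse(pic_url).path
    # basename is empty when the path ends with '/', so fall back to a default
    return os.path.basename(path) or default


pic_urls = re.findall(r'"objURL":"(.*?)"', html)
names = [pic_name(u) for u in pic_urls]

print(pic_urls)  # ['http://example.com/img/lake.jpg?w=100', 'http://example.com/img/']
print(names)     # ['lake.jpg', 'unnamed.jpg']
```

Note that a fixed default name means several nameless pictures would overwrite each other; adding a counter to the name is one way around that.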

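Conclusion

Does urllib feel a bit cumbersome? It does, and its API is showing its age, but the upside is that it ships with Python, so there is nothing to download or install. These days the third-party Requests library is far more widely used, because it is more efficient and convenient to work with than urllib. I will introduce Requests in the next post. In the meantime, enjoy the beautiful scenery I scraped.

[Image: landscape pictures scraped by the author]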
