Supplementary notes on urllib

風坞

Category: Python

Article link: https://blog.csdn.net/m0_52414727/article/details/113768548

Video course followed: Python爬虫基础5天速成 (Python crawler basics in 5 days)

urllib.request

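I don't know why my Spyder won't autocomplete when I type import urllib, but that doesn't matter; I only need to know how to use the functionality.

Sending a GET request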

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")  # open a web page and store the returned data in response

Printing response gives <http.client.HTTPResponse object at 0x0000023D6B0F8320>. This is the response object itself, not yet the data of the page.

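To read the data, call response.read(). You can save the printed output in a new text file, change its extension to html, and open it in a browser. (I don't know why the teacher's version only showed \r\n.)
If you decode with response.read().decode('utf-8') first and then open the file, the page renders properly.
[screenshot]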
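The read-then-decode-then-save step described above can be sketched as follows. This is only a minimal sketch, not the course's code: the helper names fetch_bytes and save_as_html, the 10-second timeout, and the assumption that the page is UTF-8 encoded are all mine.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=10):
    """Return the raw response body as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_as_html(raw, path, encoding="utf-8"):
    """Decode raw bytes, write the text to path, and return the text."""
    text = raw.decode(encoding)
    Path(path).write_text(text, encoding=encoding)
    return text
```

For example, save_as_html(fetch_bytes("http://www.baidu.com"), "baidu.html") would write the decoded page to baidu.html, so no manual renaming of the extension is needed.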

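Sending a POST request

Here httpbin.org is used for testing.
Use its HTTP Methods section; expanded, it looks like this (my Chrome auto-translated the page).

In the POST entry, click "Try it out" and then "Execute", and you get the following:
Code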

# send a POST request
response = urllib.request.urlopen("http://httpbin.org/post")  # "post" here is only the URL path; without data, urlopen still sends a GET
print(response.read())

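This raises the following error:
[screenshot]
The endpoint cannot be called directly like this: without data, urlopen sends a GET, and /post only accepts POST. You first have to pass some form data; once the form data is packaged, the POST endpoint can be accessed correctly.
The fix is as follows: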

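import urllib.parse  # URL parsing and encoding helpers
# send a POST request
data = bytes(urllib.parse.urlencode({"hello":"world", "okk":"I'm fine"}), encoding = "utf-8")  # bytes() packages the encoded form as a bytes object
# the payload can hold key-value pairs and other encoded values
# the {"hello":"world", "okk":"I'm fine"} part supplies the key-value pairs; encoding specifies how the string is turned into bytes
# pass data as the content sent with the POST
response = urllib.request.urlopen("http://httpbin.org/post", data = data)
print(response.read())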

This produces the following output:

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "hello": "world", \n "okk": "I'm fine"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "26", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-60222e69-39dcfbe84c1e319e043a2a14"\n }, \n "json": null, \n "origin": "222.134.56.138", \n "url": "http://httpbin.org/post"\n}\n'

With decode you get nicely formatted output (stating the obvious...)

print(response.read().decode("utf-8"))

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "hello": "world",
    "okk": "I'm fine"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "26",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-60222ef0-1ca5e0bb40c0ae356f29180e"
  },
  "json": null,
  "origin": "222.134.56.138",
  "url": "http://httpbin.org/post"
}

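Wait, doesn't this look familiar? It is the response body from the screenshot just now!
In other words, we used urllib to simulate the request that a real browser sends.
So when you want to access a site via POST, you have to package the data the POST way: pass the parameters through data, and data must be a bytes object.
If there are more fields to simulate, such as a username and password, they are passed the same way. POST requests are therefore mostly used to simulate a real user logging in to a site; later, cookie information may be added as well, so that the site believes a real person is logging in.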
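To see what actually ends up in data, the encoding step can be run on its own without sending anything. The sketch below reuses the form from the example above; the variable names and the use of parse_qs to reverse the encoding are my own additions for illustration.

```python
import urllib.parse

form = {"hello": "world", "okk": "I'm fine"}

# urlencode turns the dict into a percent-encoded query string (a str).
encoded = urllib.parse.urlencode(form)

# urlopen needs bytes for its data argument, so the string is encoded.
data = encoded.encode("utf-8")

# parse_qs reverses the encoding; every value comes back as a list.
decoded = urllib.parse.parse_qs(encoded)

print(encoded)
print(len(data))
print(decoded)
```

The encoded string is hello=world&okk=I%27m+fine, which is 26 bytes long and so agrees with the "Content-Length": "26" header that httpbin.org echoed back.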

Testing GET and handling timeouts (the site notices you are a crawler and doesn't want to let you in)
Normal test:

response = urllib.request.urlopen("http://httpbin.org/get")
print(response.read().decode("utf-8"))

{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-602231b0-2f0e30a20c3b9165304cb72e"
  },
  "origin": "222.134.56.138",
  "url": "http://httpbin.org/get"
}

Whereas the response body shown on the httpbin.org page for the browser's request is:

Response body
{
  "args": {},
  "headers": {
    "Accept": "application/json",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Referer": "http://httpbin.org/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-60223260-59afb88d3040904b4cb8ed68"
  },
  "origin": "58.58.13.50",
  "url": "http://httpbin.org/get"
}

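Compared with the browser's response body, the urllib request is missing two Accept headers (Accept and Accept-Language) as well as Referer, and its User-Agent is completely undisguised, bluntly telling the server: I'm a crawler.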
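The comparison can also be done mechanically. The sketch below copies the header names from the two outputs above into sets (the variable names are mine) and takes the set difference.

```python
# Header names echoed by httpbin.org for the urllib request (from the output above).
urllib_headers = {"Accept-Encoding", "Host", "User-Agent", "X-Amzn-Trace-Id"}

# Header names echoed for the request sent by Chrome from the httpbin.org page.
browser_headers = {
    "Accept", "Accept-Encoding", "Accept-Language",
    "Host", "Referer", "User-Agent", "X-Amzn-Trace-Id",
}

# Set difference: what the browser sends that urllib does not.
missing = sorted(browser_headers - urllib_headers)
print(missing)
```

Note that this only compares header names; User-Agent appears in both, but its value still gives the crawler away.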

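Timeout handling:

This is for dead links or sites that won't let you in. A real timeout is of course usually not as short as in this example. The program can skip such a URL first and come back to crawl it specifically after everything else has run.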

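# timeout handling
import socket
import urllib.error
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout = 0.01)
    print(response.read().decode("utf-8"))
except (urllib.error.URLError, socket.timeout) as e:  # a timeout while reading the body is raised as socket.timeout, not URLError
    print("TimeOut")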
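The "skip first, come back later" idea mentioned above could look roughly like this. It is a sketch under my own assumptions, not the tutorial's code: the function names fetch and crawl, the fetch_func parameter, the 5-second timeout, and the single retry pass are all hypothetical choices.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5):
    """Download one URL and return its body decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def crawl(urls, fetch_func=fetch, retries=1):
    """Fetch every URL, skipping failures first and retrying them afterwards."""
    results = {}
    pending = list(urls)
    for _ in range(retries + 1):
        failed = []
        for url in pending:
            try:
                results[url] = fetch_func(url)
            except (urllib.error.URLError, socket.timeout):
                failed.append(url)  # skip for now, come back in the next pass
        pending = failed
        if not pending:
            break
    # results holds the pages that worked; pending holds URLs that never did
    return results, pending
```

Passing the download function in as fetch_func keeps the retry logic separate from the network call, which also makes it easy to try out with a stand-in function.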

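Using a request to inspect some information

# use a request to look at some of the response information
response = urllib.request.urlopen("http://httpbin.org/get")
print(response.status)  # the HTTP status code of the response; for httpbin.org it is 200

Change the URL to http://douban.com and you get the following: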

runfile('C:/Users/86155/Desktop/test.py', wdir='C:/Users/86155/Desktop')
Traceback (most recent call last):

  File "<ipython-input-15-5b835a2ae8c3>", line 1, in <module>
    runfile('C:/Users/86155/Desktop/test.py', wdir='C:/Users/86155/Desktop')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/86155/Desktop/test.py", line 36, in <module>
    response = urllib.request.urlopen("http://douban.com")

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError

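But in the tutorial the final result was...: [screenshot]
I'll ask someone about this later... A senior student said it is a software issue, so I'll come back to it when error codes are actually needed.

Let's continue and change the code to: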
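When the status code of a failed request is needed, one option is to catch urllib.error.HTTPError, which is a subclass of URLError and carries the code in its code attribute. The sketch below is an assumption-laden illustration: the helper name get_status and its opener parameter are hypothetical, and it makes no claim about which code Douban actually returned.

```python
import urllib.error
import urllib.request


def get_status(url, opener=urllib.request.urlopen, timeout=5):
    """Return the HTTP status code, including codes that urllib raises as errors."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # HTTPError carries the status code the server answered with.
        return e.code
```

With this, get_status("http://douban.com") would return a number instead of ending in a traceback, as long as the server answers at all.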

# use a request to look at some of the response information
response = urllib.request.urlopen("http://www.baidu.com")
# print(response.status)  # the HTTP status code of the response
print(response.getheaders())

This gives the following result (a list):

[('Bdpagetype', '1'), ('Bdqid', '0xef3923fc001f9475'), ('Cache-Control', 'private'), ('Content-Type', 'text/html;charset=utf-8'), ('Date', 'Tue, 09 Feb 2021 12:14:19 GMT'), ('Expires', 'Tue, 09 Feb 2021 12:14:09 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=CDC88F588E1CFA34918D9FE31EC04239:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=CDC88F588E1CFA34918D9FE31EC04239; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1612872859; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BAIDUID=CDC88F588E1CFA34300A244D0C591C05:FG=1; max-age=31536000; expires=Wed, 09-Feb-22 12:14:19 GMT; domain=.baidu.com; path=/; version=1; comment=bd'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=1; path=/'), ('Set-Cookie', 'H_PS_PSSID=33425_33582_33273_33392_26350; path=/; domain=.baidu.com'), ('Traceid', '1612872859143537997817237848613978084469'), ('Vary', 'Accept-Encoding'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]

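As you can see, this matches the information shown for the page in the browser:
[screenshot]
Zoomed in (these are the headers of the file opened at the start) ~~not sure whether "file" is the right word~~

Of course you can also do this: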

print(response.getheader("Server"))

and you get:

BWS/1.1
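Since getheaders() returns a list of (name, value) pairs in which names such as Set-Cookie repeat, it can be convenient to group the values by name. The following is a small sketch of that idea; group_headers is a hypothetical helper, and the sample list is a shortened copy of the output above.

```python
from collections import defaultdict


def group_headers(pairs):
    """Collect (name, value) pairs into a dict mapping each name to a list of values."""
    grouped = defaultdict(list)
    for name, value in pairs:
        grouped[name].append(value)
    return dict(grouped)


# A shortened sample in the same shape as response.getheaders() returns.
sample = [
    ("Server", "BWS/1.1"),
    ("Set-Cookie", "BDSVRTM=0; path=/"),
    ("Set-Cookie", "BD_HOME=1; path=/"),
]
headers = group_headers(sample)
print(headers["Server"])
print(len(headers["Set-Cookie"]))
```

Keeping a list per name avoids silently dropping repeated headers, which a plain dict(pairs) would do.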

How to disguise the request as a browser

The idea is to reproduce the information (key-value pairs) that a browser sends.

Code:

# to disguise the request as a browser, reproduce the information (key-value pairs) a browser sends
url = "http://httpbin.org/post"
headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36"
        }  # values must be strings; if a value contains single quotes, wrap it in double quotes, and vice versa
# to make the disguise more convincing, provide more key-value pairs
data = bytes(urllib.parse.urlencode({"name":"Elsa"}), encoding = "utf-8")
req = urllib.request.Request(url = url, data = data, headers = headers, method = "POST")
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))

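How to get the `user_agent`

[screenshot]
Of course, to make the disguise more realistic you can add more key-value pairs. Taking Douban as an example:

Scroll to the bottom of the request headers and enter all of the key-value pairs above, and the imitation becomes very convincing.

What gets wrapped in the request: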

req = urllib.request.Request(url = url, data = data, headers = headers, method = "POST")

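The URL, the data (a request can carry some data), the headers (so that the server responds accordingly), and the request method.

Douban in practice:

For Douban we don't need POST (no login is needed, after all); a plain GET is enough, so there is a bit less to write: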
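A Request object can be inspected before it is sent, which makes it easier to see what was wrapped. The sketch below builds two requests without opening them; the shortened User-Agent value and the variable names post_req and get_req are my own, chosen only for illustration.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # shortened, illustrative value
data = urllib.parse.urlencode({"name": "Elsa"}).encode("utf-8")

# With data and no explicit method, Request defaults to POST.
post_req = urllib.request.Request(url=url, data=data, headers=headers)
# Without data, the default method is GET.
get_req = urllib.request.Request(url="http://httpbin.org/get", headers=headers)

print(post_req.get_method())
print(get_req.get_method())
print(post_req.full_url)
print(post_req.data)
# Request stores header names capitalized, so the lookup key is "User-agent".
print(post_req.get_header("User-agent"))
```

Passing method = "POST" explicitly, as in the code above, is therefore optional when data is supplied, though it does make the intent clearer.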

url = "http://www.douban.com"
headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36"
        }
req = urllib.request.Request(url = url, headers = headers)
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))

And you get: (it is very long, just skim it)
[screenshot]

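Appendix:

How to change a file extension:
First, open a folder (here I opened the folder containing the file, but that isn't actually necessary)

Then click View, then Options

Uncheck "Hide extensions for known file types" and click Apply in the bottom-right corner

Finally, rename the file as usual: right-click, choose Rename, and change the extension
How to read and write files in Python
It seems... not much different from C++

Reading: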

with open('/path/to/file', 'r') as f:
    print(f.read())

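Recommended blog post: python 文件读写操作
I figured I would be too lazy to open it, so I pasted it here: (all three keep the trailing \n)

read() reads the whole file at once and is usually used to put the file contents into a single string variable. If the file is larger than available memory, it is safer to call read(size) repeatedly, which reads at most size characters each time (bytes in binary mode).

readlines() differs from readline() in that it reads the whole file at once, like read(). readlines() automatically splits the file contents into a list of lines, which can be processed with Python's for ... in ... construct.

readline() reads only one line at a time and is usually much slower than readlines(). Use readline() only when there is not enough memory to read the whole file at once.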
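The difference between the three methods is easiest to see on a small file. The sketch below writes a three-line temporary file (the file and its contents are made up for the demonstration) and reads it back in each way, plus by iterating over the file object directly.

```python
import os
import tempfile

# Create a small three-line file to read back (the temporary path is only for this demo).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("one\ntwo\nthree\n")

with open(path, "r", encoding="utf-8") as f:
    whole = f.read()            # the entire file as one string

with open(path, "r", encoding="utf-8") as f:
    first = f.readline()        # only the first line, newline included

with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()       # a list with one string per line

with open(path, "r", encoding="utf-8") as f:
    stripped = [line.rstrip("\n") for line in f]  # iterating reads line by line

os.remove(path)
print(repr(whole), repr(first), lines, stripped)
```

Iterating over the file object, as in the last case, reads one line at a time without loading the whole file, so it is a common alternative to calling readline() in a loop.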

Writing:

mytxt = open('out.txt', mode='a', encoding='utf-8')
print(response.read().decode("utf-8"),  file = mytxt)
# the [response.read().decode("utf-8")] part can be replaced with anything you want to write
mytxt.close()
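The same write can be done with a with block, mirroring the reading example, so the file is closed even if an error occurs in between. This is a sketch with a hypothetical helper name, append_text; it is an alternative to the open/close pair above, not a replacement the tutorial prescribes.

```python
def append_text(path, text):
    """Append text plus a newline to path; the with block closes the file."""
    with open(path, mode="a", encoding="utf-8") as f:
        print(text, file=f)
```

For example, append_text("out.txt", response.read().decode("utf-8")) would add the decoded page to out.txt.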
