Python: Getting Started with Web Crawlers

一只工程狮

Category: Python. Tags: web crawler

Article link: https://blog.csdn.net/qq_40913465/article/details/103044220

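How Python accesses the internet:
Use the urllib.request module from the urllib package.

1. Import the module.
2. Open the URL with urlopen().
3. Read the page content with read().
4. Call decode('utf-8') to decode the result as UTF-8; without this step the content is shown as raw bytes rather than text.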

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')  # open the page
html = response.read()  # read the page content (bytes)
html = html.decode('utf-8')  # decode the bytes into a str
print(html)
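
The four steps above can be folded into one helper. This is a minimal sketch rather than the article's own code: `decode_body` and `fetch_text` are hypothetical names, and it assumes the server declares its encoding in the Content-Type header, falling back to UTF-8 when it does not.

```python
import urllib.request


def decode_body(raw, charset=None, fallback="utf-8"):
    """Turn the bytes returned by read() into a str."""
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(charset or fallback, errors="replace")


def fetch_text(url, timeout=10):
    """Open url, read the body, and decode it (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not str
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
    return decode_body(raw, charset)
```

With this in place, `fetch_text('http://www.baidu.com')` would return the page as a string. The `with` block also closes the connection once the body has been read.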

Fetching an image with a crawler
Code demonstration:

import urllib.request
response = urllib.request.urlopen('https://placekitten.com/g/500/300')  # open the URL
cat_img = response.read()  # read the response body (the image bytes)
with open('cat_img.jpg', 'wb') as f:  # open the output file in binary write mode
    f.write(cat_img)  # save the image
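
For larger files, reading the whole body into memory with read() may be wasteful. One possible variant, sketched below, copies the response to disk in chunks with shutil.copyfileobj. The names `save_stream` and `download` and the chunk size are assumptions made for illustration.

```python
import shutil
import urllib.request


def save_stream(src, path, chunk_size=64 * 1024):
    """Copy a binary file-like object to path, chunk by chunk."""
    with open(path, "wb") as f:  # 'wb' because image data is bytes, not text
        shutil.copyfileobj(src, f, chunk_size)


def download(url, path, timeout=10):
    """Fetch url and save the body to path (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        save_stream(response, path)
```

This works because the response is file-like, so it can be passed anywhere a readable binary file is expected.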

urlopen accepts either a URL string or a Request object.
The code can therefore also be written like this:

import urllib.request
req = urllib.request.Request('https://placekitten.com/g/400/500')  # create a Request object
response = urllib.request.urlopen(req)  # pass the Request object instead of a string
cat_img= response.read()
with open('cat_img.jpg','wb') as f:
    f.write(cat_img)
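
One reason to build a Request object first is that it can carry extra information, such as HTTP headers. The sketch below attaches a User-Agent header. The header value and the helper name `build_request` are illustrative assumptions, not something the article prescribes.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (example crawler)"):
    """Return a Request carrying a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://placekitten.com/g/400/500")
# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.get_method())  # GET, because no data was attached
```

The resulting `req` can be passed to urllib.request.urlopen(req) exactly as in the example above.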

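urlopen returns a file-like object (called response here), which is why its content can be read with read(). It also provides some other methods,
for example:
geturl()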

response.geturl()
Out[9]: 'https://placekitten.com/g/400/500'  # the URL that was actually retrieved

info()  # returns an HTTPMessage object holding the response headers

response.info()
Out[10]: <http.client.HTTPMessage at 0x28dd9282208>

print(response.info())
Date: Wed, 13 Nov 2019 03:00:05 GMT
Content-Type: image/jpeg
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=dfbcaadd02a50377b3206e90ea871919a1573614005; expires=Thu, 12-Nov-20 03:00:05 GMT; path=/; domain=.placekitten.com; HttpOnly
Access-Control-Allow-Origin: *
Cache-Control: public, max-age=86400
Expires: Thu, 31 Dec 2020 20:00:00 GMT
CF-Cache-Status: HIT
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Vary: Accept-Encoding
Server: cloudflare
CF-RAY: 534d7e4d9eb89298-SJC
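
HTTPMessage is a subclass of email.message.Message from the standard library, so individual headers can be looked up by name instead of being printed in bulk. The sketch below rebuilds a message from three of the header lines shown above so that it runs without a network connection. With a live response, the object returned by response.info() would be used directly.

```python
import email

# Three header lines copied from the output above.
raw_headers = (
    "Content-Type: image/jpeg\n"
    "Cache-Control: public, max-age=86400\n"
    "Server: cloudflare\n"
    "\n"
)
headers = email.message_from_string(raw_headers)

print(headers["Content-Type"])        # lookup by name
print(headers.get("content-type"))    # header names are case-insensitive
print(headers.get("X-Missing", "-"))  # default value for an absent header
print(headers.get_content_type())     # media type without parameters
```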

getcode()  # returns the HTTP status code

response.getcode()
Out[12]: 200  # the page responded normally
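
For error statuses such as 404 or 500, urlopen does not return a response. It raises urllib.error.HTTPError instead, and the exception's `code` attribute holds the status. A minimal sketch that handles both cases follows. The helper name `fetch_status` and its `opener` parameter, which exists only so the function can be exercised without a network, are assumptions.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, including error statuses."""
    try:
        with opener(url) as response:
            # .status is the attribute form of getcode() on HTTP responses
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 500
```

Network-level failures such as an unreachable host raise urllib.error.URLError, which this sketch deliberately leaves uncaught.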
