20180311_Python study notes: web crawlers

权威小土豆

Article link: https://blog.csdn.net/A11085013/article/details/79513941

Web crawlers
urllib
The urllib.request.urlopen function is all you need to fetch a web page.

>>> import urllib.request
>>> response = urllib.request.urlopen("http://www.fishc.com")
>>> html = response.read()
>>> print(html)

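The printed output does not look like the page source shown in a browser. That is an encoding issue: read() returns raw bytes that have not been decoded yet. Decode them to get readable text:
>>> html = html.decode('utf-8')
>>> print(html)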

Download an image:

import urllib.request

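# Fetch the image; read() returns the raw bytes of the file
response = urllib.request.urlopen('http://imgsrc.baidu.com/forum/w%3D580/sign=d9d897c810d8bc3ec60806c2b28aa6c8/72acec36afc37931bf0ca4dbe2c4b74542a911b3.jpg')
cat_img = response.read()
# Open the target in binary mode ('wb') so the bytes are written unchanged
with open('cat.jpg', 'wb') as f:
    f.write(cat_img)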
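The script above reads the whole image into memory before writing it. For larger files, one possible alternative is to stream the response straight to disk. The sketch below assumes that approach; download_file is a hypothetical helper name, not something urllib provides.

```python
import shutil
import urllib.request


def download_file(url, dest_path):
    """Stream the body of url into dest_path (hypothetical helper)."""
    # Both objects are closed automatically when the with block ends
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj reads and writes in chunks, so the file
        # never has to fit in memory all at once
        shutil.copyfileobj(response, out)
    return dest_path
```

With the image URL from the script above, download_file(url, 'cat.jpg') would be expected to produce the same cat.jpg.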

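urlopen actually returns a file-like object, which is why its content can be read with read(). Besides read(), it provides the following methods:
geturl(): returns the URL that was actually retrieved (it can differ from the requested URL after a redirect)
info(): returns an http.client.HTTPMessage object (httplib.HTTPMessage in Python 2) containing the headers sent by the remote server
getcode(): returns the HTTP status code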
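The three methods can be tried together with a small sketch like the one below. describe_response and the dictionary keys are illustrative names chosen here, not part of urllib. Note that recent Python versions (3.9 and later, as far as I know) mark these methods as deprecated in favor of the url, headers, and status attributes, although the methods still work.

```python
import urllib.request


def describe_response(url):
    """Collect basic metadata about a response (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        headers = response.info()  # header object sent by the server
        return {
            "final_url": response.geturl(),  # URL after any redirects
            "status": response.getcode(),    # e.g. 200 for an HTTP success
            "content_type": headers.get_content_type(),
            "headers": dict(headers.items()),
        }
```

For example, describe_response("http://www.fishc.com") should return a dictionary whose "status" entry is 200 if the request succeeds; the remaining values depend on what the server sends back.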