Scraping Data with urllib

Latest recommended article published on 2020-10-28 20:53:04

幺猫折耳鹿

Category: Web Scraping. Tags: python, web scraping, urllib

Article link: https://blog.csdn.net/qq_36292182/article/details/102944657

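Introduction to urllib

In Python 3, urllib is a package that bundles several modules for working with URLs:

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files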
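
As a rough illustration of the two helper modules, the sketch below parses a URL, builds a query string, and checks a robots.txt rule without touching the network. The path `/top250`, the query parameters, and the robots.txt rules are assumptions made up for the example; they do not come from any real site configuration.

```python
from urllib.parse import urlparse, urlencode
from urllib.robotparser import RobotFileParser

# Split a URL into its components (the path and query are hypothetical).
parts = urlparse('https://movie.douban.com/top250?start=25')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Build a query string from a dict (hypothetical parameters).
query = urlencode({'start': 25, 'filter': 'top'})
print(query)  # start=25&filter=top

# Feed robots.txt rules in as lines instead of downloading them.
# These rules are invented for the example.
rules = RobotFileParser()
rules.parse(['User-agent: *', 'Disallow: /private/'])
blocked = rules.can_fetch('*', 'https://example.com/private/page')
allowed = rules.can_fetch('*', 'https://example.com/public/page')
print(blocked, allowed)  # False True
```

In a real crawler, `RobotFileParser.set_url()` followed by `read()` would load the site's actual robots.txt instead of the hand-written lines used here.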

Sending a request

urllib.request.urlopen(url,data=None,[timeout,]*)

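url: the URL of the site to request
data: defaults to None, in which case the request is sent as GET; if data is supplied, the request is sent as POST
timeout: how long to wait, in seconds, before the request times out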
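
The parameters above might be combined as in the following sketch. The helper name `fetch` and its `form` argument are hypothetical and exist only for this illustration. The key point is that `data` must be bytes, so form fields are first encoded with `urllib.parse.urlencode` and then with `.encode()`.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

def fetch(url, form=None, timeout=5):
    """Return the decoded response body, or None if the request fails.

    Hypothetical helper: `form` is an optional dict of form fields.
    """
    data = None
    if form is not None:
        # data must be bytes; supplying it makes urlopen send a POST.
        data = urllib.parse.urlencode(form).encode('utf-8')
    try:
        with urllib.request.urlopen(url, data, timeout) as response:
            return response.read().decode('utf-8')
    except (urllib.error.URLError, socket.timeout) as exc:
        # HTTPError is a subclass of URLError, so it is caught here too.
        print('request failed:', exc)
        return None
```

Calling `fetch('https://movie.douban.com/')` would send a GET, while passing a dict as `form` would switch the same call to POST.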

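The following example uses urllib to send a request to a website and write the response body to a text file:

# Import urllib
import urllib.request
# Open the URL (no POST data, 2-second timeout)
response = urllib.request.urlopen('https://movie.douban.com/', None, 2)
# Read the response body and decode it as UTF-8
html = response.read().decode('utf8')
# Write the text to a file; the with block closes the file automatically
with open('html.txt', 'w', encoding='utf8') as f:
    f.write(html)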
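
The example above assumes the page is encoded as UTF-8. A possible variation, sketched below, reads the charset from the response's Content-Type header and falls back to a default when none is declared. The function name `read_text` and its `default_charset` parameter are hypothetical.

```python
import urllib.request

def read_text(url, timeout=2, default_charset='utf-8'):
    """Fetch a URL and decode the body using the charset the server declares."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.headers is an email.message.Message; get_content_charset()
        # returns None when the Content-Type header names no charset.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

This only helps when the server declares the charset in its headers; pages that declare it solely in an HTML meta tag would still need separate handling.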

More complex requests

urllib.request.Request(url, data=None, headers={}, method=None)

headers: sets the request header fields
method: sets the HTTP method, for example POST or GET
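
To see how `data` and `method` interact, a Request object can be inspected before it is sent. The sketch below makes no network call; the form field `q` and the User-Agent string `my-crawler/0.1` are placeholders invented for the example.

```python
import urllib.parse
import urllib.request

url = 'https://movie.douban.com/'

# Without data, the request defaults to GET.
get_req = urllib.request.Request(url)
print(get_req.get_method())  # GET

# With data (which must be bytes), the request defaults to POST.
body = urllib.parse.urlencode({'q': 'test'}).encode('utf-8')
post_req = urllib.request.Request(url, data=body)
print(post_req.get_method())  # POST

# An explicit method overrides the default.
head_req = urllib.request.Request(
    url, headers={'User-Agent': 'my-crawler/0.1'}, method='HEAD')
print(head_req.get_method())  # HEAD

# Request stores header names capitalized, so look up 'User-agent'.
print(head_req.get_header('User-agent'))  # my-crawler/0.1
```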

# Import urllib
import urllib.request
url = 'https://movie.douban.com/'
# Custom request headers
headers = {
    # Adjacent string literals are joined, so the first part ends with a space
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': 'https://movie.douban.com/',
    'Connection': 'keep-alive'}
# Build a Request that carries the custom headers
req = urllib.request.Request(url, headers=headers)
# Open the Request with urlopen and decode the response body
html = urllib.request.urlopen(req).read().decode('utf8')
# Write the text to a file; the with block closes the file automatically
with open('html.txt', 'w', encoding='utf8') as f:
    f.write(html)