Web Crawler Lesson 1: Crawling with urllib.request

Purple Coast

Category: Web crawlers

Article link: https://blog.csdn.net/weixin_44165224/article/details/95113450

1. Fetching the Baidu home page directly

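# Import the module (aliased as urllib2, the module's name in Python 2)
import urllib.request as urllib2

# Send a request to the given URL; urlopen() returns the server's response as a file-like object
response = urllib2.urlopen('http://www.baidu.com/')

# The file-like object supports file methods such as read(), which returns the whole body as bytes
html = response.read()

# Decode the bytes into a str
html = html.decode(encoding='utf-8')

# Print the returned content
print(html)

# Print a progress message
print('Saving...')

# Save the returned content as an HTML file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(html)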
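The example above hard-codes UTF-8 and never closes the response. A slightly more defensive variant is sketched below. The helper name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration and are not part of the original article. It reads the charset from the response's Content-Type header when the server provides one.

```python
import urllib.request


def fetch_text(url, timeout=10, fallback_encoding="utf-8"):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    # The with-statement closes the response even if read() fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Use the charset declared in Content-Type, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text('http://www.baidu.com/')` should then return the same page as the code above, already decoded.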

2. Fetching the Baidu home page with a Request object

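import urllib.request as urllib2

# Pass the URL to Request() to construct and return a Request object
request = urllib2.Request('http://www.baidu.com/')

# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib2.urlopen(request)

# Read the response body as bytes
html = response.read()

# Print the raw bytes (no decoding in this example)
print(html)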

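Both approaches fetch the same page content. The only difference in the output is that the second example prints the raw bytes, because it skips the decode step.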

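When creating a Request instance, the url argument is required. Two other arguments can also be set:
data (default None): the data submitted along with the url (for example, the data to POST). Supplying it also switches the HTTP request method from "GET" to "POST".
headers (default empty): a dictionary of key-value pairs for the HTTP headers to send.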
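The effect of the two arguments can be checked without sending anything, since a Request object only describes the request. This is a minimal sketch; the form field `key`/`value` and the shortened User-Agent string are placeholders, not values from the original article.

```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com/"

# Without data, the request method is GET.
get_request = urllib.request.Request(url)
print(get_request.get_method())   # GET

# Hypothetical form fields, used only for illustration.
form = {"key": "value"}
# In Python 3, data must be bytes, so URL-encode the dict and then encode it.
payload = urllib.parse.urlencode(form).encode("utf-8")

# Supplying data switches the method to POST.
post_request = urllib.request.Request(
    url,
    data=payload,
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder value
)
print(post_request.get_method())  # POST
print(post_request.data)          # b'key=value'
```

Nothing is sent to the server until the Request object is passed to `urlopen()`.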

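3. Sending a User-Agent

Browsers are the clients that websites generally accept. To make a crawler look more like a real user, we disguise it as a well-known browser. Different browsers send different User-Agent headers with their requests.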
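For context, when no User-Agent is set, urllib identifies itself with its own name and version, which makes a script easy for a server to spot. The short sketch below inspects that default without making any network request. The variable names are chosen for illustration only.

```python
import urllib.request

# A freshly built opener carries urllib's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value has the form 'Python-urllib/<version>'.
print(default_headers["User-agent"])
```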

# Send a request with a User-Agent

import urllib.request as urllib2

# Store the URL in a variable
url = 'http://www.itcast.cn/'

# The User-Agent goes into the headers dictionary
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0'}

# Build the Request from the URL and the headers; the request will carry the User-Agent
request = urllib2.Request(url, headers=header)

# Send the request to the server
response = urllib2.urlopen(request)

# Read the response body as bytes
html = response.read()

# Decode the bytes into a str
html = html.decode(encoding='utf-8')

print(html)

# Save the page as an HTML file
with open('itcast.html', 'w', encoding='utf-8') as f:
    f.write(html)

4. Adding more header information

# Add more header information

import urllib.request as urllib2

url = 'http://www.itcast.cn/'

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0'}

request = urllib2.Request(url, headers=header)

# Call Request.add_header() to add or modify a specific header
request.add_header('Connection', 'keep-alive')

# Call Request.get_header() to inspect a header's value
print(request.get_header(header_name="Connection"))

response = urllib2.urlopen(request)

# Check the response status code (response.code also works, but is deprecated in favor of status)
print(response.status)

# Read the response body as bytes
html = response.read()

print(html)
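One detail worth knowing when reading headers back: Request stores each header name with only its first letter capitalized, so a lookup for 'User-Agent' finds nothing while 'User-agent' succeeds. The sketch below shows this without sending a request. The shortened User-Agent value is a placeholder.

```python
import urllib.request

request = urllib.request.Request(
    "http://www.itcast.cn/",
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder value
)
request.add_header("Connection", "keep-alive")

# 'Connection' is unchanged by the normalization, so the lookup works.
print(request.get_header("Connection"))   # keep-alive

# 'User-Agent' was stored as 'User-agent', so the original spelling is not found.
print(request.get_header("User-Agent"))   # None
print(request.get_header("User-agent"))   # Mozilla/5.0

# header_items() lists every header under its stored name.
stored = dict(request.header_items())
print(stored)
```

As far as I know, urllib also overrides the Connection header with `close` when it actually sends the request, so the `keep-alive` value here mainly serves to demonstrate `add_header()`.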
