1. Fetching Baidu's HTML source with urllib
import urllib.request
# Define the URL
url = 'http://www.baidu.com'
# Send the request to the server, as a browser would
response = urllib.request.urlopen(url)
content = response.read().decode('utf-8')
print(content)
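A slightly more defensive version of the same fetch might look like the sketch below. The helper name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration; they are not part of the original notes.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper: the name, timeout and fallback charset are assumptions.
    """
    try:
        # The with-block closes the connection even if read() raises.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Prefer the charset declared in the Content-Type header, if any.
            charset = response.headers.get_content_charset() or default_charset
            return response.read().decode(charset)
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP errors land here too.
        print(f"request failed: {exc.reason}")
        return None
```

Calling fetch_text('http://www.baidu.com') should then return the page as a str, or None when the request fails.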
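Methods of the response object
response.read(5)  reads up to five bytes
response.read()  reads the whole remaining body as bytes
response.readline()  reads one line
response.readlines()  reads all remaining lines and returns them as a list
print( response.getheaders() )  prints the response headers as a list of (name, value) pairs
print( response.getcode() )  prints the HTTP status code; 200 means the request succeeded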
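These methods all consume the same stream, so each call continues where the previous one stopped. The sketch below shows this against a throwaway local server, so it runs without internet access. The server, DemoHandler, and the BODY content are assumptions added for the demo; with a real site only the url changes.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page served by the local demo server.
BODY = b"<!DOCTYPE html>\n<html>\n<body>hello</body>\n</html>\n"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# Ignore any system proxy so the request reaches the local server directly.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

with urllib.request.urlopen(url) as response:
    status = response.getcode()               # HTTP status code
    headers = dict(response.getheaders())     # list of (name, value) pairs
    first_five = response.read(5)             # first five bytes of the body
    rest_of_line = response.readline()        # continues after those five bytes
    remaining_lines = response.readlines()    # everything that is left, by line

server.shutdown()
server.server_close()

print(status, headers["Content-Type"])
print(first_five, rest_of_line, remaining_lines)
```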
Downloading with urllib
import urllib.request
url = 'http://img-blog.csdnimg.cn/68e9d9890d99472395cb2de7d36234cb.jpeg?x-oss-process=image/resize,m_fixed,h_300,image/format,png'
response = urllib.request.urlretrieve(url,filename='csdn.jpg')
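This saves the image at that URL to csdn.jpg. urlretrieve returns a (filename, headers) tuple, not the image data.
Note the argument order: (url, filename).
Videos, web pages, and other resources can be downloaded the same way.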
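The return value and the optional reporthook argument of urlretrieve can be tried out offline, as in the sketch below. It assumes a temporary local file, addressed through a file:// URL, as a stand-in for a real image URL. The file names and the progress list are made up for the example.

```python
import tempfile
import urllib.request
from pathlib import Path

# Stand-in for a remote resource: a small local file exposed as a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.bin"
source.write_bytes(b"fake image bytes")
target = workdir / "copy.bin"

progress = []


def report(block_num, block_size, total_size):
    # urlretrieve calls this once at the start and once after each block read.
    progress.append((block_num, block_size, total_size))


# Same argument order as in the notes: (url, filename).
saved_path, headers = urllib.request.urlretrieve(
    source.as_uri(), filename=str(target), reporthook=report
)

print(saved_path)                 # the filename that was passed in
print(headers["Content-Length"])  # size reported for the source
print(len(progress), "progress callbacks")
```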
Disguising the request
Without any disguise (that is, without customizing the request), the code looks like this:
import urllib.request
url = 'https://www.baidu.com'
response = urllib.request.urlopen(url)
content = response.read().decode('utf8')
print(content)
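For this HTTPS (SSL) site, the unmodified request gets back only a stub page, little more than the head section of the HTML. The most likely reason is that urllib's default User-Agent (Python-urllib/x.y) tells the server the request did not come from a browser.
So the request has to be customized to carry a real browser's User-Agent: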
headers = {'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'}
import urllib.request
url = 'https://www.baidu.com'
headers = {'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'}
request = urllib.request.Request(url=url,headers=headers)
# Note the keyword arguments url=url, headers=headers: the second positional parameter of Request is data, not headers
response = urllib.request.urlopen(request)
content = response.read().decode('utf8')
print(content)
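The reason for the keyword arguments can be checked without sending anything. The sketch below builds Request objects and only inspects them. The variable names default_agent, good, and bad are illustrative, and the exact default User-Agent string depends on the Python version.

```python
import urllib.request

url = "https://www.baidu.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0"
}

# What urllib sends when no User-Agent is supplied ("Python-urllib/<version>").
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

# Keyword arguments: the dict becomes the request headers, the method stays GET.
good = urllib.request.Request(url=url, headers=headers)

# Positional mistake: the second positional parameter is `data`, so the dict is
# treated as a request body, no headers are set, and the method becomes POST.
bad = urllib.request.Request(url, headers)

print(default_agent)
# Request stores header names capitalized, hence "User-agent" here.
print(good.get_method(), good.get_header("User-agent"))
print(bad.get_method(), bad.headers)
```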