1. Import the requests library and etree from the lxml library
from lxml import etree
import requests
2. Analyze the page to obtain the url and headers, then request the page content with requests.get()
The code is:
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'}
url = 'https://movie.douban.com/subject/27119724/'
resp = requests.get(url,headers=headers)
resp.encoding = 'utf-8'
print(resp.text)
The URL is the address of the page being scraped:
url = 'https://movie.douban.com/subject/27119724/'
Sometimes calling requests.get(url) on its own fails to return the page's HTML. Adding request headers to the call usually solves this.
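One likely reason a bare request gets rejected is that requests announces itself with its own default User-Agent instead of a browser's. The sketch below only illustrates that difference and does not send anything over the network. The shortened User-Agent string and the variable names default_ua, prepared, and sent_ua are assumptions made for this example.

```python
import requests

# Same page as in the tutorial above.
url = 'https://movie.douban.com/subject/27119724/'

# With no custom headers, requests identifies itself as "python-requests/<version>".
default_ua = requests.utils.default_headers()['User-Agent']

# A browser-like User-Agent (shortened for the example; an assumption, not the full string).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build the request without sending it, so the outgoing headers can be inspected.
prepared = requests.Request('GET', url, headers=headers).prepare()
sent_ua = prepared.headers['User-Agent']

print('default:', default_ua)
print('custom: ', sent_ua)
```

When the real request is sent, checking resp.status_code before reading resp.text is a quick way to see whether the headers made a difference.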
Request headers (headers)