Web Scraping Study Notes 1: Basic Approach
Source code is from 崔庆才 (Cui Qingcai), 《Python3网络爬虫开发实战》.
Fetching the page HTML
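Use the get(url, headers=headers) method of the requests library. headers is a dict of HTTP request headers; here it carries only the User-Agent string, which you can look up by entering "about:version" in the browser's address bar (this works in Chromium-based browsers). The code is as follows: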
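import requests
from requests.exceptions import RequestException

def get_one_page(url):
    # Fetch the page at url; return its HTML text, or None on failure.
    try:
        headers = {
            # The trailing spaces keep the User-Agent tokens separated after concatenation.
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
            + 'AppleWebKit/537.36 (KHTML, like Gecko) '
            + 'Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3742.400 '
            + 'QQBrowser/10.5.3864.400'
        }
        response = requests.get(url, headers=headers)
        # Only a 200 response counts as success.
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # Connection errors, timeouts, invalid URLs, and similar failures.
        return None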
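The same idea can be sketched with two optional additions that are not in the book's code: a timeout, so a stalled server cannot block the scraper forever, and a replaceable client object, so the function can be tried without touching the network. The names fetch_html, DEFAULT_HEADERS, and the session parameter are hypothetical and chosen only for this illustration; the shortened User-Agent value is also just a placeholder.

```python
import requests

# Placeholder User-Agent for illustration; substitute the value your own browser reports.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/70.0.3538.25 Safari/537.36"
}


def fetch_html(url, session=None, timeout=10):
    """Return the page HTML as text, or None on a non-200 status or a request error.

    session is a hypothetical hook: any object with a requests-style get()
    method. When it is omitted, the requests module itself is used.
    """
    client = session if session is not None else requests
    try:
        # timeout (in seconds) is an addition; the original call did not set one.
        response = client.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    return response.text
```

A call such as fetch_html("https://example.com") returns the HTML string on success, so the caller only needs to check for None before handing the result to a parser.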