Web Scraping
Basic scraping exercise
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))
Timeout handling

import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("http://www.baidu.com", timeout=0.1)
except urllib.error.URLError as e:
    print("Timed out!")  # printed if the request takes longer than 0.1 s
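Note that `URLError` covers more than timeouts (DNS failures, refused connections, and so on). To confirm the failure really is a timeout, the error's `reason` attribute can be inspected; a minimal sketch, assuming the same URL and 0.1 s timeout:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("http://www.baidu.com", timeout=0.1)
except urllib.error.URLError as e:
    # URLError wraps the underlying cause in .reason;
    # for a timeout it is a socket.timeout instance
    if isinstance(e.reason, socket.timeout):
        print("Timed out!")
    else:
        print("Other network error:", e.reason)
```

(Since Python 3.10, `socket.timeout` is an alias of the built-in `TimeoutError`, so the `isinstance` check works on both old and new versions.)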
Fooling the server
HTTP status 418 means the server is refusing the crawler
import urllib.request

url = "http://httpbin.org/get"
response = urllib.request.urlopen(url)
print(response.read().decode("utf-8"))
In the returned JSON, the User-Agent header plainly identifies us as Python. To avoid that, we can send the form data via POST and set User-Agent to the value a real browser sends.
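When a site does block the default Python user agent, `urlopen` raises `urllib.error.HTTPError`, which carries the status code, so a 418 can be detected explicitly. A hedged sketch (the `fetch` helper is mine, not from the notes):

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return (status, body), or (status, None) when the server refuses us."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # 418 "I'm a teapot" is how some sites reject obvious crawlers
        return e.code, None
```

`HTTPError` is a subclass of `URLError`, so if both are caught, the `HTTPError` branch must come first.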
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
}
# the data payload appears in the "form" field of the response
data = bytes(urllib.parse.urlencode({"name": "123"}), encoding="utf-8")
# build the request object
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
The response now shows the browser User-Agent.
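The same `Request` object works for GET as well: with no `data` argument the method defaults to GET, so the browser User-Agent can be attached without a form body. A minimal sketch against the same httpbin endpoint (no request is actually sent here):

```python
import urllib.request

url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# with no data argument, Request defaults to the GET method
req = urllib.request.Request(url, headers=headers)
print(req.get_method())              # GET
print(req.get_header("User-agent"))  # the browser string set above
```

Passing `req` to `urllib.request.urlopen` would then issue the GET with the spoofed header, exactly as in the POST example.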