Python Web Crawlers: Connecting to a Web Server

Latest recommended article published on 2022-05-16 20:18:09

雪小妮

Original link: http://www.tup.com.cn

I. How a web server handles a request, end to end

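1. Input: the URL

A URL that points to a specific file has the form:

http(https)://domain:port/directory/filename.extension

A URL that points to a directory has the form:

http(https)://domain:port/directory/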
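The URL layout above can be checked with the standard library. The following is a minimal sketch using `urllib.parse.urlsplit`; the helper name `describe_url`, the example hosts, and the default-port table are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into scheme, host, port, directory and file name."""
    parts = urlsplit(url)
    # Assumption: fall back to the well-known port when the URL omits one.
    default_ports = {"http": 80, "https": 443}
    port = parts.port or default_ports.get(parts.scheme)
    # Everything up to the last "/" is the directory; the rest is the file name.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "directory": directory + "/",
        "filename": filename,  # empty string when the URL ends with "/"
    }


file_url = describe_url("https://www.example.com:8443/docs/index.html")
dir_url = describe_url("http://www.example.com/docs/")
print(file_url)
print(dir_url)
```

The second call shows the directory form: the file name comes back empty and the port falls back to 80 because none was given.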

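2. Processing

A crawler that fetches multiple pages from the same site only needs to parse robots.txt once. In HTTP/1.1, a Connection header set to keep-alive means the connection stays open and can be reused for later requests, so the server does not close it right after sending each response (it may still close a connection that stays idle too long).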
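As a rough illustration of "parse robots.txt once, then reuse it for every page", here is a sketch built on the standard library's `urllib.robotparser`. The robots.txt content, the crawler name `MyCrawler`, and the example URLs are made up for this example; a real crawler would download the file once from the site (for instance with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules a single time...
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# ...then reuse the same parser object for every candidate page.
pages = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
    "http://example.com/news/today.html",
]
allowed = [page for page in pages if rp.can_fetch("MyCrawler", page)]
print(allowed)
```

Only the pages outside `/private/` remain in `allowed`, and the rules were parsed once no matter how many URLs were checked.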

II. Using requests and the response object

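requests.request(): builds and sends a request; the methods below are shortcuts built on it

requests.get(): fetches an HTML page with the GET method

requests.head(): fetches only the header information of an HTML page

requests.post(): submits a POST request to an HTML page

requests.put(): submits a PUT request to an HTML page

requests.patch(): submits a partial-modification (PATCH) request to an HTML page

requests.delete(): submits a DELETE request to an HTML page

requests.Session(): keeps certain parameters (such as cookies and headers) across separate requests to the web server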
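To make the role of `requests.Session()` more concrete, the sketch below prepares two requests through one session without sending anything over the network, so it can be run offline. The `example.com` URLs, the `lang` parameter, and the helper name `prepare` are assumptions for illustration only.

```python
import requests

# Settings stored on the session apply to every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
session.params = {"lang": "en"}  # hypothetical parameter shared by all requests


def prepare(method, url, **kwargs):
    """Build the request the session would send, without sending it."""
    return session.prepare_request(requests.Request(method, url, **kwargs))


get_req = prepare("GET", "http://example.com/items", params={"page": 2})
post_req = prepare("POST", "http://example.com/items", data={"name": "demo"})

print(get_req.method, get_req.url)
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["User-Agent"])
```

Both prepared requests carry the session's User-Agent header and its shared query parameter. When requests are actually sent through a session (for example with `session.get(...)`), it also reuses the underlying connections, which fits the keep-alive behaviour described earlier.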

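Request parameters:

params: extra parameters appended to the URL as a query string

proxies: a dict that maps a protocol to the proxy server used for it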

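import requests

# Target URL and query parameters (sent as ?cmd=search&symbol=APP)
url = 'http://thelion.com/bin/aio_msg.cgi'
headers = {'User-Agent': 'Mozilla/5.0'}
kw = {'cmd': 'search', 'symbol': 'APP'}

# Send the GET request; give up if the server does not answer within 10 seconds
response = requests.get(url, headers=headers, params=kw, timeout=10)
response.encoding = 'utf-8'  # decode the response body as UTF-8
print(response.text)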
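To see what `params` does to the URL, the request can be prepared without being sent. This is a small offline sketch; the proxy addresses in it are placeholders, not real servers, and the final call is left commented out because it would need a working proxy.

```python
import requests

url = 'http://thelion.com/bin/aio_msg.cgi'
kw = {'cmd': 'search', 'symbol': 'APP'}

# Prepare (but do not send) the request to inspect the final URL.
prepared = requests.Request('GET', url, params=kw).prepare()
print(prepared.url)

# Shape of the proxies dict: protocol -> proxy URL.
# Assumption: 127.0.0.1:8080 stands in for whatever proxy is actually used.
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}
# A real call would pass both arguments, for example:
# requests.get(url, params=kw, proxies=proxies, timeout=10)
```

The printed URL shows the key/value pairs from `kw` appended after a `?` and joined with `&`.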

III. Handling errors and exceptions

import requests
from requests.exceptions import Timeout, ConnectionError, RequestException

# Two candidate targets; this example requests url2
url = 'https://baijiahao.baidu.com/s?id=1666848455598255839&wfr=spider&for=pc'
url2 = 'http://www.fudan.edu.cn/'
try:
    req = requests.get(url2, timeout=5)
    print(req.status_code)
except Timeout:
    # Timed out while connecting or reading (covers ConnectTimeout and ReadTimeout)
    print('Timeout')
except ConnectionError:
    # Network-level failure (DNS failure, refused connection, etc.)
    print('Connection error')
except RequestException:
    # Any other error raised by requests
    print('Error')
else:
    if req.status_code == 200:
        print('Request succeeded!')
        # Save the fetched page locally
        with open('t.html', 'wb') as fb:
            fb.write(req.content)
    elif req.status_code == 404:
        print('Page not found')
    elif req.status_code == 403:
        print('Access to the page is forbidden!')
    elif req.status_code == 503:
        print('Page temporarily unavailable!')
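The chain of status-code checks above can also be expressed as a lookup table, which makes it easier to add more codes later. This is one possible sketch using only the standard library; the helper name `describe_status` and the message wording are assumptions for illustration.

```python
from http import HTTPStatus

# Messages for the status codes handled in the example above.
MESSAGES = {
    200: "Request succeeded",
    403: "Access to the page is forbidden",
    404: "Page not found",
    503: "Page temporarily unavailable",
}


def describe_status(code):
    """Return a readable message for an HTTP status code."""
    if code in MESSAGES:
        return MESSAGES[code]
    try:
        # Fall back to the standard reason phrase for other known codes.
        return f"Unhandled status {code}: {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"Unknown status {code}"


for code in (200, 404, 500, 999):
    print(code, describe_status(code))
```

With a real response object this would be called as `describe_status(req.status_code)`. As an alternative, `requests` also provides `Response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses.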