Python Crawlers: The General-Purpose Crawler

雪小妮

Article link: https://blog.csdn.net/qq_35249586/article/details/106297696

1. Architecture of a general-purpose crawler

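Web server connection: the crawler sends a request to the specified web server (a GET or POST request in the Requests library), which establishes a network connection between the crawler and the server. This connection is the channel for sending URLs and receiving responses.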

DNS cache: stores the results of earlier domain-name lookups to reduce the time spent mapping domain names to IP addresses.
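
A minimal sketch of such a cache, assuming a hypothetical `DNSCache` class that is not part of the original post. It wraps the standard library's `socket.gethostbyname` and keeps each answer for a fixed time-to-live; the 300-second default is an arbitrary choice.

```python
import socket
import time


class DNSCache:
    """Tiny in-memory hostname -> IP cache with a time-to-live (hypothetical helper)."""

    def __init__(self, ttl_seconds=300, resolver=socket.gethostbyname):
        self.ttl_seconds = ttl_seconds
        self.resolver = resolver   # injectable, so the cache can be exercised offline
        self._entries = {}         # hostname -> (ip, expiry timestamp)

    def resolve(self, hostname):
        now = time.monotonic()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]       # cache hit: no DNS lookup needed
        ip = self.resolver(hostname)   # cache miss or expired entry: real lookup
        self._entries[hostname] = (ip, now + self.ttl_seconds)
        return ip
```

A crawler would create one `DNSCache()` and call `cache.resolve(host)` before connecting, so repeated requests to the same site skip the lookup until the entry expires.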

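URL filtering and extraction: the page parser analyzes each fetched HTML file and extracts the URLs it contains. The extracted URLs are then filtered by basic rules, such as whether robots.txt permits access and whether the URL has already been crawled.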
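
The extract-then-filter step can be sketched with the standard library alone. The names `LinkExtractor` and `extract_and_filter` are hypothetical, and the sketch assumes all links belong to the site whose robots.txt rules were loaded into the `RobotFileParser`.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_and_filter(html, base_url, robots, seen, user_agent="*"):
    """Return new, robots-permitted http(s) links and record them in `seen`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    accepted = []
    for link in parser.links:
        if not link.startswith(("http://", "https://")):
            continue                      # skip mailto:, javascript:, and similar
        if link in seen:
            continue                      # already crawled or already queued
        if not robots.can_fetch(user_agent, link):
            continue                      # disallowed by robots.txt
        seen.add(link)
        accepted.append(link)
    return accepted
```

In a real crawler the rules would usually be loaded with `RobotFileParser.set_url(...)` followed by `read()`; `parse(lines)` accepts the same rules as a list of lines, which is convenient for trying the function without network access.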

Crawl strategy: depth-first, breadth-first, importance ranking based on PageRank, On-Line Page Importance Computation (OPIC), and so on.
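
The first two strategies differ only in how the next URL is taken from the frontier. The sketch below illustrates this on an in-memory link graph instead of real pages; `crawl_order` and the graph are assumptions made for the example, and the PageRank and OPIC orderings are not covered.

```python
from collections import deque


def crawl_order(link_graph, seed, strategy="breadth"):
    """Return the visit order over `link_graph`, a dict mapping URL -> list of linked URLs."""
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        # Breadth-first treats the frontier as a queue, depth-first as a stack
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in visited:
                frontier.append(link)
    return order


# Toy site: A links to B and C, B links to D, C links to E
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
print(crawl_order(graph, "A", "breadth"))
print(crawl_order(graph, "A", "depth"))
```

Because the depth-first variant pops from the end of the frontier, it follows the last link found on each page first.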

2. Flowchart of a general-purpose crawler

Extracting and saving the content of POST and GET requests

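import requests

url = 'https://www.baidu.com'
headers = {'User-Agent': 'Baiduspider'}
# Query-string parameters appended to the URL of the GET request
kw = {'wd': '哪吒'}

# GET request
response = requests.get(url, headers=headers, params=kw)
print(response.url)          # final URL, including the encoded query string
print(response.headers)      # headers of the response returned by the server
print(response.status_code)  # HTTP status code

# Two ways to read the page source: content and text
# content: the raw response body as bytes; decode() assumes UTF-8 by default
'''print(response.content.decode())'''

# text: the body decoded with response.encoding, which requests takes from the
# response headers; for text responses without a declared charset it falls back
# to ISO-8859-1. Set response.encoding = 'utf-8' to override it.
'''
print(response.encoding)     # encoding inferred from the response headers
response.encoding = 'utf-8'  # override the encoding
print(response.encoding)
print(response.text)
'''
# Save the response body to a file
data = response.content
with open('d.txt', 'wb') as fb:
    fb.write(data)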

# POST request
url1 = 'https://www.taobao.com'
formdata = {'wb': '小猪佩奇'}  # form fields sent in the POST body
headers1 = {'User-Agent': 'Googlebot'}
response1 = requests.post(url1, headers=headers1, data=formdata)
print(response1.url)  # URL of the response
# Save the response body to a file
data1 = response1.content
with open('t.html', 'wb') as fb1:
    fb1.write(data1)
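
The encoding note in the GET example can be checked without a network request. The sketch below uses a hand-made byte string as a stand-in for `response.content` (an assumption, not data from a real response) to show why decoding UTF-8 bytes as ISO-8859-1 produces garbled text.

```python
# Stand-in for response.content: the UTF-8 bytes of a Chinese string (assumed sample data)
raw = '哪吒'.encode('utf-8')

as_utf8 = raw.decode('utf-8')          # decoding with the right encoding
as_latin1 = raw.decode('iso-8859-1')   # what a wrong encoding guess produces

print(as_utf8)
print(repr(as_latin1))

# No bytes are lost by the wrong guess, so the text can still be recovered
recovered = as_latin1.encode('iso-8859-1').decode('utf-8')
print(recovered == as_utf8)
```

This is why setting `response.encoding = 'utf-8'` before reading `response.text` fixes pages that would otherwise show up as mojibake.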
