A Brief Overview of Web Crawlers in Python

Latest recommended article published 2023-05-20 15:04:05

coding_xian

Category: python

Article link: https://blog.csdn.net/Carl_changxin/article/details/100566792

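1. Types of web crawlers

(1) General-purpose web crawlers (also called whole-web crawlers)

These crawl a huge range and number of pages, so they place high demands on crawl speed and storage space, while the order in which pages are crawled matters relatively little. Because so many pages are waiting to be refreshed, they work in parallel, and even so it takes a long time before any given page is refreshed again. This kind of crawler is used mainly by large search engines.

(2) Focused web crawlers (also called topical web crawlers)

These selectively crawl pages related to a predefined topic. Unlike a general-purpose crawler, a focused crawler does not target the whole internet; it restricts its targets to pages relevant to the topic.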
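
As a rough illustration of "selectively crawling by topic", the sketch below decides whether a link is worth following by counting topic keywords in its anchor text. The function name `is_relevant`, the keyword list, and the example URLs are all assumptions made for this example; real focused crawlers typically use more elaborate relevance scoring.

```python
def is_relevant(text, keywords, min_hits=1):
    """Return True if at least `min_hits` topic keywords occur in `text`."""
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    return hits >= min_hits


# Hypothetical topic and candidate links (URL -> anchor text), for illustration only.
TOPIC_KEYWORDS = ["python", "crawler"]
candidates = {
    "http://example.com/python-crawler-tutorial": "Python crawler tutorial",
    "http://example.com/cooking": "Ten easy dinner recipes",
}

# Keep only the links whose anchor text matches the topic.
to_crawl = [url for url, anchor in candidates.items()
            if is_relevant(anchor, TOPIC_KEYWORDS)]
print(to_crawl)
```

Raising `min_hits` makes the filter stricter, at the cost of missing pages that mention the topic only once.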

(3) Incremental web crawlers

These crawl only pages that are newly created or have been updated, and skip pages that have not changed.
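
One simple way to tell whether a page has changed is to store a hash of its content and compare it on the next visit. The sketch below assumes the page bytes are already downloaded; `fingerprint`, `needs_update`, and the in-memory `store` dictionary are names invented for this example (a real crawler would persist the fingerprints, and could also rely on HTTP headers such as ETag or Last-Modified).

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest that identifies this exact page content."""
    return hashlib.sha256(content).hexdigest()


def needs_update(url, content, store):
    """Return True if the page is new or changed, and record its new fingerprint."""
    digest = fingerprint(content)
    if store.get(url) == digest:
        return False          # same content as last time: skip it
    store[url] = digest       # new or changed: remember the latest version
    return True


store = {}  # url -> fingerprint of the last version that was processed
print(needs_update("http://example.com/", b"<html>v1</html>", store))  # first visit
print(needs_update("http://example.com/", b"<html>v1</html>", store))  # unchanged
print(needs_update("http://example.com/", b"<html>v2</html>", store))  # changed
```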

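(4) Deep web crawlers

Web pages are divided into surface pages and deep pages. Surface pages are static pages that can be reached directly through static hyperlinks, without submitting a form. Deep pages are pages that can only be obtained after the user submits some keywords through a form.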
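
To reach a deep page, a crawler has to submit the form itself. The sketch below only builds such a request, without sending it; the form URL `http://example.com/search` and the field name `q` are assumptions for illustration, and a real form's action URL and field names have to be read from the page's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(form_url, keyword):
    """Build (but do not send) a POST request that submits a keyword to a form."""
    body = urlencode({"q": keyword}).encode("utf-8")  # form fields as URL-encoded bytes
    return Request(form_url, data=body, method="POST")


req = build_search_request("http://example.com/search", "web crawler")
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would actually submit the form.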

2. Basic principle of web crawlers

The basic workflow of a general-purpose web crawler:
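
A commonly described workflow is: start from seed URLs, take a URL from a queue, download the page, extract its links, and add the unseen ones back to the queue until the queue is empty or a limit is reached. The sketch below follows that outline and is not necessarily the exact flow the author had in mind. To stay runnable offline, it takes the download step as a `fetch` function and uses a small in-memory `FAKE_SITE` dictionary; both, along with `LinkExtractor` and `crawl`, are names made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)      # URLs already queued, to avoid repeats
    pages = {}                 # url -> downloaded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue           # download failed: move on to the next URL
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": "<p>no links</p>",
}

pages = crawl(["http://example.com/"], FAKE_SITE.get)
print(sorted(pages))
```

In a real crawler, `fetch` would wrap one of the HTTP libraries described in the next section.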

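3. Common techniques for web crawlers

(1) A web crawler has to work with HTTP. There are three common ways to make HTTP requests in Python: urllib, urllib3, and requests.

1. The urllib module (part of the Python standard library)

Submodules of urllib
urllib.request	Defines functions and classes for opening URLs, including support for authentication, redirects, and cookies
urllib.error	Contains the exception classes; the base exception class is URLError
urllib.parse	Provides two groups of functions: URL parsing and URL quoting
urllib.robotparser	Parses robots.txt files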
import urllib.request
import urllib.parse

# GET request: fetch the page content
response = urllib.request.urlopen("http://www.baidu.com")
html = response.read()  # raw bytes; call html.decode('utf-8') to get text
print(html)

# POST request: the form data must be URL-encoded and passed as bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
print(html)
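
The code above only uses urllib.request, plus urlencode from urllib.parse. As a small supplementary sketch, the example below exercises urllib.parse and urllib.robotparser without touching the network. The URLs, the crawler name `MyCrawler`, and the robots.txt rules are made up for illustration.

```python
from urllib.parse import urlparse, urljoin, quote, unquote
from urllib.robotparser import RobotFileParser

# URL parsing: split a URL into its components.
parts = urlparse("http://www.example.com/search?word=hello#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against the page it was found on.
print(urljoin("http://www.example.com/docs/index.html", "../img/logo.png"))

# URL quoting: percent-encode non-ASCII text, then decode it again.
encoded = quote("爬虫")
print(encoded, unquote(encoded))

# robots.txt: feed the rules in as lines instead of downloading them.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rules.can_fetch("MyCrawler", "http://www.example.com/private/a.html"))
print(rules.can_fetch("MyCrawler", "http://www.example.com/public/a.html"))
```

Against a live site, `rules.set_url(...)` followed by `rules.read()` downloads the real robots.txt instead.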
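2. The urllib3 module (must be installed: pip install urllib3)
import urllib3

http = urllib3.PoolManager()  # connection pool reused across requests
response1 = http.request("GET", "http://www.baidu.com")
print(response1.status)  # HTTP status code
print(response1.data)    # response body as bytes

response2 = http.request("POST", "http://httpbin.org/post", fields={'word': 'hello'})
print(response2.data)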
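3. The requests module (must be installed: pip install requests)

Making HTTP requests with this module is much simpler than with urllib, and its API is more user-friendly.
import requests

response = requests.get('http://www.baidu.com')  # GET request
print(response.status_code)  # status code
print(response.url)          # final URL of the response
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the server
print(response.text)         # page source as decoded text
print(response.content)      # page source as raw bytes

data = {'word': 'hello'}
response = requests.post('http://httpbin.org/post', data=data)  # POST request
print(response.content)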
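Sending browser-like request headers: when a request for a page returns a 403 error, the usual cause is an anti-crawler measure that the site uses to block automated scraping. Sending the headers a real browser would send, in particular User-Agent, often gets past this kind of check.
import requests

url = 'https://www.baidu.com'
# User-Agent string copied from a desktop Chrome browser
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
response2 = requests.get(url, headers=headers)
print(response2.content)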
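
The same idea also works with the standard library alone. The sketch below attaches a browser-like User-Agent to a urllib.request.Request object and inspects it without sending anything; the target URL is a placeholder chosen for this example.

```python
from urllib.request import Request

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")

# Build the request with a custom header; nothing is sent yet.
req = Request("https://www.example.com", headers={"User-Agent": BROWSER_UA})

# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` sends the request with that header.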
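Handling network timeouts: if a page does not respond for a long time, the request is treated as timed out and the page cannot be opened. Setting a timeout means that when the crawler hits a page that cannot be downloaded, it does not hang there indefinitely and can go on to crawl other pages.
import requests

for a in range(1, 50):
    try:
        # timeout=0.5: raise an exception if the server does not respond within 0.5 seconds
        response = requests.get('https://www.baidu.com', timeout=0.5)
        print(response.status_code)
    except Exception as e:
        print('Error: ' + str(e))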
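
The point of the timeout is that one bad page should not stop the whole crawl. A minimal sketch of that pattern is below: it loops over a list of URLs, records failures, and keeps going. The download step is passed in as a `fetch` function so the example runs offline; `fetch_all`, `fake_fetch`, and the URLs are hypothetical names for this illustration.

```python
def fetch_all(urls, fetch):
    """Try every URL; record failures instead of stopping at the first one."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:      # e.g. a timeout raised by the HTTP client
            failures[url] = str(exc)  # remember why it failed and continue
    return results, failures


def fake_fetch(url):
    """Stand-in for a real download: 'slow' URLs time out, the rest succeed."""
    if "slow" in url:
        raise TimeoutError("timed out")
    return "<html>ok</html>"


results, failures = fetch_all(
    ["http://example.com/a", "http://example.com/slow", "http://example.com/b"],
    fake_fetch,
)
print(sorted(results))
print(failures)
```

With requests, `fetch` could be something like `lambda url: requests.get(url, timeout=0.5).text`.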
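The requests module has three commonly used network exception classes
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

for a in range(1, 50):
    try:
        response = requests.get('https://www.baidu.com', timeout=0.5)  # 0.5-second timeout
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        print(response.status_code)
    except ReadTimeout:       # the server did not send data in time
        print('timeout')
    except HTTPError:         # only raised via raise_for_status()
        print('http error')
    except RequestException:  # base class for all other requests errors
        print('req error')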
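Proxy servers: crawling a site too frequently may lead its server to block our IP address; routing requests through a proxy is one way around this kind of anti-crawler measure.
import requests

# Proxy address and port (example values; replace with a proxy you have access to)
proxy = {'http': 'http://122.114.31.177:888', 'https': 'http://122.114.31.177:888'}
response = requests.get('http://www.mingrisoft.com', proxies=proxy)
print(response.content)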
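
A single proxy can itself get blocked, so crawlers often rotate through several. The sketch below only shows the rotation logic and sends no requests; the addresses in `PROXY_POOL` are placeholders from a reserved documentation IP range, and `proxy_dicts` is a name invented for this example.

```python
from itertools import cycle

# Placeholder proxy addresses (documentation IP range), not real proxies.
PROXY_POOL = ["192.0.2.10:8080", "192.0.2.11:8080"]


def proxy_dicts(pool):
    """Yield requests-style proxy dictionaries, cycling through the pool forever."""
    for address in cycle(pool):
        yield {"http": f"http://{address}", "https": f"http://{address}"}


rotation = proxy_dicts(PROXY_POOL)
first = next(rotation)
second = next(rotation)
third = next(rotation)   # the pool has two entries, so this wraps around
print(first, second, third, sep="\n")
```

Each request would then use `requests.get(url, proxies=next(rotation))`.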
(2) Beautiful Soup is a Python library whose main purpose is extracting data from web pages

https://blog.csdn.net/qq_21933615/article/details/81171951

(3) The Scrapy crawler framework

https://blog.csdn.net/luanpeng825485697/article/details/78439210
