Web Crawler Basics

Article link: https://blog.csdn.net/fhiceng/article/details/135485508

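I. Crawler Types

General-purpose crawler: aims to cover as much of the web as possible.

Focused crawler: automatically downloads pages related to one specific topic.

Incremental crawler: keeps already-crawled content up to date by re-fetching it when it changes.

Deep web crawler: deep web pages include, for example, pages that are only shown after a user logs in. The usual approach is to obtain the session cookies and send them with each request.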
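As a minimal sketch of the cookie approach, the snippet below attaches previously obtained cookies to a request using the standard library. The URL, cookie names, and cookie values are hypothetical placeholders, and the request is only built here, not sent.

```python
import urllib.request

def build_logged_in_request(url, cookies):
    """Build a GET request that carries the given cookies (a dict)."""
    # Serialize the cookies into a single Cookie header: "k1=v1; k2=v2".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})

# Hypothetical values: in practice the cookies are copied from a logged-in
# browser session or returned by the site's login endpoint.
req = build_logged_in_request(
    "http://example.com/member/profile",
    {"sessionid": "abc123", "lang": "zh-CN"},
)
print(req.get_header("Cookie"))  # sessionid=abc123; lang=zh-CN
```

Passing `req` to `urllib.request.urlopen` would then send the request with the cookies attached.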

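II. How a Crawler Works

1. Structure

Controller: assigns work to the crawler threads. It allocates and starts threads based on the URLs passed back by the system.

Parser: downloads and processes web pages, for example filtering, extracting specific HTML tags, and analyzing data.

Resource repository: stores the downloaded resources and automatically builds an index; usually a database.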
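To make the parser's tag-extraction role concrete, here is a small sketch that pulls the links out of a page. It uses the standard library's `html.parser` instead of a third-party parser, and the class name `LinkExtractor`, the helper `extract_links`, and the sample HTML are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links

sample = '<html><body><a href="/a.html">A</a> <a href="http://example.com/b">B</a></body></html>'
print(extract_links(sample, "http://example.com/index.html"))
# ['http://example.com/a.html', 'http://example.com/b']
```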

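2. Workflow

(1) Put the seed URLs (relatively important pages with a high out-degree) into the to-crawl queue;

(2) Take a URL from the to-crawl queue, resolve its host name via DNS to get the IP address, download and store the page, then move the URL to the crawled queue;

(3) Parse the pages behind the URLs in the crawled queue, extract the URLs they contain, and add those to the to-crawl queue; then repeat from step (2).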
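The three steps above can be sketched as a loop over two collections. To keep the example runnable offline, the network is replaced by a hypothetical in-memory dictionary `FAKE_WEB`, and `fetch_links` stands in for DNS resolution, download, storage, and link extraction; the `seen` set is an added assumption that prevents a URL from being queued twice.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on its page.
FAKE_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}

def fetch_links(url):
    """Stand-in for DNS resolution, download, storage, and link extraction."""
    return FAKE_WEB.get(url, [])

def crawl(seed_urls):
    to_crawl = deque(seed_urls)   # step (1): the to-crawl queue
    crawled = []                  # the crawled queue, in download order
    seen = set(seed_urls)         # every URL ever queued, to avoid repeats
    while to_crawl:
        url = to_crawl.popleft()
        links = fetch_links(url)  # step (2): download the page
        crawled.append(url)
        for link in links:        # step (3): queue newly found URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled

print(crawl(["seed"]))  # ['seed', 'a', 'b', 'c']
```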

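III. Python Crawler Architecture

Scheduler: coordinates the other three components (URL manager, page downloader, page parser).

URL manager: holds the to-crawl queue and the crawled queue; it can be implemented in memory, in a database, or in a cache database.

Page downloader: fetches a page and returns it as a string.

Page parser: parses that string; many parsing methods are available.

Application: the application built from the useful data that was extracted.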
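A minimal in-memory URL manager might look like the sketch below. The class and method names are illustrative assumptions, and Python sets are used instead of ordered queues, so the order in which URLs are handed out is not defined.

```python
class UrlManager:
    """In-memory URL manager with a to-crawl set and a crawled set."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already handed out for crawling

    def add_url(self, url):
        # Ignore empty URLs and URLs already known in either collection.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_url(self):
        # Move one URL from "to crawl" to "crawled" and hand it out.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

manager = UrlManager()
manager.add_url("http://example.com/")
manager.add_url("http://example.com/")   # duplicate, ignored
url = manager.get_url()
manager.add_url(url)                      # already crawled, ignored
print(url, manager.has_new_url())         # http://example.com/ False
```

A scheduler would call `get_url`, pass the result to the downloader and parser, and feed the extracted links back through `add_url` until `has_new_url` returns False.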

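1. Page Categories

Downloaded, unexpired pages

Downloaded, expired pages: the live content on the internet has changed since the download

Pages to download: pages in the to-crawl queue

Knowable pages: pages that can, in theory, be reached by following URLs

Unknowable pages: pages the crawler cannot reach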

2. Crawl Strategies

Depth-first

Breadth-first

(To be expanded later.)
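As a rough illustration of the two strategies, the sketch below visits a small, hypothetical link graph (`LINKS`) in both orders. The only difference is whether the frontier is treated as a stack (depth-first) or as a queue (breadth-first); this is a simplified model, not a full crawler.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def crawl_order(seed, depth_first):
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # Depth-first takes the newest URL (stack); breadth-first the oldest (queue).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        children = LINKS[page]
        # Reverse for depth-first so the first link on the page is followed first.
        for child in (reversed(children) if depth_first else children):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order

print(crawl_order("A", depth_first=False))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(crawl_order("A", depth_first=True))   # ['A', 'B', 'D', 'E', 'C', 'F']
```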

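3. Page Update Strategy

Web pages change over time, so the crawler has to decide when to refresh pages it has already downloaded and how many of them to refresh.

(To be expanded later.)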
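One simple way to answer both questions is an age-based policy: treat a page as expired once it is older than a fixed limit, and refresh only a limited batch per round, oldest first. The sketch below is an assumed policy for illustration only; the function name, parameters, and timestamps are hypothetical.

```python
def pages_to_refresh(last_fetched, now, max_age, batch_size):
    """Return up to batch_size expired URLs, oldest first.

    last_fetched maps URL -> timestamp (seconds) of the last download.
    """
    # "When": a page is expired if it was downloaded more than max_age ago.
    expired = [url for url, ts in last_fetched.items() if now - ts > max_age]
    expired.sort(key=lambda url: last_fetched[url])
    # "How many": cap the number of pages refreshed in one round.
    return expired[:batch_size]

# Hypothetical timestamps, in seconds.
last_fetched = {"/news": 0, "/about": 900, "/weather": 300}
print(pages_to_refresh(last_fetched, now=1000, max_age=600, batch_size=1))
# ['/news']
```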

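IV. Proxies

(1) Bypass access restrictions placed on your own IP

(2) Access internal resources

(3) Speed up access: a proxy server usually has a large disk cache

(4) Hide your real IP

In crawlers, proxies are mainly used to keep the crawler's IP from being banned.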
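A minimal sketch of routing requests through a proxy chosen at random from a pool, using the standard library. The proxy addresses in `PROXY_POOL` and the helper name `build_proxy_opener` are hypothetical; no request is actually sent here.

```python
import random
import urllib.request

# Hypothetical proxy addresses; replace with proxies you are allowed to use.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

proxy = random.choice(PROXY_POOL)   # pick a proxy at random for this request
opener = build_proxy_opener(proxy)
# opener.open(url, timeout=10) would now fetch url via the chosen proxy.
```

With the `requests` library, the same dictionary can be passed through the `proxies` argument, for example `requests.get(url, proxies={'http': proxy, 'https': proxy})`.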

V. Crawler Example

Scrape the weather forecast for a region from the China Weather site and save it to a CSV file.

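#coding:utf-8
import requests #fetches the page's HTML source
import csv #writes the CSV file
import random
import time
import socket
import http.client #socket and http.client are used for exception handling
import urllib.request #can also fetch HTML source
from bs4 import BeautifulSoup #extracts the content of the relevant tags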

def get_content(url,data=None):
    header={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;',
            'Accept-Encoding':'gzip, deflate',
            'Accept-Language':'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
            'Connection':'keep-alive',
            #'Cookie':'Hm_lvt_c758855eca53e5d78186936566552a13 = 1704799336;_trs_uv = lr69iohs_6252_68my;_trs_ua_s_1 = lr69iohs_6252_hqn2;Hm_lpvt_c758855eca53e5d78186936566552a13 = 1704799356',
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
            }
    timeout=random.choice(range(80,180))
    while True:
        try:
            rep=requests.get(url,headers=header,timeout=timeout) #fetch the page source
            rep.encoding='utf-8' #alternative: req=urllib.request.Request(url,data,header)
            break
        except (socket.timeout,requests.exceptions.Timeout) as e: #requests raises its own Timeout
            print('3:',e)
            time.sleep(random.choice(range(8,15)))
        except socket.error as e:
            print('4:',e)
            time.sleep(random.choice(range(20,60)))
        except http.client.BadStatusLine as e:
            print('5:',e)
            time.sleep(random.choice(range(30,80)))
        except http.client.IncompleteRead as e:
            print('6:',e)
            time.sleep(random.choice(range(5,15)))
    return rep.text

def get_data(html_text):
    final=[] #one row per day
    bs=BeautifulSoup(html_text,"html.parser")
    body=bs.body
    data=body.find('div',{'class':'weather-current-wrap'}) #the <div> that wraps the forecast entries

    a=data.find_all('a') #each <a> is treated as one day's entry
    for day in a:
        temp=[]
        date=day.find('p').string #first <p>: the date
        temp.append(date)
        inf=day.find_all('p')[1:] #remaining <p> tags
        temp.append(inf[0].string)
        temperature=inf[1] #third <p>: the temperature
        temp.append(temperature.string)
        final.append(temp)
    return final

def write_data(data,name):
    file_name=name
    #append mode; newline='' lets the csv module control line endings
    with open(file_name,'a',encoding='utf-8',errors='ignore',newline='') as f:
        f_csv=csv.writer(f)
        f_csv.writerows(data)

if __name__=='__main__':
    url='http://pc.weathercn.com/weather/week/58064/?partner=&p_source=&p_type=jump'
    html=get_content(url)
    result=get_data(html)
    write_data(result,'sliver_weather.csv')